ITPro™ Series Books

A Jump Start to SQL Server BI
Don Awalt
Larry Barnes
Alexei Bocharov
Herts Chen
Rick Dobson
Rob Ericsson
Kirk Haselden
Brian Lawton
Jesper Lind
Tim Ramey
Paul Sanders
Mark D. Scott
David Walls
Russ Whitney
Contents

Section I: Essential BI Concepts

Chapter 1: Data Warehousing: Back to Basics
By Don Awalt, Brian Lawton
    Common Terms
    Establishing a Vision
    Defining Scope
    The Essence of Warehousing
    The Rest Is Up to You

Chapter 2: 7 Steps to Data Warehousing
By Mark D. Scott, David Walls
    Step 1: Determine Business Objectives
    Step 2: Collect and Analyze Information
    Step 3: Identify Core Business Processes
    Step 4: Construct a Conceptual Data Model
    Step 5: Locate Data Sources and Plan Data Transformations
    Step 6: Set Tracking Duration
    Step 7: Implement the Plan

Chapter 3: The Art of Cube Design
By Russ Whitney, Tim Ramey
    Designing a Sales-Forecasting Cube
    Providing Valid Data

Chapter 4: DTS 2000 in Action
By Larry Barnes
    Introducing the Create FoodMart 2000 Package
    Initializing Global Variables and the Package State
    Preparing the Existing Environment
    Creating the FoodMart Database and Tables
    Change Is Good

Chapter 5: Rock-Solid MDX
By Russ Whitney

Chapter 6: XML for Analysis: Marrying OLAP and Web Services
By Rob Ericsson
    Installing XMLA
    Using XMLA: Discover and Execute
    Getting Results
    A Convenient Marriage

Chapter 7: Improving Analysis Services Query Performance
By Herts Chen
    Traffic-Accident Data Warehouse
    Queries and Bottlenecks
    Usage-Based Partitioning
    Partition Testing
    Guidelines for Partitioning

Chapter 8: Reporting Services 101
By Rick Dobson
    Installing Reporting Services
    Creating Your First Report
    Creating a Drilldown Report
    Deploying a Solution
    Viewing Deployed Solution Items
    Beyond the Basics

Section II: BI Tips and Techniques
    Improve Performance at the Aggregation Level
    Using Children to Automatically Update Products
    Saving DTS Information to a Repository
    Intelligent Business
    Techniques for Creating Custom Aggregations
    Using Loaded Measures to Customize Aggregations
    Caution: Large Dimensions Ahead
    Decoding MDX Secrets
    Improve Cube Processing by Creating a Time Dimension Table
    Transforming Data with DTS
    Supporting Disconnected Users
    Dependency Risk Analysis
    Choosing the Right Client for the Task
    Using Access as a Data Source
    Calculating Utilization
    Use Member Properties Judiciously
    Get Level Names Right from the Get-Go
    Aggregating a Selected Group of Members
    Determining the Percentage of a Product's Contribution
    Avoid Crippled Client Software
    Setting OLAP Cube Aggregation Options
    Use Views as the Data Source
    Enter Count Estimates
    Using Dynamic Properties to Stabilize DTS
    Leave Snowflakes Alone
    Create Grouping Levels Manually
    Understand the Role of MDX
    Using NON EMPTY to Include Empty Cells
    Formatting Financial Reports
    Analyzing Store Revenue
    Use Counts to Analyze Textual Information
    Consolidation Analysis
    Working with Analysis Services Programmatically
    Filtering on Member Properties in SQL Server 7.0
    Improving Query Performance
    Using SQL ALIAS for the AS/400
    Setting Up English Query
    When Do You Use Web Services?
    The Security Connection

Section III: New BI Features in SQL Server 2005

Chapter 1: Building Better BI in SQL Server 2005
    How are SQL Server 2005's BI enhancements meeting Microsoft's goals for serving the BI community? And how long has your team been working on these enhancements?
    What kind of feedback have you been getting from beta testers, and which features are they most enthusiastic about?
    According to news reports, Microsoft and some large customers have deployed SQL Server 2005 Beta 2 in production environments. What is your recommendation for deploying Beta 2 and running it in production? What caveats do you have for businesses eager to move to the new version now?
    How compatible are SQL Server 2000's BI tools (OLAP, DTS, data mining) and SQL Server 2005's new BI tools? Because some of SQL Server 2005's BI tools—such as Integration Services—are completely rewritten, will they still work with SQL Server 2000 data and packages?
    SQL Server 2000 Analysis Services supports only clustering and decision-tree data-mining algorithms. Does SQL Server 2005 add support for other algorithms?
    Microsoft relies on an integrated technology stack—from OS to database to user interface. How does that integration help Microsoft's BI offerings better serve your customers' needs?
    SQL Server 2005 will be the first release in which database tools converge with Visual Studio development tools. Can you tell us what it took to align these two releases and what benefits customers will realize from the change?
    The introduction of the UDM is said to blur the line between relational and multidimensional database architectures. This approach is new for the Microsoft BI platform. What are the most interesting features the UDM offers? And based on your experience, what features do you think will surface as the most valuable for customers and ISVs?
    What tools will Microsoft add to the Visual Studio 2005 IDE to help developers create and manage SQL Server (and other database platforms') users, groups, and permissions to better insulate private data from those who shouldn't have access?
    In one of your past conference keynote addresses, you mentioned that Microsoft is adding a new set of controls to Visual Studio 2005 to permit reporting without Reporting Services. Could you describe what those controls will do, when we'll see the controls appear in Visual Studio 2005, and where you expect them to be documented?
    What benefit does 64-bit bring to SQL Server BI, and do you think 64-bit can really help the Microsoft BI platform scale to the levels that UNIX-based BI platforms scale to today?
    Who are some BI vendors you're working closely with to develop 64-bit BI computing?
    Did you leave out any BI features that you planned to add to SQL Server 2005 because of deadlines or other issues?
    Your team puts a lot of long hours into your work on SQL Server BI. What drives you and your BI developers to invest so much personally in the product?

Chapter 2: UDM: The Best of Both Worlds
By Paul Sanders
    The UDM Architecture
    One Model for Reporting and Analysis

Chapter 3: Data Mining Reloaded
By Alexei Bocharov, Jesper Lind
    Mining the Data
    Prediction and Mutual Prediction
    Decision Trees
    Time Series
    Clustering and Sequence Clustering
    Naive Bayes Models and Neural Networks
    Association Rules
    Third-Party Algorithms (Plug-Ins)
    Dig In

Chapter 4: What's New in DTS
By Kirk Haselden
    SQL Server 2005 DTS Design Goals
    Redesigning the Designer
    Migration Pain
    Fresh Faces, SDK, and Other Support
Section I: Essential BI Concepts
Chapter 1:
Data Warehousing: Back to Basics
By Don Awalt, Brian Lawton
So, you’re about to undertake your first data-warehousing project. Where will you begin? Or maybe
you’re already implementing a warehouse, but the project is going awry and you’re trying to get it
back on track. What do you need to know to make it successful? Let's step back from the implementation details and examine some analysis and design roadblocks you need to overcome on the way
to a successful data warehouse deployment. Along the way, we'll review the common terminology
you need to understand and discuss some challenges you'll face. Following these
guidelines can boost your chances of success.
Common Terms
First, let’s define the crucial pieces of the project: a data warehouse, a data mart, and data
warehousing. Although they’re often used interchangeably, each has a distinct meaning and impact
on the project. A data warehouse is the cohesive data model that defines the central data repository
for an organization. An important point is that we don’t define a warehouse in terms of the number
of databases. Instead, we consider it a complete, integrated data model of the enterprise, regardless of
how or where the data is stored.
A data mart is a repository containing data specific to a particular business group in an
enterprise. All data in a data mart derives from the data warehouse, and all data relates directly to the
enterprisewide data model. Often, data marts contain summarized or aggregated data that the user
community can easily consume. Another way to differentiate a data warehouse from a data mart is to
look at the data’s consumers and format. IT analysts and canned reporting utilities consume warehouse data, whose storage is usually coded and cryptic. The user community consumes data mart
data, whose storage is usually in a more readable format. For example, to reduce the need for
complex queries and assist business users who might be uncomfortable with the SQL language, data
tables could contain the denormalized code table values.
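To make that difference concrete, here is a minimal T-SQL sketch; the table and column names are hypothetical, not drawn from any system described in this chapter. The warehouse table stores compact codes that a code table resolves, while the data mart table carries the denormalized, readable values that business users can query directly.

-- Warehouse table: compact, coded storage aimed at analysts and canned reports
CREATE TABLE DW_PatientVisit (
  VisitID     INT      NOT NULL PRIMARY KEY,
  FacilityCd  CHAR(4)  NOT NULL,   -- e.g., 'DTCL'; resolved through a code table
  VisitTypeCd CHAR(2)  NOT NULL,   -- e.g., 'OP'
  VisitDate   DATETIME NOT NULL
)

-- Data mart table: the same facts with code values denormalized for business users
CREATE TABLE Mart_PatientVisit (
  VisitID      INT         NOT NULL PRIMARY KEY,
  FacilityName VARCHAR(50) NOT NULL,  -- 'Downtown Clinic'
  VisitType    VARCHAR(30) NOT NULL,  -- 'Outpatient'
  VisitDate    DATETIME    NOT NULL
)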
Finally, data warehousing is the process of managing the data warehouse and data marts. This
process includes all the ongoing support needs of the refresh cycle, database maintenance, and
continual refinements to the underlying data model.
One important aspect of developing a warehouse is having a data dictionary that both the project
team and the user community use to derive definitions and understand data elements within the
warehouse. This statement seems simple, but when you’re pulling data together from many source
systems, a common implementation problem (which people usually don’t identify until after
deployment) involves reconciling similarly named data elements that come from different systems and
have subtle differences in meaning. An example of this problem in the health care community is the
attribute attending physician. One system, which tracks daily patient activity, defines this term as the
physician currently responsible for the patient’s care. At the same facility, a system that focuses on
patient billing defines it as the physician most affiliated with the visit. Both definitions are correct in
their contexts, but the difference illustrates a challenge in trying to combine the two systems’ data.
The health care example illustrates a symptom of what we consider the biggest challenge in a
data-warehousing project: bringing together the right people from the user and development
communities to create a project team. The right people have the business knowledge to explain how
the data is used, the systems knowledge to extract it, and the analytical and design skills to bring it
together into a warehouse. The difference between other projects and building a warehouse is that
individual projects usually focus on one business area, whereas building a data warehouse focuses on
combining the data and subsequent knowledge from many projects. The team must have the depth
and breadth to cover all the systems involved.
Establishing a Vision
Now that we’ve identified the largest risk area, let’s look at some steps you can take to minimize the
risk. To put together the right project team, first define the project’s vision and begin to establish its
scope. After you do so, you’ll see more clearly which users and IT staff members you need to
involve.
The vision’s purpose is to define the project’s ultimate mission from a business perspective. In
theory, all work on the project directly or indirectly supports the objectives outlined in the vision.
Defining a clear, tangible mission for the project is crucial. When articulated properly, the vision
defines relative priorities for the team—schedule, features, and cost. You use it to resolve requirement
and implementation decisions throughout the development lifecycle: Tailor your decisions to support
the mission and priorities of the vision; omit or defer others to later iterations. The vision creates a
theme for the project that serves the entire project development cycle. At the highest level, you
require all project activities to achieve the vision’s objectives.
For example, let’s look at a growing health care organization in which each facility maintains a
separate information system. The vision for its warehousing project might be to provide the ability to
review, analyze, and track historical data across all facilities in an appropriate and meaningful context.
This vision describes an objective that implementing a data warehouse can accomplish.
Defining Scope
After you’ve established the project’s vision, you can set its scope. Next to fielding the wrong team,
the inability to define the right scope puts a project at most risk for failure. Scope refers to the
potential size of the undertaking—what will be delivered successfully in a meaningful time frame.
Often a warehousing project tries to deliver too much, which can result in the project falling
dramatically behind schedule or even being canceled. The other extreme, building stovepipes,
happens when an organization decides to use many small databases to focus on discrete business
areas. Although these combined databases might look like a data warehouse, they're really data-access enhancements (or reporting enhancements) to the operational systems. This implementation
isn’t a true data warehouse because stovepipes are independent units with no cohesive data model
tying them together. In the context of data warehousing, stovepipes achieve no enterprise-level
business objectives.
Understanding the definitions we gave earlier is important for arriving at the right scope for the
project. Although, by definition, the data warehouse takes into consideration the entire business, you
don’t need to implement it all at once. When you focus on individual business units within the
overall model, design and development proceed iteratively, and you implement one or two areas at a
time. Iterative development results in a faster return on investment when you prioritize the business
area development, rather than waiting to roll out one massive warehouse at the end. From a scope
perspective, you control the size, timing, and cost of each iteration without compromising the
integrity of the overall data warehouse.
An often-overlooked aspect of the project is building the infrastructure to sustain the process of
data warehousing. All too many warehousing projects break down after deployment because people
fail to recognize the ongoing support costs (resources, time, and coordination) of refreshing the data.
You might have designed the world’s best data model and implemented a great database, but if users
don’t receive data in a reliable and timely manner, they’ll consider the project a failure. Depending
on the volatility of the source system data, warehouse data can quickly go stale. To determine the
warehouse’s refresh intervals, you must have project requirements that identify the rate of change in
the source system data and how often the user community needs to see those changes reflected.
Our experience shows that building the appropriate infrastructure to support the data
warehousing aspect of the project is as important as designing the data model. So factor the ongoing
support needs and the corresponding infrastructure development costs (e.g., to sustain the timely
refresh of the data) into the project’s scope.
The Essence of Warehousing
So far, we’ve focused on some of the project-planning issues and high-level design considerations
involved in building a warehouse. Now it’s time to examine the essence of data warehousing: data
acquisition, data transformation, and data presentation. These areas constitute the ongoing process of
data warehousing and require a full understanding to avoid data refresh problems.
Data acquisition is the task of bringing data from everywhere to the data warehouse. Most
businesses have several operational systems that handle the organization’s day-to-day processing.
These systems serve as the data source for the warehouse. The systems might reside on a mainframe,
in a client/server database application, in a third-party application with a proprietary data store, within
desktop applications such as spreadsheets and database applications, or any combination of these.
The challenge is to identify the data sources and develop a solution for extracting and delivering the
data to the warehouse in a timely, scheduled manner.
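The rest of this book uses DTS for that extraction work, but as a quick, hedged illustration of pulling rows from a desktop source into a SQL Server staging table, a T-SQL distributed query can reach an Access file directly; the file path, source table, and staging table names here are assumptions made for the example.

-- Pull rows from a departmental Access file into a SQL Server staging table
SELECT *
INTO   StagingOrders
FROM   OPENROWSET('Microsoft.Jet.OLEDB.4.0',
                  'C:\Data\SalesTracking.mdb'; 'Admin'; '',
                  Orders)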
After collecting the data, you need to transform it. In an ideal organization, all systems would
use the same set of codes and definitions for all data elements. In the real world, as we showed
earlier, different codes and definitions exist for what appear to be the same data element. Data
transformation is the cleansing and validation of data for accuracy, and ensuring that all values
conform to a standard definition. After these data transformation tasks are complete, you can add the
data to the warehouse.
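As a small, hedged example of what conforming to a standard definition can look like in practice, the following T-SQL maps several source-specific gender codes to one standard value during the load; the staging table and code values are illustrative assumptions, not taken from the chapter.

-- Conform source-specific codes to one standard value before loading the warehouse
UPDATE StagingPatient
SET GenderCode =
  CASE UPPER(LTRIM(RTRIM(GenderCode)))
    WHEN 'M'      THEN 'M'
    WHEN 'MALE'   THEN 'M'
    WHEN '1'      THEN 'M'
    WHEN 'F'      THEN 'F'
    WHEN 'FEMALE' THEN 'F'
    WHEN '2'      THEN 'F'
    ELSE 'U'      -- unknown: flag the row for manual review
  END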
Finally, you’re ready for data presentation. At this point, the warehouse contains a large,
normalized data store containing all (or part) of the organization’s data. Great! Unfortunately, the
users who need this data can’t make sense of it because of its cryptic coding schemes and
normalized storage. Data presentation involves taking the data from the data warehouse and getting it
into the hands of users in a usable, easy-to-understand format. One way to present the data is to
deploy a data mart containing summarized, aggregated data. Or you can put an OLAP engine
between the warehouse and the user. Another option is to custom-build a reporting tool or deploy
third-party solutions. Identify the most effective way to present the data, and implement it.
In completing these tasks, keep in mind that the data that users receive needs to be consistent,
accurate, and timely. Failure to ensure quality data delivery could jeopardize the project’s success
because users won’t work with inaccurate or old data. One way to minimize the risks of bad data is
to involve users in the cleansing, validation, and transformation steps of the data transformation task.
The more input and familiarity users have with the data validations and transformations, the more
confident they’ll be about the accuracy of the resulting warehouse data. Also, emphasize to the users
the importance of their input into the data validation process. Explain to them that their experience
and knowledge make them a necessary part of the project team and ensure the data’s validity and
integrity.
The Rest Is Up to You
So as you embark on your data warehousing adventure, remember these basic ideas. Carefully define
the project’s vision and the scope of the first iteration. Inform and involve your users. Know and
understand the three major tasks of implementation—data acquisition, data transformation, and data
presentation. Finally, during design always keep in mind the consistency, accuracy, and timeliness of
the ongoing data delivery. Although we can’t guarantee that your warehousing project won’t fail,
following the basics discussed here will give you a much better chance of success.
Chapter 2:
7 Steps to Data Warehousing
By Mark D. Scott, David Walls
Data warehousing is a business analyst’s dream—all the information about the organization’s activities
gathered in one place, open to a single set of analytical tools. But how do you make the dream a
reality? First, you have to plan your data warehouse system. You must understand what questions
users will ask it (e.g., how many registrations did the company receive in each quarter, or what
industries are purchasing custom software development in the Northeast) because the purpose of a
data warehouse system is to provide decision-makers the accurate, timely information they need to
make the right choices.
To illustrate the process, we’ll use a data warehouse we designed for a custom software
development, consulting, staffing, and training company. The company’s market is rapidly changing,
and its leaders need to know what adjustments in their business model and sales practices will help
the company continue to grow. To assist the company, we worked with the senior management staff
to design a solution. First, we determined the business objectives for the system. Then we collected
and analyzed information about the enterprise. We identified the core business processes that the
company needed to track, and constructed a conceptual model of the data. Then we located the data
sources and planned data transformations. Finally, we set the tracking duration.
Step 1: Determine Business Objectives
The company is in a phase of rapid growth and will need the proper mix of administrative, sales,
production, and support personnel. Key decision-makers want to know whether increasing overhead
staffing is returning value to the organization. As the company enhances the sales force and employs
different sales modes, the leaders need to know whether these modes are effective. External market
forces are changing the balance between a national and regional focus, and the leaders need to
understand this change’s effects on the business.
To answer the decision-makers’ questions, we needed to understand what defines success for this
business. The owner, the president, and four key managers oversee the company. These managers
oversee profit centers and are responsible for making their areas successful. They also share
resources, contacts, sales opportunities, and personnel. The managers examine different factors to
measure the health and growth of their segments. Gross profit interests everyone in the group, but to
make decisions about what generates that profit, the system must correlate more details. For instance,
a small contract requires almost the same amount of administrative overhead as a large contract.
Thus, many smaller contracts generate revenue at less profit than a few large contracts. Tracking
contract size becomes important for identifying the factors that lead to larger contracts.
As we worked with the management team, we learned the quantitative measurements of
business activity that decision-makers use to guide the organization. These measurements are the key
performance indicators, a numeric measure of the company’s activities, such as units sold, gross
profit, net profit, hours spent, students taught, and repeat student registrations. We collected the key
performance indicators into a table called a fact table.
Step 2: Collect and Analyze Information
The only way to gather this performance information is to ask questions. The leaders have sources of
information they use to make decisions. Start with these data sources. Many are simple. You can get
reports from the accounting package, the customer relationship management (CRM) application, the
time reporting system, etc. You’ll need copies of all these reports and you’ll need to know where
they come from.
Often, analysts, supervisors, administrative assistants, and others create analytical and summary
reports. These reports can be simple correlations of existing reports, or they can include information
that people overlook with the existing software or information stored in spreadsheets and memos.
Such overlooked information can include logs of telephone calls someone keeps by hand, a small
desktop database that tracks shipping dates, or a daily report a supervisor emails to a manager. A big
challenge for data warehouse designers is finding ways to collect this information. People often write
off this type of serendipitous information as unimportant or inaccurate. But remember that nothing
develops without a reason. Before you disregard any source of information, you need to understand
why it exists.
Another part of this collection and analysis phase is understanding how people gather and
process the information. A data warehouse can automate many reporting tasks, but you can’t
automate what you haven’t identified and don’t understand. The process requires extensive
interaction with the individuals involved. Listen carefully and repeat back what you think you heard.
You need to clearly understand the process and its reason for existence. Then you’re ready to begin
designing the warehouse.
Step 3: Identify Core Business Processes
By this point, you must have a clear idea of what business processes you need to correlate. You’ve
identified the key performance indicators, such as unit sales, units produced, and gross revenue. Now
you need to identify the entities that interrelate to create the key performance indicators. For instance,
at our example company, creating a training sale involves many people and business factors. The
customer might not have a relationship with the company. The client might have to travel to attend
classes or might need a trainer for an on-site class. New products such as Windows 2000
(Win2K) might be released often, prompting the need for training. The company might run a
promotion or might hire a new salesperson.
The data warehouse is a collection of interrelated data structures. Each structure stores key
performance indicators for a specific business process and correlates those indicators to the factors
that generated them. To design a structure to track a business process, you need to identify the
entities that work together to create the key performance indicator. Each key performance indicator is
related to the entities that generated it. This relationship forms a dimensional model. If a salesperson
sells 60 units, the dimensional structure relates that fact to the salesperson, the customer, the product,
the sale date, etc.
Then you need to gather the key performance indicators into fact tables. You gather the entities
that generate the facts into dimension tables. To include a set of facts, you must relate them to the
dimensions (customers, salespeople, products, promotions, time, etc.) that created them. For the fact
table to work, the attributes in a row in the fact table must be different expressions of the same
event or condition. You can express training sales by number of seats, gross revenue, and hours of
instruction because these are different expressions of the same sale. An instructor taught one class in
a certain room on a certain date. If you need to break the fact down into individual students and
individual salespeople, however, you’d need to create another table because the detail level of the
fact table in this example doesn’t support individual students or salespeople. A data warehouse
consists of groups of fact tables, with each fact table concentrating on a specific subject. Fact tables
can share dimension tables (e.g., the same customer can buy products, generate shipping costs, and
return times). This sharing lets you relate the facts of one fact table to another fact table. After the
data structures are processed as OLAP cubes, you can combine facts with related dimensions into
virtual cubes.
Step 4: Construct a Conceptual Data Model
After identifying the business processes, you can create a conceptual model of the data. You
determine the subjects that will be expressed as fact tables and the dimensions that will relate to the
facts. Clearly identify the key performance indicators for each business process, and decide the format
to store the facts in. Because the facts will ultimately be aggregated together to form OLAP cubes, the
data needs to be in a consistent unit of measure. The process might seem simple, but it isn’t. For
example, if the organization is international and stores monetary sums, you need to choose a
currency. Then you need to determine when you’ll convert other currencies to the chosen currency
and what rate of exchange you’ll use. You might even need to track currency-exchange rates as a
separate factor.
Now you need to relate the dimensions to the key performance indicators. Each row in the fact
table is generated by the interaction of specific entities. To add a fact, you need to populate all the
dimensions and correlate their activities. Many data systems, particularly older legacy data systems,
have incomplete data. You need to correct this deficiency before you can use the facts in the
warehouse. After making the corrections, you can construct the dimension and fact tables. The fact
table’s primary key is a composite key made from a foreign key of each of the dimension tables.
Data warehouse structures are difficult to populate and maintain, and they take a long time to
construct. Careful planning in the beginning can save you hours or days of restructuring.
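A minimal sketch of that structure, using hypothetical names based on the training-sale example and assuming the dimension tables already exist: each dimension key is a foreign key, and together the keys form the fact table's composite primary key, while the measure columns hold the key performance indicators.

CREATE TABLE FactTrainingSale (
  CustomerKey        INT NOT NULL REFERENCES DimCustomer (CustomerKey),
  SalespersonKey     INT NOT NULL REFERENCES DimSalesperson (SalespersonKey),
  CourseKey          INT NOT NULL REFERENCES DimCourse (CourseKey),
  DateKey            INT NOT NULL REFERENCES DimDate (DateKey),
  SeatsSold          INT NOT NULL,              -- key performance indicators:
  GrossRevenue       MONEY NOT NULL,            -- different expressions of the
  HoursOfInstruction DECIMAL(6,2) NOT NULL,     -- same training-sale event
  CONSTRAINT PK_FactTrainingSale
    PRIMARY KEY (CustomerKey, SalespersonKey, CourseKey, DateKey)
)

Consistent, summable measures like these are what later get aggregated into OLAP cubes.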
Step 5: Locate Data Sources and Plan Data Transformations
Now that you know what you need, you have to get it. You need to identify where the critical
information is and how to move it into the data warehouse structure. For example, most of our
example company’s data comes from three sources. The company has a custom in-house application
for tracking training sales. A CRM package tracks the sales-force activities, and a custom time-reporting system tracks employee time.
You need to move the data into a consolidated, consistent data structure. A difficult task is
correlating information between the in-house CRM and time-reporting databases. The systems don’t
share information such as employee numbers, customer numbers, or project numbers. In this phase
of the design, you need to plan how to reconcile data in the separate databases so that information
can be correlated as it is copied into the data warehouse tables.
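One way to plan that reconciliation, sketched here with illustrative names rather than the company's actual schemas, is a cross-reference table that maps each source system's identifier to a single warehouse key, so rows from any of the systems resolve to the same employee as they're copied in.

-- Cross-reference table: one warehouse key per employee, mapped to each source ID
CREATE TABLE EmployeeKeyMap (
  EmployeeKey        INT IDENTITY(1,1) PRIMARY KEY,
  CrmEmployeeID      VARCHAR(20) NULL,   -- identifier used by the CRM package
  TimeEmployeeID     VARCHAR(20) NULL,   -- identifier used by the time-reporting system
  SalesAppEmployeeID VARCHAR(20) NULL    -- identifier used by the in-house sales application
)

-- During the load, translate the source system's ID into the shared warehouse key
SELECT t.ProjectCode, t.HoursWorked, m.EmployeeKey
FROM   StagingTimeEntry AS t
JOIN   EmployeeKeyMap   AS m ON m.TimeEmployeeID = t.EmployeeID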
You’ll also need to scrub the data. In online transaction processing (OLTP) systems, data-entry
personnel often leave fields blank. The information missing from these fields, however, is often
crucial for providing an accurate data analysis. Make sure the source data is complete before you use
it. You can sometimes complete the information programmatically at the source. You can extract ZIP
codes from city and state data, or get special pricing considerations from another data source.
Sometimes, though, completion requires pulling files and entering missing data by hand. The cost
of fixing bad data can make the system cost-prohibitive, so you need to determine the most cost-effective means of correcting the data and then forecast those costs as part of the system cost. Make
corrections to the data at the source so that reports generated from the data warehouse agree with
any corresponding reports generated at the source.
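As a hedged sketch of completing missing information programmatically, the following T-SQL fills a blank state column from a city lookup table before the load; the table names are illustrative, and a real cleanup would also have to handle cities that map to more than one state.

-- Fill in missing states from a city lookup before loading the warehouse
UPDATE s
SET    State = z.State
FROM   StagingCustomer AS s
JOIN   CityStateLookup AS z ON z.City = s.City
WHERE  s.State IS NULL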
You’ll need to transform the data as you move it from one data structure to another. Some
transformations are simple mappings to database columns with different names. Some might involve
converting the data storage type. Some transformations are unit-of-measure conversions (pounds to
kilograms, centimeters to inches), and some are summarizations of data (e.g., how many total seats
sold in a class per company, rather than each student’s name). And some transformations require
complex programs that apply sophisticated algorithms to determine the values. So you need to select
the right tools (e.g., Data Transformation Services—DTS—running ActiveX scripts, or third-party tools)
to perform these transformations. Base your decision mainly on cost, including the cost of training or
hiring people to use the tools, and the cost of maintaining the tools.
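Two of those transformation styles are easy to show directly in T-SQL; the table and column names below are assumptions made for illustration. The first statement summarizes per-student registrations into seats sold per class per company, and the second performs a simple unit-of-measure conversion as the data is copied.

-- Summarization: total seats per class per company instead of one row per student
INSERT INTO FactClassSeats (ClassKey, CompanyKey, SeatsSold)
SELECT ClassKey, CompanyKey, COUNT(*)
FROM   StagingRegistration
GROUP BY ClassKey, CompanyKey

-- Unit-of-measure conversion applied during the copy (pounds to kilograms)
SELECT ShipmentID, ShipWeightLbs * 0.45359237 AS ShipWeightKg
FROM   StagingShipment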
You also need to plan when data movement will occur. While the system is accessing the data
sources, the performance of those databases will decline precipitously. Schedule the data extraction to
minimize its impact on system users (e.g., over a weekend).
Step 6: Set Tracking Duration
Data warehouse structures consume a large amount of storage space, so you need to determine how
to archive the data as time goes on. But because data warehouses track performance over time, the
data should be available virtually forever. So, how do you reconcile these goals?
The data warehouse is set to retain data at various levels of detail, or granularity. This granularity
must be consistent throughout one data structure, but different data structures with different grains
can be related through shared dimensions. As data ages, you can summarize and store it with less
detail in another structure. You could store the data at the day grain for the first 2 years, then move it
to another structure. The second structure might use a week grain to save space. Data might stay
there for another 3 to 5 years, then move to a third structure where the grain is monthly. By planning
these stages in advance, you can design analysis tools to work with the changing grains based on the
age of the data. Then if older historical data is imported, it can be transformed directly into the
proper format.
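A hedged sketch of that aging step, with hypothetical table names: day-grain rows older than two years are summarized into a week-grain structure and then removed from the detailed fact table.

-- Roll day-grain facts older than two years up to the week grain
INSERT INTO FactSalesWeekly (SalesYear, SalesWeek, ProductKey, UnitsSold, GrossRevenue)
SELECT DATEPART(yyyy, SaleDate),
       DATEPART(wk, SaleDate),
       ProductKey,
       SUM(UnitsSold),
       SUM(GrossRevenue)
FROM   FactSalesDaily
WHERE  SaleDate < DATEADD(yy, -2, GETDATE())
GROUP BY DATEPART(yyyy, SaleDate), DATEPART(wk, SaleDate), ProductKey

-- Remove the detail rows that have been summarized
DELETE FROM FactSalesDaily
WHERE  SaleDate < DATEADD(yy, -2, GETDATE())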
Step 7: Implement the Plan
After you’ve developed the plan, it provides a viable basis for estimating work and scheduling the
project. The scope of data warehouse projects is large, so phased delivery schedules are important for
keeping the project on track. We’ve found that an effective strategy is to plan the entire warehouse,
then implement a part as a data mart to demonstrate what the system is capable of doing. As you
complete the parts, they fit together like pieces of a jigsaw puzzle. Each new set of data structures
adds to the capabilities of the previous structures, bringing value to the system.
Data warehouse systems provide decision-makers consolidated, consistent historical data about
their organization’s activities. With careful planning, the system can provide vital information on how
factors interrelate to help or harm the organization. A solid plan can contain costs and make this
powerful tool a reality.
Chapter 3:
The Art of Cube Design
By Russ Whitney, Tim Ramey
Cube design is more an art than a science. Third-party applications provide many templates and
patterns to help a cube designer create cubes that are appropriate for different kinds of analysis
(e.g., sales or budgeting). But in the end, the cube design depends on business rules and constraints
specific to your organization. What quirks of your data keep you up at night? In dimensions such as
Customers or Organization, you might have a hierarchy of values that change as often as you update
the cube. Or you might have members that you want to include in multiple places in a hierarchy, but
you don’t want to double-count the values when you aggregate those members. You can handle
each of these situations in multiple ways, but which way is best? In our business intelligence (BI)
development work, we see lots of problems in designing sales-forecasting cubes. Let’s look at a few
common cube-design problems and learn how to solve them by using some techniques that you can
apply to many types of cubes.
Designing a Sales-Forecasting Cube
When creating a sales-forecasting cube, a cube designer at our company typically gets the cube
dimensions from the customer relationship management (CRM) system that our sales team uses for
ongoing tracking of sales deals. In our CRM system, the pipeline (the list of sales contracts that
representatives are working on) puts sales deals into one of three categories: Forecast (the sales
representative expects to close the deal in the current quarter), Upside (the sales representative thinks
the deal will be difficult to close in the current quarter), and Other (the sales representative expects to
close the deal in a future quarter). Additionally, the projection defines who has agreed that a given
pipeline deal should be included in the current quarter’s forecast. Each deal in the pipeline falls into
one of four projection categories: Sales Rep Only (only the sales representative thinks the deal should
be in the forecast and the manager has overridden the sales representative to remove the deal from
the forecast), Manager Only (the manager has overridden the sales representative to include the
deal in the forecast), Sales Rep & Mgr (manager and representative agree that the deal should be in
the forecast), and Neither (nobody thinks the deal should be in the current quarter’s forecast). A
straightforward cube design might include a dimension called Projection that has a member for each
deal’s status and a dimension called Pipeline that has a member for each deal’s category, as Figure 1
shows.
Figure 1
A straightforward cube design
By choosing different combinations of Pipeline and Projection, you can quickly answer
questions such as “Which deals in the representative’s forecast did the sales manager and the sales
representative both agree to?” or “Which deals in the representative’s Upside category did the
manager override for the current quarter?” This dimension structure also lets users view the deals if
you’ve enabled drillthrough, so sales managers can quickly see which deals make up the forecast
number they’re committing to.
The problem with this dimension structure is that the Projection dimension is relevant only when
the user has selected the Pipeline dimension’s Forecast member. The sales representative is the only
person who puts deals in the Upside and Other categories. The sales manager is responsible for
agreeing or disagreeing with the sales representative’s deal categorization, but the manager’s input
affects only the Forecast member. If users choose one of the invalid combinations, they will see no
data—or even wrong data. For example, if the manager selects the deals in the current quarter’s
forecast but doesn’t select both the Sales Rep & Mgr and Manager Only projections, the projected
sales number that the cube reports for the current quarter’s forecast will be too low.
Providing Valid Data
One technique that would solve the wrong-data problem is the use of calculated members. You
could create a calculated member on the Pipeline dimension for each valid combination of Pipeline
and Projection, then hide the Projection dimension so that the manager needs to deal with only one
dimension. This technique would let sales managers easily see the target that they’d committed to for
the current quarter. The problem with this solution is that Analysis Services doesn’t support
drillthrough operations on calculated members. That limitation rules out calculated members here: in a
sales-forecasting application, drillthrough is a mandatory feature because you need to be able to view
the individual deals in the pipeline.
A better solution to this problem is one that you won’t find documented in SQL Server Books
Online (BOL). In this approach, you create one dimension that contains all the valid combinations.
Figure 2 shows the new dimension (labeled Entire Pipeline), which combines the original
Pipeline and Projection dimensions into one dimension. You’ll notice two things in the new dimension that weren’t in the original dimensions. First, deals can appear in multiple locations in the hierarchy. For example, the deals comprising the Rep Commit member are also in the Mgmt Commit
member if the manager has also committed to them. Second, the aggregation of the members to calculate their parents’ values needs to use a custom rollup formula so that the aggregation doesn’t
count duplicated records more than once. We can solve both problems without duplicating rows in
the fact table by taking advantage of the way Analysis Services joins the dimension tables together to
compute cell values.
Figure 2
Creating a new dimension
Let’s look at the relationship between the fact-table entries and the Projection dimension table,
which Figure 3 shows. The members of the Pipeline dimension (which we would have determined
by using calculated members in the previous option) now have multiple rows in the dimension table.
This structure might disprove two common assumptions about dimension tables: the assumption that
the primary key in the dimension table must be unique and the assumption that a dimension
member must correspond to only one row in a dimension table. Because of the way Analysis Services uses SQL to join the fact table to the dimension table when it builds the cube, neither of these
assumptions is enforced. Using non-normalized tables lets us have one fact-table row that corresponds to multiple dimension members and one dimension member that corresponds to multiple categories of fact-table records. Multiple rows from the fact table have the same primary key, so those
rows are included in the calculated value for that dimension member. We can calculate the correct
values for every member in the dimension without increasing the size of the fact table. For very large
fact tables, this technique can be a big time-saver, both when you’re creating the fact table and when
you’re processing the cube.
Figure 3
The relationship between fact-table entries and the Projection dimension
Of course, when a fact-table record appears in more than one dimension member, the parents
of those members won’t necessarily contain the correct value. The default method of computing a
member’s value from its children is to sum the children’s values. But summing won’t work in this
case because some fact-table records would be included more than once in the parent’s total. The
solution is to use unary operators that you associate with each member in a custom rollup calculation. The dimension table in Figure 2 shows the custom-rollup unary operators for each member in
the dimension. The + unary operator means when the parent’s value is calculated, the calculation
should add the value to the parent member, and the ~ unary operator means the calculation should
exclude the value from the parent’s value. The Mgmt Commit member consists entirely of sales deals
included in other dimension members, so Analysis Services ignores this member when calculating the
value of its parent, Entire Pipeline. Analysis Services also needs to use a custom-rollup formula
within the Mgmt Commit member because that member’s value isn’t the sum of its children. The
Override-Excluded from Reps member is important for the manager to have available for analysis
because it shows which deals the sales representative included in the forecast but the manager didn’t
commit to. However, these deals aren’t part of the Mgmt Commit value, so Analysis Services needs to
ignore them when aggregating the children of Mgmt Commit.
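Figures 2 and 3 aren't reproduced here, but a rough T-SQL sketch of what such a dimension table might contain (the table name, keys, and member values are illustrative assumptions, not the book's actual schema) shows the two ideas at work: the same fact-table key appears under more than one member, and a unary-operator column records whether a member's value is added to (+) or excluded from (~) its parent's total.

CREATE TABLE DimEntirePipeline (
  DealKey       INT         NOT NULL,   -- matches the fact-table key; deliberately not unique
  MemberName    VARCHAR(50) NOT NULL,
  ParentMember  VARCHAR(50) NULL,
  UnaryOperator CHAR(1)     NOT NULL    -- '+' adds to the parent, '~' is ignored in the rollup
)

-- The same deal rolls up under both the rep's commit and the manager's commit,
-- but only one of the two members contributes to the Entire Pipeline total
INSERT INTO DimEntirePipeline VALUES (1001, 'Rep Commit',  'Entire Pipeline', '+')
INSERT INTO DimEntirePipeline VALUES (1001, 'Mgmt Commit', 'Entire Pipeline', '~')

Because Analysis Services joins the fact table to this dimension table on the shared key, the single fact row can contribute to both members without being duplicated in the fact table.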
Now we have a cube structure that meets our needs. All the information associated with sales
deals is in one dimension and the rollup formulas are computed, so the aggregate values in the
dimension are correct. Because we used no calculated members, we can still enable drillthrough to
see individual sales deals. When you deploy this cube to your sales team, you can be confident that
the query results are accurate. You will be praised by your coworkers and be showered with gifts
and money—or maybe you’ll simply help your company’s bottom line.
You can apply these techniques to many types of cubes. If you ever get into a situation in which
you want to duplicate fact-table records in a dimension without duplicating them in the fact table, the
combination of duplicating keys and using custom rollup formulas can be a great benefit.
Chapter 4:
DTS 2000 in Action
By Larry Barnes
The first version of Data Transformation Services (DTS), which Microsoft introduced with SQL Server
7.0, gave database professionals an easy-to-use, low-cost alternative to more expensive products in
the data extraction, transformation, and loading (ETL) market. The first versions of most products
leave gaps in their coverage, however, and DTS was no exception. Microsoft provided several
enhancements in SQL Server 2000 that significantly increase DTS’s power and usability. Two new
tasks, as well as upgrades to an existing task, are standout improvements. Let’s walk through a data
ETL scenario that showcases these features as you create a SQL Server data mart from the FoodMart
sample database that ships with SQL Server 2000.
Introducing the Create FoodMart 2000 Package
How many times have you wished that you could put SQL Server through its paces on a database
larger than Northwind and Pubs? Actually, SQL Server ships with the larger FoodMart sample database, which is the source database for the FoodMart Analysis Services cube. The FoodMart database
has just one drawback—it’s a Microsoft Access database. I created a set of DTS packages that takes
the Access database and moves it to SQL Server. This scenario provides a good framework for introducing DTS’s key new features.
Before diving into the details, let’s look at Figure 1, which shows the Create Foodmart 2000 DTS
package.
Figure 1
The Create Foodmart 2000 DTS package
You can break down this package into 15 tasks that you group into five main steps:
• initializing global variables and the package state (Tasks 1—2)
• deleting the FoodMart database if it exists (Tasks 3—6)
• creating the FoodMart database and tables (Tasks 7—10)
• moving data from Access to SQL Server (Task 11)
• cleansing the data, creating star indexes, and adding referential integrity (Tasks 12—15)
Before looking at these steps in detail, let’s look at global variables—the glue that holds the
package together.
Initializing Global Variables and the Package State
Global variables are the nerve center of a DTS package because they provide a central location for
DTS to share information. To create, view, and set global variable values, go to the DTS Package
Designer’s toolbar, select Package Properties from the menu, then click the Global Variables tab,
which Figure 2 shows. SQL Server 2000’s enhanced task support for global variables incorporates
multiple task types—including ActiveX Script, Dynamic Properties, and Execute SQL tasks—which can
set and retrieve global variable values. DTS 2000 and DTS 7.0 also support a wide range of data
types, including COM components.
Figure 2
Global Variables tab
The ActiveFoodMartConnections global variable, which Figure 2 shows, is an example of a COM
component. This global variable, which I created as an output parameter in Task 4, stores an ADO
Recordset object that contains records describing all active FoodMart connections.
Task 1: Initializing global variables. To initialize the package global variables, you can write
VBScript code into an ActiveX Script task, as Listing 1 shows.
Listing 1: Script That Initializes Package Global Variables
Function Main()
    ' Set the parameters required to initialize a package at runtime.
    DTSGlobalVariables("CopyFoodmartPackage").Value = _
        "d:\demos\sql2000\foodmart\Foodmart Copy Tables.dts"
    DTSGlobalVariables("CopyFoodmartPackageName").Value = "Foodmart Copy Tables"
    DTSGlobalVariables("PackageGuid").Value = "{D0508D1B-6642-4DDD-8508-2F5DBA726C1A}"

    ' Set the Access database filename and the SQL Server connection parameters.
    DTSGlobalVariables("AccessDbFileName").Value = _
        "C:\Program Files\Microsoft Analysis Services\Samples\Foodmart 2000.mdb"
    DTSGlobalVariables("SQLServerName").Value = "(local)"
    DTSGlobalVariables("DatabaseName").Value = "FoodMart2000"
    DTSGlobalVariables("Username").Value = "sa"
    DTSGlobalVariables("Password").Value = ""

    ' Set the directory that holds the SQL Server database files
    ' (the value matches the DatabaseDir entry in Listing 2).
    DTSGlobalVariables("DatabaseDir").Value = "d:\demos\sql2000\foodmart\"

    Main = DTSTaskExecResult_Success
End Function
In VBScript, global variable assignments take the form
DTSGlobalVariables("name").Value = "Input-value"
where name is the global variable’s name and Input-value is the value that you assign to the global
variable. Note that although I use VBScript for all packages, you can also use any other installed
ActiveX scripting language, such as JScript or Perl.
Task 2: Using .ini files to initialize global variables. Now, let’s look at the way the new
Dynamic Properties task removes one of DTS 7.0’s major limitations—the inability to set key
package, task, and connection values at runtime from outside the DTS environment. In DTS 7.0,
developers had to manually configure packages as they moved through the package life cycle—from
development to testing and finally to production. With DTS 2000, the package remains unchanged
through the development life cycle; only the parameter settings made outside the package change.
In this example, I use a Windows .ini file to initialize the global variables. You can also initialize properties from environment variables, database queries, other DTS global variables, constants, and data files. Figure 3
shows the global variables that you can initialize. Note that the window also includes Connections,
Tasks, and Steps properties.
Figure 3
Global variables you can initialize
Later in this chapter, I show you how to initialize both Connections and Tasks properties. Each
global variable is linked to one entry within the specified .ini file. Figure 4 shows the Add/Edit
Assignment dialog box, in which you initialize the SQLServerName global variable with the SQLServerName key from the C:\Create-foodmart.ini file.
Figure 4
Add/Edit Assignment dialog box
Listing 2 shows the Createfoodmart.ini file's contents. Note that this .ini file's location is the only parameter in this package that isn't dynamic: you need to place the file in the root of the C drive or modify the task to point to the .ini file's new location.
Listing 2: Code for the Createfoodmart.ini File
[Foodmart Parameters]
CopyFoodmartPackage=d:\demos\sql2000\foodmart\Foodmart Bulk Copy Tables.dts
CopyFoodmartPackageName=Foodmart Bulk Copy Tables
PackageGuid={F4EE2316-97BE-43CA-9C2B-3371972435D3}
AccessDbFileName=C:\Program Files\Microsoft Analysis Services\Samples\Foodmart 2000.mdb
DatabaseDir=d:\demos\sql2000\foodmart\
SQLServerName=(local)
DatabaseName=FoodMart2000
Username=sa
The next two instances of the Dynamic Properties task use these initialized global variables to
dynamically set important connection information, the SQL Server database files directory, and the
CopyFoodMart DTS package filename, package name, and package GUID. The next four tasks delete
active FoodMart database users and drop any existing FoodMart database to make sure that the
system is ready for the database creation.
Preparing the Existing Environment
Task 3: Setting the connection parameters. The power of the Dynamic Properties task becomes
evident when you set the connection parameters. The Dynamic Properties task uses the global
variables that the .ini files have already initialized to initialize SQL Server OLE DB connection
properties. DTS in turn uses the connection properties to connect to SQL Server. On the General tab
in the Dynamic Properties Task Properties window, which Figure 5 shows, you can see that global
variables set three connection parameters and a constant value sets one parameter.
Figure 5
Dynamic Properties Task Properties General tab
Clicking Edit brings you to the Dynamic Properties Task: Package Properties window, which
Figure 6 shows. The window displays the specific property (in this case the OLE DB Data Source
property) that the global variable is initializing. Clicking Set takes you back to the Add/Edit
Assignment dialog box.
Figure 6
Dynamic Properties Task: Package Properties Window
Task 4: Getting the FoodMart connection. After you set the connection parameters, you need
to drop the existing FoodMart database. If users are logged in to the database, you have to terminate
their sessions before you take that action. Figure 7 shows the General tab in the Execute SQL Task
Properties window, which resembles the same tab in DTS 7.0. However, the Execute SQL Task
Properties window in DTS 2000 incorporates the new Parameters button and the new “?” parameter
marker in the SQL query.
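The chapter doesn't reproduce the parameterized query itself. A minimal sketch of what it might look like, assuming it reads the standard SQL Server 2000 sysprocesses table (the exact query is an assumption, not taken from the package):

SELECT spid
FROM master.dbo.sysprocesses
WHERE dbid = DB_ID(?)

The "?" marker is what the DatabaseName global variable fills at runtime, and the returned SPIDs are what Task 5 later loops through.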
Figure 7
Execute SQL Task Properties General tab
Clicking the Parameters button takes you to the Input Parameters tab in the Parameter Mapping
window, which Figure 8 shows. This window lets you pass input parameters into the Execute SQL
task and place output parameters from the Execute SQL task in global variables—actions you can’t
take in SQL Server 7.0. Let’s take a closer look.
Figure 8
Parameter Mapping Input Parameters tab
In the Parameter Mapping window, any global variable can set the SQL parameter marker,
named Parameter 1. For this task, you pass the input FoodMart database name into the query by
using the DatabaseName global variable. DTS 2000 packages give you the flexibility to specify the
database name at runtime. In contrast, SQL Server 7.0 requires you to use additional SQL statements
within the task to accomplish the same goal. Figure 9 shows how you cache the query’s output
recordset for use in the next task. On the Output Parameters tab, you can store one value at a time
by first choosing the Row Value option, then mapping the SELECT LIST values one-to-one with
global variables. You can use all values or a subset.
Figure 9
Parameter Mapping Output Parameters tab
The ability to pass input parameters into the SQL task and place output parameters from the SQL
task in global variables, as well as to store one value at a time, might seem minor at first. However,
these features let you use the Execute SQL task in more places, providing a high-performance
alternative to the DTS data pump transformation capability. As a general rule, set-based operations
perform better than transformations. When I assembled DTS packages in SQL Server 7.0, I had to
include additional SQL code within each task to set the correct input parameters and use temporary
tables to store output parameters. In DTS 2000, you can eliminate from each SQL task the code you
had to write in DTS 7.0 for passing input parameters and storing output parameters. In eliminating
the code, you reduce the volume and complexity of code and therefore the time required to develop
and test your DTS packages.
Task 5: Killing the FoodMart connections. To terminate processes that are accessing the
FoodMart database, apply the SQL Server KILL command. Task 5’s ActiveX script code loops through
the rowset that is stored in the ActiveFoodMartConnections global variable, calling the code that
Listing 3 shows. First, the ActiveX script builds the database connection string from DTS global variables, then saves the connection as a DTS global variable that future ActiveX Scripting tasks can use
without first having to define it. You can use this connection to build and execute one KILL command for every server process ID (SPID) in the output rowset. After you kill all connections, you’re
ready to drop the existing FoodMart database.
Listing 3: Code That Kills the FoodMart Connections
Function Main()
    ' Get the SQL Server connection parameters.
    srvName = DTSGlobalVariables("SQLServerName").Value
    dbName = DTSGlobalVariables("DatabaseName").Value
    strUserName = DTSGlobalVariables("Username").Value
    strPassword = DTSGlobalVariables("Password").Value

    ' Build the ADO connection string and connect to SQL Server.
    Set cn = CreateObject("ADODB.Connection")
    strCn = "Provider=SQLOLEDB;Server=" & srvName & ";User Id=" & strUserName & _
        ";Password=" & strPassword & ";"
    cn.Open strCn

    ' Cache this database connection.
    Set DTSGlobalVariables("DatabaseConnection").Value = cn

    ' Loop through the recordset that the previous Execute SQL task returned and
    ' kill each connection accessing FoodMart before dropping the database.
    Set rs = DTSGlobalVariables("ActiveFoodMartConnections").Value
    While rs.EOF <> True
        strSQL = "KILL " & CStr(rs(0))
        cn.Execute strSQL
        rs.MoveNext
    Wend

    ' Clean up.
    rs.Close
    Set DTSGlobalVariables("ActiveFoodMartConnections").Value = Nothing

    Main = DTSTaskExecResult_Success
End Function
Task 6: Dropping FoodMart. The ActiveX script that you run for Task 6 retrieves the ADO connection that you cached in the previous task, as Listing 4 shows. Then, you build the DROP DATABASE statement and execute it. Note that you have to build the statement explicitly each time for both the KILL and DROP DATABASE commands because SQL Server's Data Definition Language (DDL) doesn't support the "?" parameter marker. For that reason, you can't pass the database name or SPID as an input parameter the way you passed the FoodMart database name into Task 4's query. Now that you've finished cleaning up the environment, you're ready to build the new FoodMart database. Note that you designate the workflow from Task 6 to Task 7 as On Completion, not On Success: you want the package to continue executing if the DROP DATABASE command fails because the database doesn't exist. To change the workflow precedence, highlight the workflow arrow that connects Task 6 to Task 7, right-click, select Properties, then select Completion, Success, or Failure from the Precedence drop-down combo box.
Listing 4: Code That Drops the FoodMart Database
Function Main()
    On Error Resume Next
    ' Get the cached connection and drop the named database.
    Set cn = DTSGlobalVariables("DatabaseConnection").Value
    dbName = DTSGlobalVariables("DatabaseName").Value
    cn.Execute "DROP DATABASE " & dbName
    Main = DTSTaskExecResult_Success
End Function
Creating the FoodMart Database and Tables
You might wonder why I haven’t recommended using the Access Upsizing Wizard to move the
FoodMart database to SQL Server. Although the Upsizing Wizard, which became available in Access
95, is a helpful tool that easily migrates Access databases to SQL Server, the wizard doesn’t work as
well for large Access databases such as FoodMart. For these databases, you need to stage an Access-to-SQL Server migration in multiple steps similar to the steps in this example: creating the database,
creating the database objects, loading the database, cleansing the data, and adding referential
integrity. In deciding which utility to use, you have to take into account such factors as the underlying physical database design, table design, data type selection, and how much flexibility you have
in determining when to move and cleanse data.
Task 7: Creating FoodMart. The script that Listing 5 shows creates the FoodMart2000 database; its FoodMart2000_Master data file appears in the Data Files tab of the FoodMart2000 Properties window in Figure 10. Note that the data file's initial size is 25MB and that it grows in 10MB increments. Any extra space the database grows into during the load is reclaimed when you issue a DBCC ShrinkDatabase operation from the cleanup task. Again, I used an ActiveX Script task rather than an Execute SQL task so that I could specify at runtime the database name and the directory in which I wanted to create the new database files. I used the scripting task because DDL statements don't support parameter markers.
Listing 5: Code That Creates the New FoodMart Database
Function Main()
    ' Get the FoodMart database name and the directory where you want to
    ' create the database files.
    dbName = DTSGlobalVariables("DatabaseName").Value
    strDir = DTSGlobalVariables("DatabaseDir").Value

    ' Append a delimiter to the directory string if necessary.
    pos = InStr(Len(strDir), strDir, "\")
    If pos = 0 Then
        strDir = strDir & "\"
    End If

    ' Get the open connection.
    Set cn = DTSGlobalVariables("DatabaseConnection").Value

    ' Create the primary database, its data file, and its log file.
    strSQL = "CREATE DATABASE [" & dbName & "] "
    BuildDBFile strSQL, dbName, strDir, "_Master", ".MDF", "25", "10"
    strSQL = strSQL & " LOG "
    BuildDBFile strSQL, dbName, strDir, "_LOG", ".LDF", "20", "10"
    cn.Execute strSQL

    Main = DTSTaskExecResult_Success
End Function

Private Sub BuildDBFile(iSQL, iDBName, iDir, iAppend, iExt, iSize, iFileGrowth)
    iSQL = iSQL & " ON (NAME = N'" & iDBName & iAppend & "',"
    iSQL = iSQL & " FileName = N'" & iDir & iDBName & iAppend & iExt & "',"
    AddSize iSQL, iSize, iFileGrowth
End Sub

Private Sub AddSize(iSQL, iSize, iFileGrowth)
    iSQL = iSQL & " SIZE = " & iSize & ", FILEGROWTH = " & iFileGrowth & ")"
End Sub
Figure 10
FoodMart 2000 Properties Data Files tab
Task 8: Setting the database properties. Set Database Properties is an ActiveX Script task that initializes database-level settings by calling the sp_dboption stored procedure. One option worth noting here is the bulk-copy setting (select into/bulkcopy), which the ActiveX script code sets to true. Enabling it lets the data load faster because SQL Server doesn't log row-insert operations. However, be aware that for a nonlogged bulk load to work, your database settings must meet additional conditions, which are well documented in SQL Server Books Online (BOL).
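The script itself isn't reproduced in the chapter, but the statement it issues probably resembles the following sketch; select into/bulkcopy is the sp_dboption name that BOL documents for this setting, and the database name is the one the package creates:

EXEC sp_dboption 'FoodMart2000', 'select into/bulkcopy', 'true'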
Task 9: Initializing FoodMart's connections. The Initialize FoodMart Connections task initializes FoodMart's SQL Server OLE DB connection and the parameters required for the Execute Package
task. Figure 11 shows the General tab in the Dynamic Properties Task Properties window. You’ve
already set the OLE DB properties, so let’s set a task parameter. Clicking Edit on the General tab and
highlighting the PackageGuid destination property opens the Package Properties window, which
Figure 12 shows. In this window, you can select the task, the PackageID, and the PackageID’s default
value. Once again, the Dynamic Properties task gives you maximum flexibility for configuring a property at runtime, a capability that’s vital when you move a package between environments.
Figure 11
Dynamic Properties Task Properties General tab
Figure 12
Dynamic Properties Task: Package Properties
Task 10: Creating tables. After you set the package properties, this task creates the 24 database tables that the load will populate. Note the size of the FoodMart database: it's too large to use Access's Upsizing
Wizard. FoodMart holds enough data to warrant the explicit creation of the database schema to
optimize the final database size. Your next step is to run the initial load.
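The table DDL isn't shown in the chapter. As a hedged sketch, the fact table that Listings 6 and 7 query might be declared something like this (the column names come from those listings; the data types are assumptions):

CREATE TABLE dbo.sales_fact_1997 (
    product_id   int   NOT NULL,
    time_id      int   NOT NULL,
    customer_id  int   NOT NULL,
    promotion_id int   NOT NULL,
    store_id     int   NOT NULL,
    store_sales  money NOT NULL,
    store_cost   money NOT NULL,
    unit_sales   money NOT NULL
)

No primary key is declared here because Task 14 adds the clustered star index after the data is loaded and cleansed.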
Task 11: Moving data from Access to SQL Server. Many ETL projects are complex enough to
warrant the separation of logic into multiple packages. When you use SQL Server 7.0, linking these
multiple packages together in a workflow is a challenge. The technique commonly used, an Execute Process task that invokes the dtsrun command-line utility, is a cumbersome solution. In addition, in SQL Server 7.0 you can't set runtime parameters.
SQL Server 2000 addresses both shortcomings with a new task, the Execute Package task. You
use this task to invoke the DTS package that moves data from Access to SQL Server. I examine the
package in more detail later in this chapter. First, let’s look at the General tab in the Execute Package
Task Properties window, which Figure 13 shows. Task 9 sets the key values for this package at
runtime. The window in Figure 12 displays the available properties. Be aware that the minimum properties you need to set so that the package runs correctly with dynamically assigned properties are the PackageName, the package FileName, and the PackageGuid.
Figure 13
Execute Package Task Properties General tab
The Execute Package task incorporates another valuable feature: You can initialize the called
package from the task in the Execute Package Task Properties window. To initialize the called
package, you can choose either the Inner Package Global Variables tab or the Outer Package Global
Variables tab, which Figure 14 shows. For this example, I used Outer Package Global Variables to
initialize global variables of the same name within the called package. Figure 15 shows the called
package that you use to copy the data from Access to SQL Server. This package uses a technique
similar to the initialization technique that the main package uses. After the initialization task completes, each of the 24 transformation tasks fires and completes independently of the others.
Figure 14
Outer Package Global Variables tab
Figure 15
The called package that copies data from Access to SQL Server
In each transformation, you map the source to the destination column. DTS refers to this action
as the data pump. Figure 16 shows the transformations for the account table in the Transform Data
Task Properties window, Transformations tab. You can set one transformation for the entire row, as
Figure 16 shows, or map the table column-to-column, as Figure 17 shows. You might expect that
minimizing the number of transformations would significantly speed up the copy task’s performance.
However, my SQL Server Profiler tests showed that the timing results are similar for both packages.
One of the test runs revealed that both techniques use the BULK INSERT command to transfer information to SQL Server. Using BULK INSERT as the default transfer mechanism is another new SQL Server 2000 feature. When you use BULK INSERT capabilities, you can greatly improve execution time for your
transformation tasks. However, this performance gain comes at a cost: Inserting data in bulk mode
doesn’t work with the new SQL Server 2000 logging features.
Figure 16
Transform Data Task Properties Transformations tab
Figure 17
Mapping the table column-to-column
To understand the problem, let’s look at Figure 18, which shows the Options tab for one of the
transformations. Note that the Use fast load option is enabled by default for a copy transformation.
Disabling this feature changes the method of loading the destination data rows from a nonlogged,
bulk-load interface to a logged interface. The quick Profiler timing tests I ran on my machine show
that the task runtime is more than 10 times longer when you disable Use fast load. However, when
you run a transformation with Use fast load enabled, you can’t take advantage of one of the new
SQL Server 2000 logging features, which lets you save copies of all rows that fail during the
transformation. This logging feature is valuable because it lets you log and later process all failing
rows for a particular transformation. ETL processing often requires you to make choices—and a
trade-off accompanies every choice. Here, you must decide between set-based processing and
row-based processing when you build your transformations. Set-based processing usually provides
better performance, whereas row-based processing gives you more flexibility. I use set-based processing in the next two tasks, in which I cleanse the data and create primary keys and referential
integrity.
Figure 18
Transform Data Task Properties Options tab
Tasks 12 and 13: Cleansing the data. The FoodMart Access database suffers from data-quality
problems. For this exercise, let’s look at the snowflake schema for the sales subject area, whose key
values and structure Figure 19 shows.
Figure 19
Snowflake schema for the sales subject area
The sales_fact 1997 table holds foreign key references to FoodMart’s key dimensions: products,
time, customers, promotions, and store geography. I structured the products and store dimensions in
a snowflake pattern to reflect the hierarchies for each dimension; for example, each product has a
product family, department, category, subcategory, and product brand. The fact tables contain duplicate rows: 8 in sales_fact_1997 and 29 in sales_fact_1998. If you want to apply star indexes
and referential integrity to the star schema, you have to purge the duplicated data. This challenge is
nothing new to developers with data warehouse experience; typically 80 percent of total project time
is spent on data cleansing. The ETL developer has to decide whether to use set-based processing or
row-based processing for the data-cleansing phase of the project. For this example, I used set-based
processing. To cleanse the sales_fact_1997 table, you can run the SQL code that Listing 6 shows.
Listing 6: Code That Cleanses the sales_fact_1997 Table
SELECT time_id, product_id, store_id, promotion_id, customer_id, COUNT(*) AS dup_count
INTO #tmp_sales_fact_1997
FROM dbo.sales_fact_1997
GROUP BY time_id, product_id, store_id, promotion_id, customer_id
HAVING COUNT(*) > 1

BEGIN TRANSACTION

DELETE dbo.sales_fact_1997 FROM dbo.sales_fact_1997 s
INNER JOIN #tmp_sales_fact_1997 t
    ON s.time_id = t.time_id
    AND s.product_id = t.product_id
    AND s.store_id = t.store_id
    AND s.promotion_id = t.promotion_id
    AND s.customer_id = t.customer_id

DROP TABLE #tmp_sales_fact_1997
COMMIT
The first step in cleansing the data is to find all the rows that contain duplicate entries and create
a spot to store them; in this example, the code stores the results in a temporary table. Next, it deletes
the duplicate entries from the fact table. Then, the code deletes the table that it used to store the
duplicates. Note that in using set-based processing to cleanse information before inserting it into the
star or snowflake schema, you introduce data loss because you don’t re-insert the distinct duplicate
rows that contain identical key values into the table. I decided to use set-based processing in this
example because I don’t know enough about the underlying data to determine which of the
duplicate rows is the correct one. In a real project, you place these duplicate rows in a permanent
table in a data warehouse or data mart metadata database that you establish to store rows that fail the
data-cleansing process. You can then examine these failed rows to determine what exception
processing should occur. The data mart database also stores additional information about package
execution, source data, and other key information that describes and documents the ETL processes
over time. After cleansing the fact tables, you can create star indexes and add referential integrity.
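As a sketch of that idea, you could preserve the duplicates before Listing 6's DELETE runs by copying them into a permanent exception table; the table name here is hypothetical:

SELECT s.*
INTO dbo.sales_fact_1997_rejects
FROM dbo.sales_fact_1997 s
INNER JOIN #tmp_sales_fact_1997 t
    ON s.time_id = t.time_id
    AND s.product_id = t.product_id
    AND s.store_id = t.store_id
    AND s.promotion_id = t.promotion_id
    AND s.customer_id = t.customer_id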
Task 14: Creating star indexes. Task 14 creates a primary key for both the sales_fact_1997 and
sales_fact_1998 tables. The primary key, which is also called a star index, is a clustered index that
includes each of the fact tables’ foreign keys that reference a dimension’s primary key. You can
realize several benefits from creating primary keys; one significant benefit is that the query optimizer
can use this primary key for a clustered index seek rather than a table scan when it builds its access
plan. The query optimizer takes advantage of the star index in the code example that Listing 7 shows.
Note that the query-access patterns demonstrate how much the star index can speed up your queries;
for example, the execution time in the query that Listing 7 shows plummeted by two-thirds when I
added the star index. Using a star index in queries for very large databases (VLDBs) carries another
important benefit: The query optimizer might decide to implement a “star join,” which unions the
smaller dimensions together before going against the fact table. Usually, you want to avoid unions
for the sake of efficient database optimization. However, a star join is a valid and clever optimization
technique when you consider that the fact table might be orders of magnitude larger than its
dimensions.
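The chapter doesn't show Task 14's DDL. A minimal sketch for sales_fact_1997, assuming the clustered primary key covers the same five key columns that Listing 6 groups on (the constraint name is hypothetical):

ALTER TABLE dbo.sales_fact_1997
ADD CONSTRAINT PK_sales_fact_1997
PRIMARY KEY CLUSTERED (time_id, product_id, store_id, promotion_id, customer_id)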
Listing 7: Query That Takes Advantage of the Star Index on the sales_fact Tables
SELECT t.the_year, t.quarter, p.brand_name, c.state_province, c.city, s.store_name,
    SUM(sf.store_sales) AS Sales, SUM(sf.store_cost) AS Cost,
    SUM(sf.unit_sales) AS "Unit Sales"
FROM sales_fact_1998 sf
    INNER JOIN customer c ON sf.customer_id = c.customer_id
    INNER JOIN product p ON sf.product_id = p.product_id
    INNER JOIN time_by_day t ON sf.time_id = t.time_id
    INNER JOIN store s ON sf.store_id = s.store_id
WHERE t.the_year = 1998 AND t.quarter = 'Q3' AND p.brand_name = 'Plato'
GROUP BY t.the_year, t.quarter, p.brand_name, c.state_province, c.city, s.store_name
ORDER BY t.the_year, t.quarter, p.brand_name, c.state_province, c.city, s.store_name
Task 15: Adding referential integrity. The last major task in this package adds referential
integrity, which links all the star schema’s foreign keys to their associated dimensions’ primary keys.
As a general rule, adding referential integrity is beneficial because it ensures that the integrity of the
data mart is uncompromised during the load phase. Administrators for large data warehouses might
choose not to implement this step because of the extra overhead of enforcing referential integrity
within the database engine. Cleanup, the final task, uses an ActiveX script to invoke DBCC
ShrinkDatabase and to clean up the connection that the global variables are storing. A production-quality DTS package includes additional tasks, such as a mail task that sends the status of the
package execution to the DBA team.
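Task 15's DDL isn't shown either. A hedged sketch of one of its constraints, using the dimension key columns that appear in Listing 7's joins (the constraint name is hypothetical):

ALTER TABLE dbo.sales_fact_1997
ADD CONSTRAINT FK_sales_fact_1997_customer
FOREIGN KEY (customer_id) REFERENCES dbo.customer (customer_id)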
Change Is Good
Sometimes little things make a big difference. This maxim is certainly true for SQL Server 2000’s DTS
enhancements. The Create Foodmart 2000 package showcases two new tasks in particular: the
Dynamic Properties and Execute Package tasks, which help DTS programmers implement production-quality packages. And when Microsoft added I/O capabilities to the Execute SQL task, the company
established global variables as the hub of activity within a DTS package.
Chapter 5:
Rock-Solid MDX
By Russ Whitney
The MDX language is powerful but not easy to use. On the surface, MDX looks like SQL, but it can
quickly become more complex because of the multidimensional nature of the underlying cube data.
After more than 3 years of using MDX, I’ve found I’m more productive when I apply design and
debugging techniques that help me better understand MDX and create more accurate MDX
statements. The techniques I use for developing MDX are similar to those I use for developing
software in other languages: for complex problems, I use pseudo coding and debug the code by
displaying intermediate results. Let’s look at an example of how I use these techniques, and along the
way, I’ll show you how to use a .NET language to develop MDX user-defined functions (UDFs).
If you have any formal software-development education, you know that to solve a complex
problem, you first break the problem into parts and solve each part independently. Then, it’s always
a good idea to step through each line of your code, using a debugger to verify that the code works
as intended. Most software developers know that these practices are good habits, but not enough
programmers apply them. These good programming habits can help you effectively deal with MDX’s
complexity.
For example, say you need to answer a typical business question such as, “Based on unit sales,
what are the top three brand names for each product department?” The MDX query that Listing 1
shows answers the question; Figure 1 shows the results.
Listing 1: Query That Returns the Top Three Brand Names Based on Unit Sales
SELECT {[Unit Sales]} ON COLUMNS,
GENERATE( [Product Department].MEMBERS, {
[Product].CURRENTMEMBER,
TOPCOUNT( DESCENDANTS(
[Product].CURRENTMEMBER,
[Brand Name] ),
3, [Unit Sales]
) } ) ON ROWS
FROM Sales
Figure 1
The results generated by the query in Listing 1
I used the familiar FoodMart 2000 Sales cube that comes with Analysis Services as the basis for
my example. I have enough experience with MDX that when I wrote this query, it ran the first time
(thus I skipped the good habit of breaking the code into parts). But the query is complicated because
it performs ranking (TOPCOUNT) inside an iterative loop (GENERATE), and I wasn’t sure I was getting the answer I really wanted. Let’s see how I work through the problem in a way that emphasizes
modularity (i.e., addressing each part of the problem separately) and accuracy. First, I use a design
methodology called pseudo coding. Pseudo coding is a process of writing in plain language the steps
for how you plan to implement your solution. For this problem, I want my code to follow the process that the pseudo code below describes.
For each product department,
1. find the set of all brand names for this product department
2. return the product department name
3. return the three brand names that have the most unit sales
When I start to translate this pseudo code into MDX, I get the following:
<<ANSWER>> = GENERATE( [Product Department].MEMBERS, <<Dept and Top Brands>> )
Here, the GENERATE() function steps through a set of items and evaluates an MDX expression
for each item in the set. This statement shows that to get the answer, I need to determine the product
department name and the top brand names within it for each product department. Next, I expand the
<<Dept and Top Brands>> item in the previous statement to call out the current product department.
The following expression shows that I need another expression to determine the top brands within
this department:
<<Dept and Top Brands>> = { [Product].CURRENTMEMBER, <<Top Brands Within Dept>> }
To determine the top brands within the product department, I use the TOPCOUNT() function
and specify that I want the top three brands based on unit sales:
<<Top Brands Within Dept>> = TOPCOUNT( <<Brands Within Dept>>, 3, [Unit Sales] )
Finally, I determine the brands within the product department by using the DESCENDANTS()
function with the selected product department:
<<Brands Within Dept>> = DESCENDANTS( [Product].CURRENTMEMBER, [Brand Name] )
Remember, the GENERATE() function steps through the product departments and sets the
product dimension’s CURRENTMEMBER to the name of the current product department while evaluating the inner MDX expression.
If I take the MDX code fragments I created above and use the WITH statement to turn the code
into a modular MDX statement, I get the MDX statement that Listing 2 shows. In Listing 2, I’ve used
WITH statements to separate two of the three pseudo code steps from the main body of the query
(SELECT ...FROM) to improve readability and make the overall query use a more modular approach
to solve the problem. If I execute this new MDX statement in the MDX Sample Application, I get the
answer that Figure 2 shows. Notice that Figure 2’s results aren’t the same as Figure 1’s even though I
used the same MDX functions to develop the queries. Which answer is correct?
Listing 2: Modular Version of Listing 1's MDX Statement
WITH
SET [Brands Within Dept] AS
    'DESCENDANTS( [Product].CURRENTMEMBER, [Brand Name] )'
SET [Top Brands Within Dept] AS
    'TOPCOUNT( [Brands Within Dept], 3, [Unit Sales] )'
SELECT {[Unit Sales]} ON COLUMNS,
GENERATE( [Product Department].MEMBERS,
    { [Product].CURRENTMEMBER, [Top Brands Within Dept] } ) ON ROWS
FROM Sales
Figure 2
The results generated by the MDX statement
Close examination reveals that Figure 2 definitely doesn’t show the right answer. For one thing,
Hermanos isn’t a brand in the Alcoholic Beverages department. But even if you didn’t know that Hermanos belongs in the Produce department, you’d likely notice that the Unit Sales values of the three
brands listed as the top brands in the Alcoholic Beverages department (Hermanos, Tell Tale, and
Ebony) total more than the amount for the whole Alcoholic Beverages department ($6838.00). These
two incongruities prove that Figure 2 shows the wrong answer, but how can I find out whether
Figure 1 shows the correct answer?
To answer this question and to understand how MDX executes this query and other complex
queries, I developed a simple MDX debugging tool. This tool is an MDX UDF that uses the Windows
MessageBox() function to display any string. The UDF lets you display on the screen intermediate
results inside an MDX query while the query is executing. Listing 3 shows the UDF’s source code,
which I wrote in C#.
Listing 3: MDX UDF Written in C#
using System;
using System.Windows.Forms;
using System.Runtime.InteropServices;

namespace dotNETUDFs
{
    /// <summary>
    /// Functions for use in MDX
    /// </summary>
    [ClassInterface(ClassInterfaceType.AutoDual)]
    public class MDXFuncs
    {
        private int counter = 10;

        // Constructor
        public MDXFuncs()
        {
            MessageBox.Show("MDXFuncs constructed");
        }

        // This controls how many message boxes will display
        // before the next Reset() must be called.
        public int Reset(int Count)
        {
            counter = Count;
            return Count;
        }

        // Displays a message box with the specified caption and contents
        // and returns the contents. Once the counter goes to zero, you
        // must call Reset() again for messages to appear.
        public string MsgBox(string caption, string sList)
        {
            if (counter > 0)
            {
                MessageBox.Show(sList, caption);
                counter -= 1;
            }
            return sList;
        }
    }
}
It took me a while to figure out the steps for developing a UDF with C#. So if you haven’t
already developed an MDX UDF with a .NET language, here are the steps you need to follow:
1. Create a .NET class library project.
2. Edit the line in the AssemblyInfo.cs file that contains the AssemblyVersion information so that
it contains a hard-coded version number rather than an auto-generated version number. In my
UDF, I used the following line:
[assembly: AssemblyVersion("1.0.0.0")]
.NET is picky about assembly version numbers, and without a constant version number, I
couldn’t get MDX to recognize my UDFs.
3. Open the Project-Properties dialog box and change the Register for COM Interop flag in the
Build properties to TRUE. This change registers your .NET class library as a COM DLL, which
is required for MDX UDFs.
4. Place a ClassInterface statement just before the start of the class definition, as Listing 3 shows.
This statement tells Visual Studio how to expose the class to the COM interoperability layer.
5. Add a using System.Runtime.InteropServices statement at the start of your C# source file, as
Listing 3 shows. The ClassInterface statement in Step 4 requires InteropServices.
When these steps are complete, you’re ready to add methods to your class definition, compile
them, and use them from MDX. For my UDF, I created a method called MsgBox() that displays on
the screen a box containing a message and caption that I specified as the method’s parameters. The
method returns the message that it displays so that you can embed the method in the middle of an
MDX query without altering the query results.
Compiling a C# project creates a DLL and a TLB file in the project’s bin/Debug subdirectory. The
TLB file is the COM type library that you need to register with Analysis Services to make your C#
methods available for use. I used the following statement in the MDX Sample Application to register
my type library. Note that dotNETUDFs is the name I chose for my C# project.
USE LIBRARY "C:\Documents and Settings\rwhitney\My Documents\Visual Studio Projects\dotNETUDFs\bin\Debug\dotNETUDFs.tlb"
After the library is registered, you can immediately start using the C# methods. The query in
Listing 4 shows the code I used to embed the C# MsgBox() method inside Listing 1's MDX query.
MsgBox() requires and returns only string items, but the TOPCOUNT() function returns a set of members. To make the two functions compatible, I sandwiched the MsgBox() method between the MDX
functions STRTOSET() and SETTOSTR() to convert the TOPCOUNT() set into a string and back to a
set. Figure 3 shows the first message that the screen displays when you execute Listing 4’s query.
Listing 4: Query That Contains the C# MsgBox() Method
SELECT {[Unit Sales]} ON COLUMNS,
GENERATE( [Product Department].MEMBERS, { [Product].CURRENTMEMBER,
    STRTOSET( MsgBox( "TOPCOUNT Results", SETTOSTR(
        TOPCOUNT( DESCENDANTS( [Product].CURRENTMEMBER, [Brand Name] ), 3, [Unit Sales] )
    )))
} ) ON ROWS
FROM Sales
Figure 3
The results generated by the query in Listing 4
In the C# MsgBox() method, notice that I use a counter variable to limit the number of times a
message is displayed on the screen. This limit is helpful when the MsgBox() method is called hundreds or thousands of times in a query. I could also achieve the same result by using a Cancel button
on the message box rather than a counter. When the counter in my example reaches its limit, I must
call the Reset method to restore the counter to a nonzero value so that it once again displays messages. I used the following separate MDX query to call the Reset method:
WITH MEMBER Measures.Temp AS 'Reset(5)'
SELECT { Temp } ON COLUMNS
FROM Sales
Now I could use the MsgBox() method to figure out why the query in Listing 2 returned the
wrong result. I altered Listing 2’s query as Listing 5 shows. I used the MsgBox() method to display
what the CURRENTMEMBER of the product dimension was when the [Brands Within Dept] set was
evaluated. I learned that the [Brands Within Dept] set was evaluated only twice during the query execution instead of each time GENERATE() discovered a product department. Also, the CURRENTMEMBER was the All member (i.e., the topmost member) of the product dimension, not a product
department. This means that Analysis Services evaluates and caches a WITH SET clause for the rest of
the query execution. That’s why Listing 2’s query results were wrong.
Listing 5: Query That Uses the MsgBox() Method to Discover the Problem
WITH
SET [Brands Within Dept] AS
    'DESCENDANTS(
        STRTOTUPLE( MsgBox( "Product CURRENTMEMBER", TUPLETOSTR(
            (Product.CURRENTMEMBER)
        ))).item(0),
        [Brand Name] )'
SET [Top Brands Within Dept] AS 'TOPCOUNT( [Brands Within Dept], 3, [Unit Sales] )'
SELECT {[Unit Sales]} ON COLUMNS,
GENERATE( [Product Department].MEMBERS, { [Product].CURRENTMEMBER, [Top Brands Within Dept] } ) ON ROWS
FROM Sales
By designing your MDX queries one part at a time, as I demonstrated in this example with
pseudo code, you can tackle complex problems. Then, you can make sure the queries are operating
correctly by displaying the results one part at a time. I hope you find this powerful two-part process
useful for creating your own MDX.
Chapter 6:
XML for Analysis:
Marrying OLAP and Web Services
By Rob Ericsson
XML for Analysis (XMLA)—a Web-service standard proposed and supported by Microsoft and leading
OLAP companies—brings together Web services and OLAP technologies by providing an XML
schema for OLAP and data-mining applications. Essentially, XMLA lets you explore and query
multidimensional data through Web services, which means analytical applications can move away
from their expensive and difficult-to-maintain client/server roots toward a more flexible, Web-based
architecture.
XML Web services architectures connect applications and components by using standard Internet
protocols such as HTTP, XML, and Simple Object Access Protocol (SOAP). These architectures offer
the promise of interoperable distributed applications that can be shared between and within
enterprises. Amazon.com, for example, uses Web services to support associate programs that let third
parties sell from its catalog, and Microsoft’s MapPoint Web service integrates location-based services
into a variety of applications. Web services are becoming crucial pieces of enterprise application
architecture by letting you loosely couple services from disparate applications in a way that’s easy to
maintain as business processes change.
The XMLA specification, available at http://www.xmla.org/, describes the following design goals:
• Provide to remote data-access providers a standard data-access API that application developers
can use universally across the Internet or a corporate intranet to access multidimensional data.
• Optimize a stateless architecture that requires no client components for the Web and minimal
round-trips between client and server.
• Support technologically independent implementations of XMLA providers that work with any
tool, programming language, technology, hardware platform, or device.
• Build on open Internet standards such as SOAP, XML, and HTTP.
• Leverage and reuse successful OLE DB design concepts so that application developers can easily
enable OLE DB for OLAP applications and OLE DB providers for XMLA.
• Work efficiently with standard data sources such as relational OLAP databases and data-mining
applications.
By fulfilling these design goals, XMLA provides an open, industry-standard way to access
multidimensional data from many different sources through Web services—with support from multiple
vendors.
XMLA is based on SOAP, and you can use it from any application-programming language that
can call SOAP methods, such as Visual Basic .NET, Perl, or Java. SOAP is a lightweight, XML-based
protocol for exchanging structured and type information over the Web. Structured information
contains content and an indication of what that content means. For example, a SOAP message might
have an XML tag in it called CustomerName that contains customer name information. A SOAP
message is an XML document that consists of a SOAP envelope (the root XML element that provides
a container for the message), an optional SOAP header containing application-specific information
(e.g., custom-authentication information), and a SOAP body, which contains the message you’re
sending. Calling SOAP methods is simply a matter of wrapping the arguments for the SOAP method
in XML and sending the request to the server. Because SOAP’s overall goal is simplicity, the protocol
is modular and easy to extend to new types of applications that can benefit from Web services. You
can use Internet standards to integrate SOAP with your existing systems. Most mainstream development platforms offer some support for calling SOAP-based Web services. Both Java 2 Enterprise
Edition (J2EE) and the Microsoft .NET Framework have strong support for Web services, making the
invocation of remote services almost transparent to the developer.
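To make that structure concrete, here is a minimal sketch of such a message; the CustomerName element and its value are purely illustrative placeholders, not part of any XMLA schema:

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header>
    <!-- optional application-specific information, such as custom authentication -->
  </soap:Header>
  <soap:Body>
    <!-- the message you're sending -->
    <CustomerName>Example Customer</CustomerName>
  </soap:Body>
</soap:Envelope>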
Besides working with XMLA directly, you can use the Microsoft .NET-based ADO MD.NET
library to build .NET applications that use XMLA. ADO MD.NET is the successor to the OLE DB for
OLAP—based ADO MD. However, I don’t cover ADO MD.NET in this chapter. Instead, I show
you how to use the underlying XMLA protocol to build an analytic application on any device or
platform or in any language that supports XML. I assume you have some knowledge of OLAP
fundamentals, at least a passing familiarity with MDX, and some exposure to XML. For an introduction to XML Web services, see Roger Wolter's Microsoft article "XML Web Services Basics" at
http://msdn.microsoft.com/library/en-us/dnwebsrv/html/webservbasics.asp. You’ll find an even more
basic and technology-neutral introduction in Venu Vasudevan’s Web services article “A Web Service
Primer” at http://webservices.xml.com/pub/a/ws/2001/04/04/webservices/index.html.
Installing XMLA
To use XMLA with SQL Server 2000, download the XML for Analysis Software Development Kit
(SDK), available at http://www.microsoft.com/downloads/details.aspx?familyid=7564a3fd-4729-4b09-9ee7-5e71140186ee&displaylang=en, and install it on a Web server that can access your Analysis
Services data source through OLE DB for OLAP. (You can simply use the server that has Analysis
Services installed on it.) SQL Server 2005 Analysis Services will support XMLA as a native protocol, so
you won’t have to separately install XMLA. But for now, this step is necessary.
Installing the SDK is straightforward, but to run the installer, you must be logged on as an
Administrator to the machine on which you’re performing the installation. When you double-click the
XMLADSK.msi installation package, the installer walks you through the process. Unless you have a
Secure Sockets Layer (SSL) certificate configured on your Web server, you need to select Enable
HTTP and HTTPS during the Connection Encryption Settings step to allow your SQL Server unsecured
communication with the XMLA Provider through HTTP. Note that using the XMLA Provider in
unsecured mode isn’t a good idea for a production system because the provider will pass your data
across the network in plain text for anyone to intercept. But for just learning about XMLA in a
non-production environment, you’re probably OK using unsecured communication.
After installing the SDK, you need to set up the data sources that you’re going to connect to
through XMLA and make the server available to clients by creating a virtual directory for the XMLA
Provider. To set up the data sources, you edit the datasources.xml file in the Config subfolder of the
installation folder you selected when installing the provider. The default path for installation is
C:\Program Files\Microsoft XML for Analysis SDK\. The datasources.xml file contains a preconfigured
example connection for the Local Analysis Server that you can copy to set up your own data sources.
Figure 1 shows part of the datasources.xml file. The most important parts of this file are the required
elements that facilitate the connection to the OLAP data source: DataSourceName for naming the data
source; DataSourceDescription for adding a text description of the data source; URL, which provides
the URL for the XMLA Provider; DataSourceInfo, which describes the OLE DB for OLAP connection
to the Analysis Servers; ProviderType, which enumerates the type or types of provider being
referenced—tabular data provider (TDP), multidimensional data provider (MDP), data-mining provider
(DMP); and AuthenticationMode (Unauthenticated, Authenticated, or Integrated), which describes how
the Web service will authenticate connections to the provider. The XML for Analysis Help file (which
you installed with the SDK at \Microsoft XML for Analysis SDK\Help\1033\smla11.chm) contains
complete information about all these configuration options.
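As a rough sketch of what one entry in the file looks like (the wrapper element names and sample values are assumptions based on the SDK's preconfigured Local Analysis Server example, not an exact copy of the file):

<DataSources>
  <DataSource>
    <DataSourceName>Local Analysis Server</DataSourceName>
    <DataSourceDescription>Analysis Services on the local machine</DataSourceDescription>
    <URL>http://localhost/xmla/msxisapi.dll</URL>
    <DataSourceInfo>Provider=MSOLAP;Data Source=local</DataSourceInfo>
    <ProviderType>MDP</ProviderType>
    <AuthenticationMode>Unauthenticated</AuthenticationMode>
  </DataSource>
</DataSources>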
Figure 1
Partial datasources.xml file
Once you’ve set up the data sources, you need to create in Microsoft IIS a virtual directory for
the XMLA Provider. The virtual directory lets IIS access a specific folder on the server through HTTP,
which is how we’ll connect to the XMLA Provider for this example. The easiest way to set up a
virtual directory is to open the IIS Manager, select the server on which you want to create the virtual
directory, right-click the Web site you want to use for the XMLA Provider, and select New, Virtual
Directory. The Virtual Directory Creation Wizard then guides you through the rest of the process. The
first step is to name the virtual directory; XMLA is usually a good choice. Next, you select the content
directory, which lets IIS map files in that directory to HTTP requests. For the XMLA Provider, the
content directory is the path to the Msxisapi.dll file installed in the C:\Program Files\Microsoft XML
For Analysis SDK\Isapi folder (the default location) during setup. Then, set the access permissions for
this folder by selecting the Read, Run Scripts, and Execute check boxes, and finish the wizard.
After you configure the virtual directory, you set access permissions on it. In IIS Manager,
right-click the virtual directory you just created and select Properties. In the Properties window, select
the Directory Security tab and configure the security permissions. For learning about how XMLA
works, the default permissions setting (anonymous access) is sufficient.
If you’re configuring the XMLA Provider on Windows Server 2003, you must take some additional
steps to enable the protocol on the server. The XMLA Help topic “Enable the XML for Analysis Web
Service Extension on Windows Server 2003” tells you how to get the XMLA Provider to work on
Windows Server 2003.
Using XMLA: Discover and Execute
One of XMLA’s greatest strengths is that it simplifies data retrieval compared to working directly with
OLE DB for OLAP. The XMLA Provider has only two methods: Discover and Execute. You use the
Discover method to retrieve metadata that describes the services a specific XMLA Provider supports.
You use the Execute method to run queries against the Analysis Services database and return data
from those queries.
Discover. Discover is a flexible method that a client can use repeatedly to build a picture of the
configuration and capabilities of the data provider. So, for example, a client might first request the list
of data sources that are available on a particular server, then inquire about the properties and
schemas those data sources support so that a developer can properly write queries against the data
source. Let’s look at the arguments you send to Discover, then walk through some examples that
show how to use the method.
Listing 1’s XML code shows a SOAP call to retrieve a list of data sources from the server. The first
parameter, RequestType, determines the type of information that Discover will return about the
provider.
Listing 1: XML Code Using a SOAP Call
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <soap:Body>
    <Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
      <RequestType>DISCOVER_DATASOURCES</RequestType>
      <Restrictions/>
      <Properties/>
    </Discover>
  </soap:Body>
</soap:Envelope>
The available types let you get a list of the data sources available on the server
(DISCOVER_DATASOURCES), a list of properties about a specific data source on the server
(DISCOVER_PROPERTIES), a list of supported request types (DISCOVER_SCHEMA_ROWSETS), a list
of the keywords the provider supports (DISCOVER_KEYWORDS), and a schema rowset constant to
retrieve the schema of a provider-defined data type. Table 1 lists the RequestType parameters.
TABLE 1: RequestType Parameters
DISCOVER_DATASOURCES: A list of data sources available on the server.
DISCOVER_PROPERTIES: A list of information and values about the requested properties that the specified data source supports.
DISCOVER_SCHEMA_ROWSETS: The names, values, and other information of all supported RequestTypes enumeration values and any additional provider-specific values.
DISCOVER_ENUMERATORS: A list of names, data types, and enumeration values of enumerators that a specific data source's provider supports.
DISCOVER_KEYWORDS: A rowset containing a list of keywords reserved by the provider.
DISCOVER_LITERALS: Information about literals the data source provider supports.
Schema Rowset Constant: The schema rowset that the constant defines.
The second parameter, Restrictions, lets you put conditions on the data that Discover returns. The
RequestType in the call to the Discover method determines the fields that the Restrictions parameter
can filter on. Table 2 describes the fields that the various schema types in XMLA can use to restrict
returned information. If you want to return all the data available for a given RequestType, leave the
Restrictions parameter empty.
TABLE 2: Fields That XMLA Schema Types Can Use to Restrict Data the Discover Method Returns
DISCOVER_DATASOURCES
    DataSourceName: The name of the data source (e.g., FoodMart 2000).
    URL: The path XMLA methods use to connect to the data source.
    ProviderName: The name of the provider behind the data source.
    ProviderType: An array of one or more of the provider-supported data types: MDP for multidimensional data provider, TDP for tabular data provider, and DMP for data mining provider.
    AuthenticationMode: The type of security the data source uses. Unauthenticated means no UID or password is needed. Authenticated means that a UID and password must be included in the connection information. Integrated means that the data source uses a built-in facility for securing the data source.
DISCOVER_PROPERTIES
    PropertyName: An array of the property names.
DISCOVER_SCHEMA_ROWSETS
    SchemaName: The name of the schema.
DISCOVER_ENUMERATORS
    EnumName: An array of the enumerator's names.
DISCOVER_KEYWORDS
    Keyword: An array of the keywords a provider reserves.
DISCOVER_LITERALS
    LiteralName: An array of the literals' names.
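As an illustration of the Restrictions parameter, the following hedged sketch (only the Discover element is shown; the surrounding SOAP envelope is omitted) restricts a DISCOVER_PROPERTIES request to a single property. The RestrictionList wrapper follows the XMLA specification's convention rather than an example from this chapter, and Catalog is simply one of the property names from Table 3:

<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
  <RequestType>DISCOVER_PROPERTIES</RequestType>
  <Restrictions>
    <RestrictionList>
      <PropertyName>Catalog</PropertyName>
    </RestrictionList>
  </Restrictions>
  <Properties/>
</Discover>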
The Properties parameter provides additional information about the request that the other parameters don’t contain. For example, Timeout specifies the number of seconds the provider will wait for
the Discover request to succeed before returning a timeout message. Table 3 lists some common
XMLA Provider for Analysis Services properties you’re likely to use. You can specify properties in any
order. If you don’t specify a Properties value, Discover uses the appropriate default value.
TABLE 3: Common Properties Available in the XMLA Provider for Analysis Services

AxisFormat (default: TupleFormat)
  The format for the MDDataSet Axis element. The format can be either TupleFormat or ClusterFormat.
BeginRange (default: -1, all cells)
  An integer value that restricts the data set a command returns to start at a specific cell.
Catalog (default: empty string)
  The database on the Analysis Server to connect to.
DataSourceInfo (default: empty string)
  A string containing the information needed to connect to the data source.
EndRange (default: -1, all data)
  An integer value that restricts the data set a command returns to end at a specific cell.
Password (default: empty string)
  A string containing password information for the connection.
ProviderName (default: empty string)
  The XML for Analysis Provider name.
Timeout (default: undefined)
  A numeric timeout that specifies in seconds the amount of time to wait for a connection to be successful.
UserName (default: empty string)
  A string containing username information for the connection.
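To give you a feel for how these properties ride along in a request, here’s a sketch of a PropertyList that names the FoodMart 2000 catalog and asks for a 30-second timeout (the values are illustrative):

<Properties>
  <PropertyList>
    <DataSourceInfo>Provider=MSOLAP;Data Source=local</DataSourceInfo>
    <Catalog>FoodMart 2000</Catalog>
    <Timeout>30</Timeout>
  </PropertyList>
</Properties>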
The Discover method call in Listing 1 returns results in XML. The settings you give the parameters RequestType, Restrictions, and Properties determine the contents of Result, which is an output
parameter. In Listing 1, note that I set RequestType to DISCOVER_DATASOURCES and Restrictions
and Properties to null so that Discover returns the entire list of data sources in the default format
(tabular format in this case). To call a SOAP method, you have to send the SOAP envelope to the
Web service through HTTP. I’ve provided a sample Web application, which you can download at
InstantDoc ID 44006. The sample application shows exactly how you might send a SOAP envelope in
JScript by using the Microsoft.XMLHTTP object in the SubmitForm() method. The sample also shows
you more examples of how to use the Discover method and how to use the data-source information
retrieved from the first call to Discover to populate the next call to Discover.
Execute. After you use Discover to determine the metadata for the data source, you can use that
metadata to retrieve data. For data retrieval, XMLA provides the Execute method. The method call for
Execute looks like this:
Execute (Command,Properties,
Results)
As Listing 2’s SOAP call to Execute shows, the Command parameter contains in a <Statement>
tag the MDX statement you want to run against your OLAP server. Similar to the Properties parameter
in the Discover method, the Properties parameter in Execute provides additional information that
controls the data the method returns or the connection to the data source. You must include the
Properties tag in your Execute method call, but the tag can be empty if you want to use the defaults
for your request. The Results parameter represents the SOAP document the server returns. Results’
contents are determined by the other two parameters.
Listing 2 SOAP call to Execute
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
  xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
      <Command>
        <Statement>select {[Product].children} on rows,
          {[Store].children} on columns from Sales
        </Statement>
      </Command>
      <Properties>
        <PropertyList>
          <DataSourceInfo>
            Provider=MSOLAP;Data Source=local
          </DataSourceInfo>
          <Catalog>FoodMart 2000</Catalog>
          <Format>Multidimensional</Format>
          <AxisFormat>TupleFormat</AxisFormat>
        </PropertyList>
      </Properties>
    </Execute>
  </soap:Body>
</soap:Envelope>
Listing 2’s code shows an example of a call to Execute that contains an MDX SELECT statement.
You call the Execute method the same way you call the Discover method, by sending the SOAP
envelope to the Web service through HTTP. As with any SOAP request, the entire message is
contained in a SOAP envelope. Within the SOAP envelope, the SOAP body contains the guts of the
Execute method call, starting with the Command parameter. The Command parameter contains
the MDX query that will run on the server. The Properties parameter comes next, containing the
PropertyList parameter that holds each of the properties the XML code will use for the Execute
request. In this case, the Execute call specifies in the PropertyList parameter DataSourceInfo, Catalog,
Format, and AxisFormat. You can retrieve all this information in a call to Discover like the one that
Listing 1 shows. Finally, you close the body and envelope, and the request is ready to send via HTTP
to the XMLA Provider.
Getting Results
When the XMLA Provider receives a request, it passes the request to the MDX query engine, which
parses and executes it. After obtaining the MDX results, the XMLA Provider packages them into a
SOAP reply and sends them back to the requesting client. An Execute response can be quite long
depending on the amount of data returned and the format used. To see the results of an Execute
query, load the sample application and run an MDX query. To load the sample application, simply
open it in Internet Explorer (IE). You can either copy the file to a virtual directory and open it over HTTP or double-click the file to open it in the browser. You’ll see all the XML that the query
returned in the sample Web application; Figure 2 shows part of the results.
Figure 2
Sample Application
The SOAP response from a call to an Execute method looks similar to the results from a call to
Discover. As Listing 2 shows, the calling code includes the usual SOAP Envelope and Body tags as
the top-level wrappers, then shows the MDX query packaged for transmission in XML. You have two
options for the format of an Execute request’s results: Rowset and MDDataSet (which appears as
Multidimensional in the listing). The Rowset format is a flattened tabular structure that contains rows
and columns along with the data elements. MDDataSet is a multidimensional format that contains
three sections: OLAPInfo, Axes, and CellData. You’ll see these three sections if you scroll through the
results of the sample application. The multidimensional format represents the multidimensional data
in a hierarchical format that’s more representative of the structure of the data than the flattened
tabular format. OLAPInfo defines the structure of the results. The first section of OLAPInfo, CubeInfo,
lists the cubes where the data originated. Next, AxesInfo has an AxisInfo element for each axis in the
data. Every AxisInfo element contains the hierarchies, members, and properties for that axis. AxisInfo
always contains the standard properties UName (Unique Name), Caption, LName (Level Name), and LNum (Level Number). In addition, AxisInfo might contain a default value specified for cell properties.
If the query results include many repeating values, these default values can dramatically reduce the
size of the returned data by returning only the data elements that are different from the default. Last,
the CellData section of a multidimensional format contains CellInfo standard and custom properties
for each cell the MDX query returns. The standard properties are Value, FmtValue (Format Value),
ForeColor, and BackColor. Optional properties depend on the MDX query you use to retrieve the
results.
Describing XMLA results in abstract terms is difficult because the exact data returned varies
depending on the query you use. The easiest way to understand OLAPInfo is to walk through an
example of the results from a specific query. Consider the following MDX query:
select
{[Product].children} on rows,
{[Store].children} on columns
from Sales
Running this query through the XMLA Provider by using the Execute method results in the
AxesInfo section that Figure 3 shows. The query returns columns (Axis0) and rows (Axis1). Each
axis contains only one hierarchy: The columns axis contains the Store hierarchy, and the rows axis
contains the Product hierarchy. After defining the dimensional axes, Figure 3 shows the slicer
dimension, which is an MDX dimension for filtering multidimensional data. Slicer dimensions appear
in the WHERE clause of an MDX query and display every hierarchy in the cube that doesn’t appear
in the dimensional axes. The repetition of this information is useful in XMLA because you can use the
information to show which other hierarchies are available in a given cube and write further queries
against those hierarchies.
Figure 3:
AxesInfo section resulting from the Execute call
<AxesInfo>
  <AxisInfo name="Axis0">
    <HierarchyInfo name="Store">
      <UName name="[Store].[MEMBER_UNIQUE_NAME]" />
      <Caption name="[Store].[MEMBER_CAPTION]" />
      <LName name="[Store].[LEVEL_UNIQUE_NAME]" />
      <LNum name="[Store].[LEVEL_NUMBER]" />
      <DisplayInfo name="[Store].[DISPLAY_INFO]" />
    </HierarchyInfo>
  </AxisInfo>
  <AxisInfo name="Axis1">
    <HierarchyInfo name="Product">
      <UName name="[Product].[MEMBER_UNIQUE_NAME]" />
      <Caption name="[Product].[MEMBER_CAPTION]" />
      <LName name="[Product].[LEVEL_UNIQUE_NAME]" />
      <LNum name="[Product].[LEVEL_NUMBER]" />
      <DisplayInfo name="[Product].[DISPLAY_INFO]" />
    </HierarchyInfo>
  </AxisInfo>
  <AxisInfo name="SlicerAxis">
    <HierarchyInfo name="Measures">
      ...
    </HierarchyInfo>
    <HierarchyInfo name="Time">
      ...
    </HierarchyInfo>
  </AxisInfo>
</AxesInfo>
As I noted earlier, the last part of the OLAPInfo section of a multidimensional format, CellInfo,
describes the properties the query will return for each cell in the result set. Because the query I use
in this example doesn’t specify any additional properties, the CellInfo section displays only the basic
Value and FmtValue information:
<OlapInfo>
  <!-- the AxesInfo goes here -->
  <CellInfo>
    <Value name="VALUE" />
    <FmtValue name="FORMATTED_VALUE" />
  </CellInfo>
</OlapInfo>
The next section of the results in MDDataSet format is Axes, which contains the data the query returns organized in either TupleFormat, as Figure 4 shows, or ClusterFormat.
Figure 4
Query data organized in TupleFormat
Let’s look at an example to see the differences between these two formats. Say you have three
country categories (Canada, Mexico, and USA) and three product categories (Drink, Food, and
Non-Consumable), which produce nine combinations of countries and products. Logically, you
have several options for representing this set in a written notation. First, you can simply list the
combinations:
{(Canada, Drink), (Canada, Food), (Canada, Non-Consumable),
(Mexico, Drink), (Mexico, Food), (Mexico, Non-Consumable),
(USA, Drink), (USA, Food), (USA, Non-Consumable)}
This is the kind of set representation that the TupleFormat uses. Each pair is a tuple, and each
tuple contains a member from each dimension you included in the results. So if you had three
dimensions in the query, the resulting tuple would have three members.
Alternatively, you can use a mathematical representation of the combinations of the two sets.
Using the concept of a Cartesian product, you can represent the set of data as:
{Canada, Mexico, USA} x {Drink, Food, Non-Consumable}
The Cartesian product operator (x) between the two sets represents the set of all possible
combinations of the two sets. The ClusterFormat uses this representation. And although this is a much
more compact representation, it requires more interpretation to understand and navigate.
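To make the difference concrete, here’s a rough sketch of how one tuple from the columns axis of the earlier query would appear in the Axes section when you request TupleFormat (the member names are shortened and illustrative):

<Axes>
  <Axis name="Axis0">
    <Tuples>
      <Tuple>
        <Member Hierarchy="Store">
          <UName>[Store].[All Stores].[Canada]</UName>
          <Caption>Canada</Caption>
          <LName>[Store].[Store Country]</LName>
          <LNum>1</LNum>
        </Member>
      </Tuple>
      <!-- one Tuple element follows for each remaining Store member -->
    </Tuples>
  </Axis>
</Axes>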
The last section in MDDataSet is CellData, which contains values for each cell the MDX query
returns. An ordinal number in a zero-based array refers to the cells. (To learn how to calculate
ordinal numbers, see the Web sidebar “Mapping the Tuple Ordinals” at InstantDoc ID 44007.) If a cell
isn’t present in the array, the default value from AxisInfo serves as the value for the cell. If no default
value is specified, the value is null.
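As a rough sketch of the shape of this section (the ordinals and values here are illustrative, not actual query results), CellData looks like this:

<CellData>
  <!-- cell values are illustrative placeholders -->
  <Cell CellOrdinal="0">
    <Value>1839</Value>
    <FmtValue>1,839</FmtValue>
  </Cell>
  <Cell CellOrdinal="1">
    <Value>2452</Value>
    <FmtValue>2,452</FmtValue>
  </Cell>
</CellData>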
A Convenient Marriage
This chapter has introduced XMLA as a Web services layer that uses SOAP to tap into OLAP data.
XMLA provides the basis for standards-based, Internet-ready analytic applications, which can be easily
deployed and shared across and among enterprises. By using the XML for Analysis SDK, you can use
XMLA today in SQL Server 2000 Analysis Services (or in other vendors’ platforms), and XMLA will be
a core part of the SQL Server 2005 Analysis Services platform. With its flexibility and broad support,
XMLA is an excellent tool for current or future analytic application projects.
Chapter 7:
Improving Analysis Services Query
Performance
By Herts Chen
Analysis Services is a high-performance, multidimensional query engine for processing analytical and
statistical queries, which a relational SQL engine doesn’t handle well. When such queries are simple
or have pre-aggregations, Analysis Services can make your job easier. But when queries become
complex, Analysis Services can bog down. For example, an SQL SELECT statement that includes a
GROUP BY clause and aggregate functions can take as long as a few minutes—or more. You can
retrieve the same result set in just a few seconds if you execute an MDX statement against an
Analysis Services Multidimensional OLAP (MOLAP) cube. You perform this workaround by passing
an MDX query from SQL Server to a linked Analysis Server by using the OPENQUERY function in
an SQL SELECT statement, as SQL Server Books Online (BOL) describes. Analysis Services then
precalculates the necessary aggregations during the processing and creation of the MOLAP cube so
that the results are completely or partially available before a user asks for them.
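A minimal sketch of that workaround looks like the following T-SQL; the linked server name, catalog, and MDX statement are illustrative rather than the exact statements from my project:

-- One-time setup: register the Analysis Server as a linked server
EXEC sp_addlinkedserver
  @server = 'LINKED_OLAP',
  @srvproduct = '',
  @provider = 'MSOLAP',
  @datasrc = 'localhost',
  @catalog = 'Accident'

-- Pass an MDX query through to the linked Analysis Server from T-SQL
SELECT *
FROM OPENQUERY(LINKED_OLAP,
  'SELECT {[Measures].[Incident_Count]} ON COLUMNS,
          {[Street].[Street_List].Members} ON ROWS
   FROM [Default1]')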
However, precalculating every imaginable aggregation is impossible; even a completely processed
MOLAP cube can’t precalculate aggregations such as those in calculated cells, calculated members,
custom rollup formulas, custom member formulas, FILTER statements, and ORDER statements. If
you’re used to the performance you get when you retrieve only precalculated aggregations, the
performance you get from an MDX query that involves these kinds of runtime calculations might
seem unbearably slow. The problem might occur not because Analysis Services can’t handle runtime
calculations efficiently but because your MOLAP cube’s design isn’t optimal.
In my work building and maintaining a data warehouse for the city of Portland, Oregon, I
optimize Analysis Services so that traffic engineers can quickly access a variety of statistics about
traffic accidents in the city. Through many experiments, I’ve discovered that an important key to
MOLAP optimization is cube partitioning. In this chapter, I explore and compare various MOLAP
cube-partitioning strategies and their effects on query performance. Then, I suggest some simple
guidelines for partition design.
Traffic-Accident Data Warehouse
My study of query performance is based on my work with a real dimensional data warehouse that
maintains traffic-accident history. When I conducted this study, the traffic-accident data warehouse
contained 17 years of data (1985 through 2001) and documented about 250,000 unique incidents. The
complex part of this data warehouse is not its relatively small fact table but its many dimensions,
which the snowflake schema in Figure 1 shows.
Figure 1
The data warehouse’s many dimensions
Portland’s traffic engineers look for the street intersections that have the highest number of
incidents. Then, they search for clues about which factors might cause a high number of crashes and
what makes some accidents more severe than others. They look at a total of 14 factors (which are
the data warehouse’s dimensions) including time, light, weather, traffic control, vehicle, and severity
of occupant injuries. Among the dimensions, the Streets dimension (STREET_DIM) is the largest; it
records roughly 30,000 street intersections in the Portland area. The total number of source records to
build aggregations on is the result of a multi-way join of 14 one-to-many (1:M) or many-to-many
(M:N) relationships from the fact table to the dimension tables. The Accident data warehouse contains
only one measure: the distinct accident count (Incident_Count). A distinct count prevents the
possibility of counting the same accident multiple times in a M:N relationship.
Fortunately, the Streets dimension isn’t too large to use MOLAP cube storage, which provides the
best query performance. Analysis Services defines a huge dimension as one that contains more than
approximately 10 million members. Analysis Services supports huge dimensions only with Hybrid
OLAP (HOLAP) or Relational OLAP (ROLAP) cubes.
Queries and Bottlenecks
Analysis Services responds to queries with varying performance, depending on the complexity of the
query. For example, a MOLAP cube that you create by accepting the default partition in Analysis
Manager would respond to a simple query like the one that Listing 1 shows by returning roughly
2000 records in only 5 seconds.
Listing 1: Simple Query That Doesn’t Use Calculated Members
SELECT { [Occupant_Severity].[All Occupant_Severity] } ON COLUMNS ,
{ORDER( { FILTER( [Street].[Street_List].Members,
[Occupant_Severity].[All Occupant_Severity] >= 20 ) }, [Occupant_Severity].[All
Occupant_Severity], BDESC ) } ON ROWS
FROM [Default1]
If your queries basically ask only for pre-aggregates in a few records or columns, any MOLAP
cube with any percentage of aggregation—even as little as 5 percent—will perform well. However,
for a query like the one that Listing 2 shows, which involves six calculated members, a 30 percent-aggregated, single-partition MOLAP cube would take 52 seconds to return just 331 street intersections.
These disparate results suggest that performance bottlenecks don’t depend on the size of the result
set or on the percentage of aggregation in the cube. In fact, in my experience, any aggregations
beyond 30 percent are a waste—you get no better performance for your effort. For simple queries,
you don’t need high aggregation. For complex queries, high aggregation won’t help. Performance
bottlenecks in Analysis Services typically come from calculated members that scan for multiple tuples
and aggregate them on the fly.
Listing 2: Complex Query Containing 6 Calculated Members
-- Returns 7 columns in 331 records.
WITH MEMBER Time.[Accident_Count] AS 'Sum({Time.[1998], Time.[1999], Time.[2000]},
  [Occupant_Severity].[All Occupant_Severity])'
MEMBER Time.[Fatal] AS 'Sum({Time.[1998], Time.[1999], Time.[2000]},
  [Occupant_Severity].&[Fatal])'
MEMBER Time.[Injury_A] AS 'Sum({Time.[1998], Time.[1999], Time.[2000]},
  [Occupant_Severity].&[Injury A, Incapacitating])'
MEMBER Time.[Injury_B] AS 'Sum({Time.[1998], Time.[1999], Time.[2000]},
  [Occupant_Severity].&[Injury B, Non-Incapacitating])'
MEMBER Time.[Injury_C] AS 'Sum({Time.[1998], Time.[1999], Time.[2000]},
  [Occupant_Severity].&[Injury C, Possible Injury])'
MEMBER Time.[PDO] AS 'Sum({Time.[1998], Time.[1999], Time.[2000]},
  [Occupant_Severity].[PDO])'
SELECT {Time.[Accident_Count], Time.[Fatal], Time.[Injury_A], Time.[Injury_B],
  Time.[Injury_C], Time.[PDO]} ON COLUMNS,
{ORDER( { FILTER([Street].[Street_List].Members,
  (Time.[Accident_Count]) >= 20 ) }, (Time.[Accident_Count]), BDESC )} ON ROWS
FROM [Default1]
The City of Portland traffic engineers I work with typically ask ad hoc questions that are
nonhierarchical along the Time dimension. For example, an engineer might ask me to calculate the
total number of accidents during the past 3 years, the past 5 years, or any combination of years
between 1985 and 2001. I can’t simply aggregate years by creating a new level above the Year level
in the Time dimension; the new level would satisfy only one combination of years. This limitation
means all queries that involve a combination of years have to use calculated members to perform
aggregations for the specified years.
Listing 2’s query returns accident counts along the Time, Occupant_Severity, and Streets
dimension members. Figure 2 shows the members of the Time and Occupant_Severity dimensions.
Figure 2
Members of the Time and Occupant_Severity dimensions
Listing 2’s query uses six calculated members—Accident_Count, Fatal, Injury_A, Injury_B,
Injury_C, and PDO (Property Damage Only)—to sum the accidents in the years 1998, 1999, and 2000
for each of the five members of the Occupant_ Severity dimension. The query asks for a sorted and
filtered result set of accident counts for each street intersection ([Street].[Street_List]) in each of these
six calculated members. To contrast with the performance of such on-the-fly aggregation, I’ve
included Listing 3, which accesses only pre-aggregations and doesn’t include calculated members. I
used Listing 2 and Listing 3 as the benchmarks for my cube partitioning tests, which I discuss in a
moment.
Listing 3: Query That Returns the Same Columns as Listing 2 Without Using Calculated
Members
-- Returns 7 columns in 2239 records.
SELECT { [Occupant_Severity].[All Occupant_Severity],[Occupant_Severity].&[Fatal],
[Occupant_Severity].&[Injury A, Incapacitating],
[Occupant_Severity].&[Injury B, Non-Incapacitating],
[Occupant_Severity].&[Injury C, Possible Injury],
[Occupant_Severity].&[PDO]} ON COLUMNS,
{ORDER( { FILTER( [Street].[Street_List].Members,
[Occupant_Severity].[All Occupant_Severity] >= 20 ) }, [Occupant_Severity].[All
Occupant_Severity], BDESC ) } ON ROWS
FROM [Default1]
When you need to improve the performance of queries that involve calculated members, cube
design is important. In my experience, the most important aspect of cube design isn’t how much
memory you have, the number of disks or CPU threads you have, whether you use unique integers
for member keys, or even whether you use the Usage-Based Optimization Wizard, but how you
partition the cube.
Partitioning is slicing a cube along a tuple such as ([Occupant_Severity].[Fatal], [Time].[2000]),
which specifies a member from each dimension. For any dimension that you don’t specify in this
tuple, the partition includes the entire dimension. Analysis Services keeps in the cube structure a
direct pointer or index to the partition for that tuple. Whenever a query references that tuple or a
subset of it, Analysis Services can get to the corresponding partition without scanning the entire cube.
You can partition a cube in a nearly infinite number of ways, and Analysis Services supports as many
partitions as you practically need for a cube. But without a clear rule for creating partitions, you
could create too many cuts or wrong cuts on a cube and end up with worse performance than you’d
get with one default partition.
Usage-Based Partitioning
You can partition a cube along any tuple of members at any level from any dimension. Analysis
Services’ Partition Wizard calls such a tuple a data slice. Although Analysis Services can scan a small
partition faster than a large one, a small partition contains fewer members. Analysis Services might
have to perform more scans of multiple small partitions to cover the same number of members that
one larger partition could contain. So the overhead of performing calculations on the results of
multiple partitions might negate the advantage of the faster scan in each smaller partition.
How you partition a cube depends on the queries you need to process. Logically, you might
decide to partition along every tuple that a query specifies. For example, to improve Listing 2’s
performance, you might be tempted to partition along each tuple of the cross join of {Time.[1998],
Time.[1999], Time.[2000]}, [Occupant_Severity].Members (6 members), and [Street].[Street_List].Members
(roughly 30,000 members). You’d create partitions along a total of 540,000 tuples (3 x 6 x 30,000 =
540,000). This seemingly simple plan creates two problems. First, scanning 540,000 partitions and
summing the 3 years for each tuple of severity and street (a total of 180,000 tuples) would create
significant performance overhead. Second, the amount of work and time to create and process
540,000 partitions, manually or programmatically by using Decision Support Objects (DSO), is
astronomical.
The excessive performance overhead you create when you partition along every tuple in a query
is a serious concern for a couple of reasons. First, the query in Listing 2 isn’t supposed to return each
year individually. Instead, the query should return only the sum of incidents in 3 years. An efficient
partition would include the three specified years so that Analysis Services could calculate the sum
solely within the partition. Second, the query doesn’t need to access just one street intersection; it has
to scan all the street intersections regardless of the partitions you create. Being able to get to a
particular street partition directly doesn’t improve performance because you still have to walk through
every partition. You’d be better off keeping all the street intersections in the same partition. The
bottom line is that you should partition along a tuple only when the partition can save your query
from doing a sequential scan for that tuple.
Partition Testing
To see what kinds of partitions avoid a sequential scan, I devised tests that use Listing 2 and Listing 3
as benchmarks. In the rest of this chapter, I summarize the tests and some important results.
I created six cubes of identical structure with 30 percent aggregation and varying partition
designs. I wanted to partition along the Time and Occupant_Severity dimension members (which
Figure 2 shows) that the test queries in Listing 2 and Listing 3 are looking for so that they can get to
those members with no scan or a faster scan. Table 1 describes the partitioning data slices of these
six test cubes. I gave the cubes names that reflect their partitioning dimensions and total number of
partitions.
TABLE 1: Test Cubes and Their Partitioning Data Slices

Default1: Entire cube in one default partition
Severity6: Partition at each of the [Severity Header].Members tuples—for example, ([Fatal]), ([PDO])
PartitionYear2: Partition at each of the [Partition Year].Members tuples—for example, ([1]), ([2])
Year6: Partition at the [Partition Year].[1] tuple and each of the [Partition Year].[2].Children tuples—for example, ([1997]), ([1998])
PartitionYear_Severity7: Partition at the [Partition Year].[1] tuple and each of the CrossJoin({[Partition Year].[2]}, [Severity Header].Members) tuples—for example, ([2], [Fatal]), ([2], [PDO])
Year_Severity31: Partition at the [Partition Year].[1] tuple and each of the CrossJoin([Partition Year].[2].Children, [Severity Header].Members) tuples—for example, ([1997], [Fatal]), ([1997], [PDO])
To study the effect of the number and speed of CPUs, disk I/O, and physical memory on
partitioned cubes, I repeated the same tests on six different Dell servers. Table 2 shows the specifications for these servers, ranging from the highest end to the lowest end in hardware resources. High1,
High2, and High3 are high-end production-scale servers; Low1 and Low2 are desktops; and Low3 is a
laptop (which I used as a server for the sake of testing). Each test executes Listing 2 and Listing 3
from the client machine Low2 against every test cube on all six servers.
TABLE 2: Test Server Specifications

                             High1   High2   High3   Low1    Low2    Low3
CPU (MHz)                    549     549     499     1994    1694    1130
# of CPUs                    8       8       4       2       1       1
Average disk speed (MB/sec)  11      10      16      7       8       8
RAM (GB)                     4       4       1       4       0.5     0.5
All the tests measured the response times of Listing 2 and Listing 3. Figure 3 shows Listing 2’s
performance on all the servers.
Figure 3
Listing 2’s performance
I drew the following conclusions for Listing 2:
• High-end servers (with multiple low-speed CPUs) performed worse than low-end servers (with one high-speed CPU) regardless of cube partitioning. CPU speed—rather than the number of CPUs, disk speed, or amount of memory—drives performance.
• Effective partitioning makes the query perform 5 to 10 times faster than on the default partition, especially on slower CPUs.
• Queries that have calculated members, such as the one in Listing 2, are CPU-bound.
• Partitioning along queried data slices, as I did in the Year_Severity31 and PartitionYear_Severity7 test cubes, gives the best performance.
• Slicing along queried members (e.g., slicing along the six members of the Severity dimension and the three members of the Year level of the Time dimension) prevented sequential scans.
• Minimizing partition sizes by excluding members that you don’t query frequently (e.g., [Partition
Year].[1], which includes the years 1985 through 1996) doesn’t prevent a sequential scan but does
speed up the scan.
• Test results show that an aggregation level higher than 5 percent has no effect on performance,
which proves my hypothesis that high aggregation levels are a waste of effort.
Guidelines for Partitioning
Based on the results of my tests and the conclusions I’ve drawn, I offer these partition-design
guidelines. For all queries:
• Never overlap partitions.
• Never specify the [All] member as the partition data slice because doing so creates overlapping
partitions.
For queries like the one in Listing 3, which accesses only pre-aggregations:
• No partitioning is necessary because its effect is negligible or negative.
• Apply Analysis Services’ Usage-Based Optimization.
For queries like the one in Listing 2, which calculates many aggregations on the fly:
• Partition along queried data slices—for example, ([Partition Year].[2]
.[1997], [Fatal]).
• No Usage-Based Optimization is necessary because it has no effect.
• Five percent aggregation is the maximum aggregation level that provides performance
improvements.
If you have multiple slow queries that have different partitioning needs, consider creating
different cubes for each query. For desktop ad hoc users who can retrieve just one screen of results
at a time, using multiple cubes might be inconvenient. However, for custom applications (such as
Web and reporting applications) that need complete results, you have the full control of accessing
multiple cubes or even multiple databases behind the scenes for the best performance.
The term “tuning” implies that you’ll have to experiment to achieve the optimal performance for
your system. The techniques and guidelines that this chapter offers won’t necessarily create optimal
performance right away, but if you take the time to examine your query usage and identify the slow
queries, estimate which partitions might prevent sequential scans, and test those partitions, you’ll get
closer to the performance you want.
Chapter 8:
Reporting Services 101
By Rick Dobson
SQL Server 2000 Reporting Services, the SQL Server-based enterprise reporting solution Microsoft
released in January 2004, is positioned to become one of the most popular SQL Server components.
Nearly all organizations need to produce reports from their data, and with Reporting Services,
Microsoft filled this large hole in SQL Server’s toolkit. You can install Reporting Services on any SQL
Server 2000 computer at no additional cost, and you’ll be able to install it as part of SQL Server 2005.
In spite of the solution’s benefits and the excitement surrounding its initial release, many SQL
Server professionals have limited or no hands-on experience with Reporting Services. If you’re like
many database professionals, you might have put off using Reporting Services because of its relative
newness, the fact that it requires a separate SQL Server installation that works along with your
production SQL Server, or maybe its list of prerequisites. But Reporting Services isn’t so new any
more, and Microsoft has released Reporting Services Service Pack 1 (SP1), which fixes the bugs in the
initial release. In addition, Microsoft is integrating Reporting Services with SQL Server 2005, so
learning how to use Reporting Services now will give you a head start on SQL Server 2005. This
chapter gives you the basics for getting started with Reporting Services and includes SP1 examples
that you can reproduce in your test environment. I start by giving you the prerequisites for using
Reporting Services and explaining where to get it. Then, I walk you through the steps for authoring
two reports and for deploying those reports to the Report Server, Reporting Services’ main component. Finally, I teach you two ways to view deployed reports.
Installing Reporting Services
To properly install Reporting Services, your system needs four elements. First, you need Windows
Server 2003, Windows XP, or Windows 2000 with the most recent service packs installed. Second,
you need Microsoft IIS because Reporting Services runs as an XML Web service. Third, you need the
standard, enterprise, or developer edition of SQL Server 2000. (Reporting Services isn’t compatible
with earlier SQL Server releases.) Fourth, report designers need Visual Studio .NET 2003, which hosts
Reporting Services’ Report Designer component. (For administrators who don’t design reports,
Reporting Services provides a different UI that permits the creation of folders, data sources, and users
and the assignment of permissions to users.)
After you make sure your system meets the prerequisites, you can install Reporting Services, then
install SP1 to update the initial release. You can download a trial version of Reporting Services at the
URL in Related Reading.
Creating Your First Report
The only Report Designer Microsoft offers for authoring Reporting Services reports is in Visual Studio
.NET 2003. When you install Reporting Services, the installation process automatically updates Visual
Studio .NET by adding a new project type called Business Intelligence Projects. You don’t necessarily
need to have Visual Studio .NET installed on the same server as Reporting Services. As I explain in a
moment, you can reference a target-server URL for Reporting Services, which can be different from
the location of the workstation you use to run Visual Studio .NET. Within this project type are two
templates named Report Project Wizard and Report Project. Both templates let you perform the steps
to create a report: defining a report’s data source, specifying a report’s layout, previewing a report,
and deploying a finished report to the Report Server.
To create your first report, start a new Business Intelligence project in Visual Studio .NET, and
choose the Report Project Wizard template. Name your project SSMRSIntro. Read the wizard’s
welcome screen, then click Next to go to the Select the Data Source screen and specify the report’s
data source. Click Edit to open the familiar Data Link Properties dialog box that Figure 1 shows. On
the dialog box’s Provider tab, select Microsoft OLE DB Provider for SQL Server as the type of data you
want to connect to.
Figure 1
Data Link Properties dialog box
As Figure 1 shows, the dialog box’s Connections tab lets you specify on the local SQL Server
instance a Windows NT Integrated security-based connection to the Northwind database. Click Test
Connection, then click OK to return to the Select the Data Source screen, which now shows a
connection string that points to a data source named after the database. Note that unless you select
the Make this a shared data source check box at the bottom of the screen, the wizard embeds the
data source so that you can use it exclusively for this one report.
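Behind the scenes, that embedded data source is simply an OLE DB connection string. For the local server with Windows authentication, it will look something like this sketch (the exact text depends on your server name):

Provider=SQLOLEDB.1;Integrated Security=SSPI;Initial Catalog=Northwind;Data Source=(local)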
Clicking Next opens the wizard’s Design the Query screen. You can either type an SQL query
statement into the Query string text box or click Edit to open a graphical query designer that operates
like the query builder in Enterprise Manager. For this example, you can use the following query:
SELECT CompanyName, ContactName, Phone, Country
FROM Customers
WHERE (Country = 'Canada') OR
  (Country = 'Mexico') OR
  (Country = 'USA')
Then, click Next to open the Select the Report Type screen. The wizard offers two report types:
tabular and matrix. The matrix type is for a cross-tab report, which we won’t create in this chapter’s
examples. For this demonstration, select Tabular. Figure 2 shows the next wizard screen, Design the
Table, which lets you put the query fields where you want them in the report. Click Details to move
the field names from the Available fields list box to the Details list box. These selections cause the
fields to appear in a report’s Details section. You can optionally create additional groupings around
the Details section by adding fields to the Group list box. Clicking Next opens the Choose the Table
Style screen. You can accept the default selection of Bold or highlight one of the other report styles.
A preview window gives you a feel for how the different styles present your data.
Figure 2
Design the Table wizard screen
When you’re running the Report Wizard for the first time in a project, the Choose the Deployment
Location screen appears next. The wizard automatically populates the Report Server and Deployment
folder text boxes. Because the Report Server for this chapter’s examples runs from the local IIS Web
server, the Report Server text box shows the path http://localhost/ReportServer. During installation,
you specify the name of the Web server that hosts Reporting Services. By default, the wizard names
the deployment folder after the project’s name—in this case, SSMRSIntro.
The final wizard screen assigns a default name to the report and shows a summary of the
selections from the previous screens. The initial default report name in a project is Report1. When
you’re creating your own reports, you can change the default name to something more meaningful.
After you close the wizard, you’re in the Visual Studio .NET report-design environment. Each
report has three tabs: one to specify its data source, another for its layout, and a third to preview
how it displays data. Figure 3 shows part of the Preview tab for Report1, which shows how the
report will look after you deploy it. Report1 is for one specific data source, but Reporting Services lets
you use parameters to vary the output in a report.
Figure 3
Partial Preview tab for Report1
Creating a Drilldown Report
For your second report, let’s use a shared data source instead of an embedded one, as you did to
create Report1. A shared data source is useful because you can reuse it in multiple reports. Start by
right-clicking Shared Data Sources in the Solution Explorer, which you see in Figure 3’s right pane,
then choosing Add New Data Source to open a Data Link Properties dialog box like the one that
Figure 1 shows. Complete the dialog box to specify Northwind as the data source, as you did for
Report1. This process adds a new entry with the name Northwind.rds nested below Shared Data
Sources in the Solution Explorer.
Open the Report Wizard by right-clicking Reports in the Solution Explorer and choosing Add
New Report. In the Select the Data Source screen, the wizard automatically selects Northwind as the
database, referring to the Northwind.rds shared data source. If you had more than one shared data
source, you could open the Shared Data Source drop-down box and select another shared data
source.
For the second report, enter the same query that you used for Report1 and select a tabular report
style. In the Design the Table screen, add Country to the Group list box, and add CompanyName,
ContactName, and Phone to the Details list box. Because you selected an item for the Group list box,
a new screen called Choose the Table Layout appears before the Choose the Table Style screen. The
table layout screen includes a check box called Enable drilldown. (You must select the Stepped
button to make the Enable drilldown check box available.) Select Enable drilldown so that CompanyName, ContactName, and Phone column values will appear only after a user drills down to them by
expanding a Country column value. Click Finish, and accept Report2 as the second report’s name.
Figure 4 shows how Report2 looks in the Preview tab. Clicking the expand icon (+) next to a
country name drills down to the fields nested within the group value and changes the + to a -.
Notice that in Figure 4, you can view the CompanyName, ContactName, and Phone column value for
the customers in Mexico, but not for either of the other two countries. Clicking the expanders for
either of the other two countries will expose their hidden nested column values.
Figure 4
Viewing Report2 in the Preview tab
Deploying a Solution
In Reporting Services, deploying a solution is the process of publishing the reports, shared data
sources, and related file items from a Visual Studio .NET project to a folder on a Report Server.
Administrators can set permissions to restrict user access to reports and other solution items (e.g.,
shared data sources) on a Report Server.
When you right-click a project in the Solution Explorer and invoke the Build, Deploy Solution
command from a Visual Studio .NET project, you publish items from a solution to a folder on a
Report Server. The first time you run the Report Wizard, the folder’s name and the Report Server URL
appear on the Choose the Deployment Location screen. If the folder’s name doesn’t exist on a Report
Server when a report author invokes the Build, Deploy Solution command, Report Server creates a
new folder.
You can view and update the deployment folder and Report Server URL settings from a project’s
Property Pages. Right-click the project name in the Solution Explorer pane and choose Properties to
open a project’s Property Pages dialog box. The TargetFolder setting corresponds to the deployment
folder for a project, and the TargetServerURL setting contains the URL for the Report Server that hosts
a solution’s target folder. Figure 5 shows the Property Pages dialog box for the SSMRSIntro example
project. Alternatively, you can change a report’s deployment location by using the Reporting Services
Report Manager application after you publish the report.
Figure 5
Property Pages dialog box
Viewing Deployed Solution Items
After you deploy reports and related items from a project to a Report Server, you can view them in
one of two ways. First, you can use URL access to read the contents of reports with read-only
permissions. Second, you can invoke Report Server for a richer mix of capabilities, including
Reporting Services administration. Both approaches require a Windows account on the local Windows
server or a Windows account from another trusted Windows server. Administrators have unlimited
permissions, including assigning users to predefined and custom roles with permissions to perform
tasks, such as reading a report.
Connecting to Report Server through URL access. You can connect to a Report Server by
navigating to its URL address from any user account that has permission to connect to it. For
example, the IIS server hosting the Reporting Services Report Server in my office is called cab233a.
Other computers in my office can connect to the Report Server at the URL http://cab233a/ReportServer. A user who has an authorized user account can navigate a browser to this URL and view a
page showing links to folders on the Report Server. The link for the SSMRSIntro folder opens a Web
page containing links for the two example reports in this chapter and the shared data source. The
links are named after the item names in the SSMRSIntro project; the Report1 link opens Report1 in
the browser.
Figure 6 shows an excerpt from the URL-accessed view of Report1. Notice that the report appears
the same as it does in Figure 3, but the Address box shows a URL that contains a command to
render the report (rs:Command=Render). In addition, the Select a format drop-down box near the top
of the pane lets users save the report in a variety of useful formats. For example, selecting Acrobat
(PDF) file from the drop-down box lets users save a local copy of the report in PDF format for offline
use.
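URL access is just a matter of composing query-string parameters. For example, with the cab233a server and the SSMRSIntro folder used in this chapter, URLs like the following sketch render Report1 in the browser and as a PDF, respectively:

http://cab233a/ReportServer?/SSMRSIntro/Report1&rs:Command=Render
http://cab233a/ReportServer?/SSMRSIntro/Report1&rs:Command=Render&rs:Format=PDF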
Figure 6
Excerpt from URL-accessed view of Report1
Invoking Report Server. Users who have appropriate permissions can connect to the Report
Server by navigating to http://servername/reports. For this chapter’s examples, the server name is
cab233a. Figure 7 shows a connection to the cab233a Report Server and a folder list in the Home
folder. Clicking any folder (e.g., SSMRSIntro) in a Home folder reveals the clicked folder’s contents.
Users can use the Report Server folders to perform tasks according to the role assignments for their
Windows account and any Windows groups they belong to. An administrator has all possible
permissions. Report Server automatically adjusts its UI to expose permissions and items consistent
with the role of each user.
Figure 7
Invoking Report Server
Beyond the Basics
Reporting Services is Microsoft’s first entry into the enterprise reporting platform market. I like
Reporting Services because it’s easy to install and use. Reporting Services will be even more tightly
integrated in SQL Server 2005. Learning it now will help you later as you start learning SQL Server
2005. As you work with Reporting Services you’ll discover that its capabilities go far beyond what I
cover in this tutorial, but you can use the information in this chapter as a first step to expanding your
enterprise reporting capabilities.
Section II
BI Tips and Techniques
Improve Performance at the Aggregation Level
You can improve OLAP performance when you set a cube’s aggregation level. When you build a
cube, you set the aggregation level according to the desired speedup in processing queries. (Speedup
describes how much faster queries run with precreated aggregations than without aggregations.)
The system estimates the speedup based on the I/O amount that the system requires to respond to
queries. The total possible number of aggregations is the product of the number of members from
each dimension. For example, if a cube has two dimensions and each dimension has three members,
then the total possible number of aggregations is 9 (3 x 3). In a typical cube, the number of
aggregations possible is extremely large, so calculating all of them in advance isn’t desirable because
of the required storage space and the time it takes to create the aggregations. Imagine a cube with
four dimensions, each with 10,000 members. The total possible number of aggregations is 10^16 (10,000 to the fourth power).
When you tell SQL Server 7.0 OLAP Services to calculate aggregations for a 20 percent speedup,
OLAP Services picks key aggregations (which are distributed across the cube) to minimize the time
required to determine any other aggregations at query time.
—Russ Whitney
Using Children to Automatically Update Products
Let’s say you want to write an MDX query that shows sales for all hot beverage products for each
month of the year. That task sounds simple enough, but what if you add and remove products from
your product list each month? How would you write the query so you don’t have to update it every
time you update your list of products? Here’s a trick to help: Use the descendants or children
function. The example query that Listing 1 shows uses both of these functions. Try running Listing 1’s
query in the MDX Sample program. The descendants and children functions are powerful.
—Brian Moran and Russ Whitney
Listing 1: Code That Uses the Descendants and Children Functions
SELECT Descendants([Time].[1998],[Time].[Month]) ON COLUMNS,
[Product].[AllProducts].[Drink].[Beverages].
[Hot Beverages].Children ON ROWS
FROM Warehouse
Saving DTS Information to a Repository
To save Data Transformation Services (DTS) information into the Microsoft Repository, choose SQL
Server Repository as the location for saving the package. Then, use the Advanced tab on the Package
Properties to set the scanning options, which Figure 1 shows. Doing so causes DTS to call the OLE
DB scanner to load all source and target catalogs into the Repository. If you don’t set the scanning
options, DTS creates DTS Local Catalogs as the reference for all source and target catalogs, which can
make locating the databases impossible. Each subsequent save replicates this reference, so you can’t
keep comments and other descriptive information updated.
Figure 1
Package Properties Advanced tab
You can run into problems when you try to save certain DTS transformations to a repository. If
you use a script to perform a simple transformation and you choose the source columns explicitly
(not from a query), all the transformation data is captured, as you can see in the transformation
model in “The Open Information Model,” March 2000, InstantDoc ID 8060. If you choose a query as
the transformation source, that source becomes objects that aren’t part of the OLE DB imported data.
This choice makes following the connection back to the true source objects difficult. Also, the query
isn’t parsed to create a connection between the query columns and the columns you select the data
from. So in many cases, the connection between source and target is available, but in some, it isn’t.
You can solve these problems by writing a program to resolve the references in a repository or by
using a custom model along with the DTS model to store the source target mappings.
—Patrick Cross and Saeed Rahimi
Intelligent Business
I knew nothing about business intelligence (BI) until I sat through a session about a new feature
tentatively called the d-cube (for data cube) during the developer’s conference several years ago for
the beta version of SQL Server 7.0 (code-named Sphinx). The d-cube feature appeared in SQL Server
7.0 as OLAP Services, which evolved into Analysis Services in SQL Server 2000. At the time, I was
sure that OLAP Services would immediately revolutionize the database world. In a nutshell,
Microsoft’s BI tools are all about letting the right people ask the right questions at the right time, then
applying the answers to achieve competitive advantage. You’d think everyone would be using OLAP
by now, but most organizations haven’t yet applied modern OLAP techniques to their decision
making. In fact, many still have no idea what OLAP is. The adoption of BI as a mainstream approach
to problem solving has been much slower than I originally anticipated. However, I believe that the
adoption rate is beginning to pick up and that more companies will embrace BI for competitive
advantage. After all, who doesn’t want to make better decisions?
I firmly believe that Analysis Services is an opportunity-packed specialty for SQL Server
professionals, and I’m putting my money where my mouth is. I’m not going to let my core skills in
SQL Server development rust away, but I do plan to spend most of my R&D time this year focusing
on becoming a hard-core Analysis Services expert. Implementing successful OLAP solutions can have
a tremendous impact on your client’s bottom line, which is fulfilling for a database professional. But
most important, I think the demand for skilled Analysis Services engineers will far exceed the supply,
which is great for my wallet.
I’ve found that learning the basics of Analysis Services is relatively simple. The hardest tasks
to master are modeling data multidimensionally (you’ll need to forget many of the database-normalization lessons you’ve learned over the years) and using MDX to query the data (MDX is a rich
query language, but it’s much harder to learn and master than SQL).
You’ll need to start somewhere if you’re intent on becoming an Analysis Services pro. I suggest
you start by attempting to master MDX. As the market for Analysis Services experts grows, the
demand for your skills is sure to follow.
—Brian Moran
Techniques for Creating Custom Aggregations
Custom rollup techniques can solve a variety of problems, but a couple of alternative techniques also
let you create custom aggregations. For example, if you need to define an algorithm for aggregating
one or more measures across all dimensions but the basic Analysis Services aggregation types won’t
do, you can use either a calculated member in the measure’s dimension or a calculated cell formula
that you limit to one measure. Both of these techniques are powerful because you use MDX
formulas, which are flexible and extensive, to define them. Calculated cells are possibly the most
powerful custom aggregation tool because they control the way existing (noncalculated) dimension
members are evaluated and you can limit their effects to almost any subset of a cube.
—Russ Whitney
Using Loaded Measures to Customize Aggregations
A common technique for customizing the way you aggregate a measure is to define a calculated
measure based on a loaded measure, then hide the loaded measure. For example, you might
aggregate a Sales measure as a sum, but in two dimensions, you want to aggregate the measure as
an average. In the measure definition, you can specify that a measure be named TempSales and be
loaded directly from the Sales column in the fact table. You can mark this measure as hidden so that
it’s invisible to OLAP client applications; then, you can use TempSales in a calculation without
TempSales being available to your analysis users. You can then use Analysis Manager to create a new
calculated measure named Sales that will return the unmodified TempSales value except when you
want the value to be an average of TempSales.
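As a quick sketch of the idea, the following query-scoped calculated member (using an invented rule that averages TempSales over a quarter’s children and passes TempSales through everywhere else) shows the pattern; the cube and dimension names are illustrative:

WITH MEMBER [Measures].[Sales] AS
  'IIF([Time].CurrentMember.Level IS [Time].[Quarter],
       Avg([Time].CurrentMember.Children, [Measures].[TempSales]),
       [Measures].[TempSales])'
SELECT {[Measures].[Sales]} ON COLUMNS,
       [Time].[Quarter].Members ON ROWS
FROM Sales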
This technique of creating calculated measures and hiding loaded measures is common in SQL
Server 7.0 OLAP Services implementations because OLAP Services doesn’t support calculated cells or
custom rollup techniques. However, calculated measures in both SQL Server 2000 and 7.0 have
several drawbacks. For example, you can’t use calculated measures when writing back to cube cells.
One reason Analysis Services and OLAP Services won’t let you write back to a calculated measure is
that a calculated measure doesn’t map directly back to a column in the fact table, so Analysis Services
doesn’t know which loaded measure to modify.
Consequently, calculated measures aren’t particularly useful in budgeting or modeling applications.
Another drawback of calculated members is that you can’t use them with the MDX AGGREGATE
function. The AGGREGATE function is a common MDX function that you use to aggregate a set of
members or tuples. The measure you identify in the set determines the aggregation method that the
AGGREGATE function uses. If you use a calculated measure in the set, Analysis Services (and OLAP
Services) can’t determine the aggregation method, so the AGGREGATE function fails. If you use a
technique such as calculated cells to modify a measure’s aggregation, the AGGREGATE function
works because it is based on the measure’s defined aggregation method.
—Russ Whitney
Caution: Large Dimensions Ahead
Be very careful when dealing with dimensions. Look before you leap into Analysis Services’ very
large dimension feature, which places large dimensions into a separate memory space. This feature is
buggy, so avoid it. Also be careful with Relational OLAP (ROLAP) dimensions, which the server reads
into memory as needed at runtime. Because you can place a ROLAP dimension only into a ROLAP
cube, performance will suffer mightily. In theory, ROLAP mode supports larger dimensions, but it’s
non-functional in my experience.
—Tom Chester
Decoding MDX Secrets
I joked recently that I wished I knew some super-secret MDX command to help solve the problem
of creating virtual dimensions on the fly. Well, believe it or not, an MDX function that was
undocumented in the initial release of Analysis Services provides a great solution to this problem.
The MDX function is CreatePropertySet()—you use it to create a group of calculated members, one
for each member-property value. The query that Listing 2 shows, which creates a group of calculated
members that are children of the All Store Type member in the FoodMart Sales cube, is a simple
example of how to use this function. The query creates one calculated member for each unique
Member Card property value for the members of the Customers Name level. The query creates a new
set, CardTypes, with the new calculated members and displays it on the rows of the result. Figure 2
shows the query’s result set.
—Russ Whitney
Listing 2: Query That Creates a Group of All Store Type Children
WITH SET CardTypes AS
  'CreatePropertySet( [Store Type].[All Store Type],
     [Customers].[Name].Members,
     [Customers].CurrentMember.Properties("Member Card") )'
SELECT {[Unit Sales]} ON COLUMNS,
       CardTypes ON ROWS
FROM Sales
Figure 2
The results generated by the query in Listing 2
Improve Cube Processing by Creating a Time Dimension Table
Some people create a view from the fact table by using the syntax
SELECT [Fact_Table].[Date]
FROM [Fact_Table]
GROUP BY [Fact_Table].[Date]
and use the view as a source for the Time dimension. This method has a couple of drawbacks. First,
it’s inefficient: The fact table is usually much bigger than the dimension table, and accessing a view of
the fact table is the same as accessing the underlying base table. Another disadvantage of using the
fact table as a source for the Time dimension is that the dimension won’t contain a date that had no
activity. Thus, this method can create gaps in the dimension sequence by skipping weekends,
holidays, and so on. If you actually want such gaps, you can still get them with the dedicated Time
dimension table described below simply by leaving the irrelevant dates out of that table.
A better way to create a Time dimension is to create a special Time dimension table in your data
warehouse to hold all relevant dates. Simply create the table in Microsoft Excel, then use Data
Transformation Services (DTS) to import the table into the data warehouse. This approach to creating
a Time dimension significantly improves dimension and cube processing because you don’t need to
query the fact table to get the Time dimension members. And if the table’s date field uses a date/time
data type (e.g., smalldatetime), Analysis Services’ and OLAP Services’ Dimension Wizard, which you
use to create dimensions, detects that the dimension could be a Time dimension and prompts you to
confirm its choice, as Figure 3 shows. After you confirm that the dimension is a Time dimension, the
Dimension Wizard helps you create the Time dimension’s levels (e.g., Year, Quarter, Month, Day), as
Figure 4 shows. You can also define the first day and month of the year; the default is January 1.
—Yoram Levin
Figure 3
Confirming a dimension type
Figure 4
Creating time dimension levels
Transforming Data with DTS
Data Transformation Services (DTS) is widely used as a SQL Server data-transfer tool, but in addition
to simple data transfer, DTS can apply transformations to the data you’re transferring. This ability to
transform data in flight makes DTS more versatile than most other
data-transfer tools. DTS’s transformations let it perform a variety of tasks that would otherwise require
custom programming. For example, by using DTS transformations, you can perform simple conversions such as converting a set of numeric codes into alphabetic codes. Or you can perform more
sophisticated jobs such as turning one row into multiple rows or validating and extracting data from
other database files as the transformation executes.
DTS transformations are row-by-row operations, and as such, they add overhead to the transfer
process. The amount of added overhead depends mainly on how much work the transformation
script must perform. Simple data conversion adds negligible overhead, while more involved
transformations that require accessing other database tables add significantly more overhead.
To add a custom transformation to a DTS package, click the Transform button on the Select
Source Tables and Views dialog box; you’ll see the Column Mappings, Transformations, and
Constraints dialog box open. Then, click the Transformations tab to display the Edit Script dialog
box, which contains a VBScript template that by default includes code that copies the source
columns to the destination columns. You can freely modify this template to create your own custom
transformations.
The code in Listing 3 shows how DTS converts the values in the column named CHGCOD from
a numeric code in the source database to an alpha code in the target database. You can see that the
code tests the CHGCOD column to see whether it’s equal to a 1, 2, or 3. If the code finds a 1, it
writes an A to the destination table. If the code finds a 2 or 3, it writes a B or C (respectively) to the
destination column. The code writes a D to the target column if it finds any other value.
—Mike Otey
Listing 3: Code That Shows DTS Conversion of Numeric Code to Alpha Code
Function Main()
    Dim nChgCod
    DTSDestination("CUSNUM") = DTSSource("CUSNUM")
    DTSDestination("LSTNAM") = DTSSource("LSTNAM")
    DTSDestination("INIT") = DTSSource("INIT")
    DTSDestination("STREET") = DTSSource("STREET")
    DTSDestination("CITY") = DTSSource("CITY")
    DTSDestination("STATE") = DTSSource("STATE")
    DTSDestination("ZIPCOD") = DTSSource("ZIPCOD")
    DTSDestination("CDTLMT") = DTSSource("CDTLMT")
    nChgCod = DTSSource("CHGCOD")
    If nChgCod = "1" Then
        DTSDestination("CHGCOD") = "A"
    ElseIf nChgCod = "2" Then
        DTSDestination("CHGCOD") = "B"
    ElseIf nChgCod = "3" Then
        DTSDestination("CHGCOD") = "C"
    Else
        DTSDestination("CHGCOD") = "D"
    End If
    DTSDestination("BALDUE") = DTSSource("BALDUE")
    DTSDestination("CDTDUE") = DTSSource("CDTDUE")
    Main = DTSTransformStat_OK
End Function
Supporting Disconnected Users
A common shortcoming of analytic applications is that they can’t support mobile or disconnected
users. Because analytic applications are complex, developers move more application functionality to
Web browser–based UIs, which use dynamic HTML (DHTML) and JScript to minimize the amount of
application code that workstations download. Unfortunately, disconnected workstations (e.g., laptops)
can’t run this limited code without a network connection. Because I’m one of those mobile users, I
appreciate applications that I can use whether or not I’m connected. The number of users like me is
growing; more workers in the enterprise are using laptops instead of desktop computers. Managers,
especially, rely on mobility, and they’re heavy consumers of analytic applications. To support
disconnected users, developers need to enable users to take part or all of an application with them.
I don’t have a solution that will make a fancy DHTML Web application run well on a
disconnected laptop. But I can tell you about a new feature in SQL Server 2000 Analysis Services that
makes supporting disconnected users easier: local-cube Data Definition Language (DDL). This DDL provides a
simple way to create local-cube files in Analysis Services through MDX. These local-cube files let you
put part or all of the data from a server-based cube onto a laptop. You can then use the local-cube
file to perform the same analysis that you could if you were connected to the OLAP server on a
network. To create a local cube without this new MDX syntax, you must construct a verbose SQL-like
statement and pass it to ADO through a connection string.
Local-cube DDL is superior to the old connection-string method for three reasons. First, the
shortcuts in the DDL syntax make using it simpler than repeating all the details of the server-based
cube to create a local cube with the same dimension structures. Second, most OLAP applications
don’t give users the ability to customize the connection string to the required degree, so developers
had to create custom applications to provide the CREATE CUBE functionality. Third, a variation of the
new DDL can create session-scoped temporary cubes.
—Russ Whitney
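For illustration, here is a rough sketch of the local-cube DDL; the exact syntax is documented in SQL Server 2000 Books Online, and the file path and the particular measures and dimensions chosen here are only examples:
CREATE GLOBAL CUBE [SalesLocal]
STORAGE 'C:\SalesLocal.cub'
FROM [Sales]
(
  MEASURE [Sales].[Unit Sales],
  MEASURE [Sales].[Store Sales],
  DIMENSION [Sales].[Time],
  DIMENSION [Sales].[Product],
  DIMENSION [Sales].[Customers]
)
You send this statement over the same OLE DB for OLAP connection you use for queries, and the resulting .cub file can then be opened on a disconnected laptop.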
Dependency Risk Analysis
Many businesses use a type of analysis called dependency risk analysis. This type of analysis
determines whether one group of items in your business (e.g., products) is overly dependent on just
one item of another group (e.g., customers). Retailers describe an overly dependent item as at risk.
For example, you might want to find out which products depend most on a single customer. To
answer this question, you need to find what percentage of total store sales for each product comes
from a single customer. To test yourself, find the top 10 highest risk products, and show the
percentage and amount of the product’s sales that are at risk.
Listing 4 shows a query that defines two new measures. The first measure, AmountAtRisk, calculates the
Store Sales for the selected product that come from that product’s single largest customer. The second
measure, PercentAtRisk, calculates the percentage of the product’s total sales that this amount represents.
The MDX query in Listing 4 uses the PercentAtRisk measure to find the 10 products with the highest
percentage of Store Sales at risk. The query then displays both the amount at risk and percentage at
risk for each of the top 10 products.
—Russ Whitney
Listing 4: Query That Defines Two New Measures
WITH
MEMBER [Measures].[AmountAtRisk] AS
  'SUM( TOPCOUNT( [Customers].[Name].MEMBERS, 1, [Store Sales] ),
        [Store Sales] )'
MEMBER [Measures].[PercentAtRisk] AS
  '[AmountAtRisk] / ([Store Sales], [Customers].[All Customers])',
  FORMAT_STRING = '#.00%'
SELECT { [AmountAtRisk], [PercentAtRisk] } ON COLUMNS,
       TOPCOUNT( [Product].[Product Name].MEMBERS, 10, [PercentAtRisk] ) ON ROWS
FROM Sales
Choosing the Right Client for the Task
The lessons our development team learned from building a Web-based analysis system can provide a
valuable guide for deploying OLAP solutions in an enterprise. Microsoft Excel provides a capable,
familiar client that you can deploy on a LAN, but it requires real-time connectivity to the OLAP server.
Office Web Components (OWC) works well for deploying an Analysis Services client in an intranet
because you can easily control the client platform and open ports securely in an intranet. The Analysis Services Thin Web Client Browser provides a good Internet solution when firewalls are in place
and you want minimal impact on the user OS. For any development project, you need to understand
the business requirements and the needs of the people who will use the products you develop. By outlining
those requirements and weighing all the options, you can find the solution that best satisfies your users.
—Mark Scott and John Lynn
Using Access as a Data Source
To analyze a relational data source, you need to first publish it as a data source in the Windows 2000
or Windows NT environment by establishing a data source name (DSN). To set up Microsoft Access
as a data source, start by accessing the Data Sources (ODBC) settings in Windows NT through the
Start menu under Settings, Control Panel. In Windows 2000, choose Start, Settings, Administrative
Tools. Double-click Data Sources (ODBC) to open it, then select the System DSN tab. Click Add; in
the Create New Data Source window, select Microsoft Access Driver (*.mdb). Click Finish to display
the ODBC Microsoft Access Setup dialog box. Under Data Source Name, enter the name you choose
for your Access data source. In the Setup Wizard’s Database section, click Select. In the Select
Database dialog box, browse to find the database, select it, then click OK. To finish the source-data
selection sequence, click OK in the ODBC Microsoft Access Setup and the ODBC Data Source
Administrator dialog boxes.
—Frances Keeping
Calculating Utilization
One of the most common measurements of group or individual performance in a consulting agency
is utilization. Decision makers in consulting groups calculate utilization by dividing the total number
of hours billed by the total number of business hours available (e.g., 8 hours for each business day).
Having a high percentage of utilization is good because it means you’re effectively deploying
available resources to generate revenue. You can use the cube structure that Figure 5 shows to create
the MDX for a measure that calculates utilization for a selected employee and time period. The query
in Listing 5 calculates utilization as a percentage of hours worked. The meat of the formula is in the
definition of the calculated measure, AvailableHours. AvailableHours multiplies the number of work
days in the selected time period by 8 hours. You get the total number of work days by eliminating
weekend days and holidays from the total number of calendar days. The Utilization measure then
divides the total hours actually worked by the available work hours. The result is a percentage that can
exceed 100 percent if the average number of hours worked is more than 8 hours per day.
—Russ Whitney
Figure 5
The cube structure used to create the utilization MDX
Listing 5: Query That Calculates Utilization as a Percentage of Hours Worked
WITH
MEMBER [Measures].[AvailableHours] AS
  'COUNT( FILTER( DESCENDANTS( [Time].[Project].CURRENTMEMBER, [Time].[Project].[Day] ),
     ([Time].[Project].CURRENTMEMBER.PROPERTIES("Weekend") = "0") AND
     ([Time].[Project].CURRENTMEMBER.PROPERTIES("Holiday") = "0") ) ) * 8'
MEMBER [Measures].[Utilization] AS
  '[Hours] / [AvailableHours]', FORMAT_STRING = '#.0%'
SELECT { [Hours], [Utilization], [AvailableHours] } ON COLUMNS,
       [Time].[Project].[All Time].[2002].CHILDREN ON ROWS
FROM Tracker
WHERE ( [Employee].[All Employee].[Admin].[rwhitney] )
Use Member Properties Judiciously
When you start up the OLAP server, the server loads every dimension—including member keys,
names, and member properties—into server memory. Because Analysis Services is limited to 3GB of
RAM, this is one of the primary bottlenecks for enterprise-scale deployments. For this reason, limit
member properties to the bare essentials, particularly when the level has lots of members.
—Tom Chester
Get Level Names Right from the Get-Go
When you first build a dimension, level names default to the same names as the column names in
the dimension table (except that Analysis Manager replaces special characters with spaces). This
means that you wind up with level names like Cust Code, or worse. Then, after the cube is
processed, you can’t change the level names without reprocessing the dimension, which in turn
requires that you reprocess the cube. Because it’s painful to rename levels after the cube is
processed, many cubes go into production with frighteningly cryptic level names. To compound
matters, MDX formulas are often written with dependencies on the unfriendly level names, adding
another hurdle to the level-rename task. Cubes are supposed to be easily usable right out of the box,
so avoid this pitfall by getting the level names right from the beginning. As soon as you build a
dimension, change the default level names to user-friendly names before placing the dimension into
the cube.
—Tom Chester
Aggregating a Selected Group of Members
Sometimes you need to aggregate a group of dimension members for one query. For example, suppose you want to return Unit Sales for the quarters of 1997 for each product family. The solution is
easy. But what if you want to run the same query for only the customers in California and Oregon,
leaving out Washington? This is a common problem with a simple solution. All you have to do is
create a calculated member that aggregates California and Oregon, and select that calculated member
in the WHERE clause, as Listing 6 shows.
Listing 6: Code That Creates a Calculated Member and Selects It in a WHERE Clause
WITH MEMBER [Customers].[All Customers].[USA].[CA-OR] AS
  'Aggregate({ [Customers].[All Customers].[USA].[CA],
               [Customers].[All Customers].[USA].[OR] })'
SELECT [Time].[1997].Children ON Columns,
       [Product].[Product Family].Members ON Rows
FROM [Sales]
WHERE ( [Customers].[All Customers].[USA].[CA-OR], [Unit Sales] )
The Aggregate function aggregates the set of members passed to it and uses the aggregation
method defined for the measure being evaluated. In this case, the Unit Sales measure is aggregated
with a Sum function that we defined in the OLAP Manager when we built the cube, so the new
dimension member [CA-OR] is the sum of [CA] and [OR].
This tip is useful, but be careful. Performance can suffer if you use aggregation heavily in the
WHERE clause. If you have a common alternative aggregation, you might be better off creating a
second hierarchy for your dimension.
—Brian Moran and Russ Whitney
Determining the Percentage of a Product’s Contribution
A common business problem is determining percentage of contribution to a group. For example, you
might want to know what percentage of the total revenue of a product line a particular product contributed, or what percentage of growth of sales in a country each region in that country contributed.
Here’s one way to solve this problem: For each measure and dimension combination you want to
analyze, create a new calculated member. For instance, if you want to analyze Store Sales as a percentage of its parent group in the Product dimension, create a member, as Listing 7 shows. Figure 6 shows the
result set from this query.
—Brian Moran and Russ Whitney
Listing 7: Code That Creates a Calculated Member
CREATE MEMBER [Sales].[Measures].[Store Sales Perc] AS
  '(Product.CurrentMember, [Store Sales])
   / (Product.CurrentMember.Parent, [Store Sales])'
-- The preceding statement lets you write the following simple MDX query:
SELECT { [Store Sales Perc] } ON COLUMNS, [Drink].Children ON ROWS
FROM Sales
Figure 6
The results generated by the query in Listing 7
Avoid Crippled Client Software
Can you imagine using a front-end tool for a relational database management system (RDBMS) that
doesn’t let you specify an SQL statement? Of course not. Yet somehow that’s what developers are
faced with in the OLAP space. Remarkably, many shrink-wrap query and reporting tools that work
with Analysis Services are crippled in a fundamental sense—they don’t let developers supply an MDX
SELECT statement. The problem is this: None of the commercial clients, even the most robust, come
close to exposing the full power of MDX. Maybe simple cube browsing is all your users require.
Nonetheless, to avoid painting yourself into a corner, choose a front-end tool that lets the developer
specify custom MDX SELECT statements.
There’s a catch to this advice, however. The client tools that don’t expose MDX tend not to be
tightly bound to Analysis Services—they provide connectivity to other data sources. However, I don’t
think it’s asking too much for these query- and reporting-tool vendors to expose an MDX SELECT
query string as a pass-through.
—Tom Chester
Setting OLAP Cube Aggregation Options
After you create an OLAP cube and choose the storage technique that’s optimal for your situation, the
OLAP server designs the aggregations and processes the cube. If you choose the Relational OLAP
(ROLAP) storage technique, the OLAP server will create the summary tables in the source database
after it processes the cube. Otherwise, aggregations are stored in OLAP server native format. You can
choose the degree of aggregation by considering the level of query optimization you want versus the
amount of disk space required. Figure 7 shows the Storage Design Wizard. For example, I chose
80 percent performance, which produced 124 aggregations and required 22.5MB of storage space for
Multidimensional OLAP (MOLAP) storage. The aggregations roll up, so if you choose low performance in favor of conserving disk space, the OLAP server query engine will satisfy queries by
summing existing aggregations.
—Bob Pfeiff, Tom Chester
Figure 7
Storage Design Wizard
Use Views as the Data Source
Always use views as the data source for dimension tables and fact tables. In addition to providing a
valuable abstraction layer between table and cube, views let you leverage your staff’s expertise with
relational database management systems (RDBMSs). When you use a view as a fact table, you can
manage incremental updates by altering the WHERE clause within the view instead of assigning the
WHERE clause to an OLAP partition. When you use a view to source a dimension, you can define
logic inside the view that otherwise would have to be defined in Analysis Services (e.g., formulated
member names, formulated member properties).
—Tom Chester
Enter Count Estimates
When you first build a dimension, Analysis Services stores the member count for each level as a
property of the level. This count is never updated unless you explicitly update it (manually or by
using the Tools, Count Dimension Members command). In addition, it’s typical for cubes to initially
be built against a subset of the data warehouse. In this case, the cube will likely go into production
with the count properties understated by an order of magnitude. Here’s the gotcha: The Storage
Design Wizard uses these counts in its algorithm when you’re designing aggregations. When the
counts are wrong, the Storage Design Wizard is less effective at creating an optimal set of aggregations. The solution is simple—when you build the dimension, manually enter estimated counts for
each level.
Using Dynamic Properties to Stabilize DTS
To cut down on coding and thereby minimize errors, Microsoft added the Dynamic Properties task to
Data Transformation Services (DTS) in SQL Server 2000. With the assistance of this task, you don’t
have to create bulky ActiveX Script tasks to dynamically set a DTS property, such as a username that
you use to establish a connection. This task lets you change the value of any nonidentifying property
that’s accessible through the DTS object model (e.g., non-name/ID properties of a step, connection,
task, package, or global variable). What once took 3 weeks to stabilize, you can now write and
stabilize in less than a day. Using the Dynamic Properties task gives you faster performance than
writing the same process with an ActiveX Script task because DTS doesn’t resolve the ActiveX Script
task until runtime.
—Brian Knight
Leave Snowflakes Alone
Analysis Services lets you source dimensions from either a normalized snowflake schema or a flattened star schema. Microsoft recommends flattening snowflake dimensions into stars for performance
reasons, a practice that most Analysis Services developers follow. However, unless the relational data
mart is consumed by something other than Analysis Services, this practice has few benefits and
considerable drawbacks. For these reasons, resist the urge to flatten.
A snowflake schema provides the benefits of a normalized design. With a star schema, managing
attributes for the repeating non-leaf members is awkward at best.
A snowflake gives you unique keys at each level. This lets you import data into a cube at any
level of granularity, a critical ability in financial-planning applications, for example.
Because dimension tables aren’t queried at runtime (except in the notoriously slow relational
OLAP—ROLAP—mode), snowflake dimensions have no impact on query performance. The only
downside to a snowflake dimension is that it (the dimension, not the cube) is slower to process than
a star because of the joins that are necessary.
However, the time it takes to process dimensions is a minor factor compared to the time
necessary for cube processing. Unless the dimension is huge and the time window in which
processing must occur is tight, snowflakes are the way to go.
—Tom Chester
Create Grouping Levels Manually
No dimension member can have more than 64,000 children, including the All member. This limit isn’t
as onerous as it sounds; usability is apt to nail you before the hard limit does. A member with even
10,000 children usually presents a usability problem—that’s a lot of rows to dump on a user drilling
down into the dimension.
Whether you’re fighting the limit or simply working to design your dimension so that it provides
bite-size drilldowns, the solution is to build deep, meaningful hierarchies. But when there’s no raw
material from which to build a meaningful hierarchy, you must resort to a grouping level, aka a
Rolodex level, such as the first letter of the last name for a customer dimension. Analysis Services has
a feature (create member groups in the Dimension Wizard) that can create a grouping level for you
automatically. Don’t use it! You won’t have control over the grouping boundaries. Instead, construct
the level manually. This entails adding a new level to the dimension, then modifying the Member
Name Column and Member Key Column properties. For instance, you might define the member key
column and member name column for the grouping level as follows:
LEFT("CustomerDimTable"."CustomerName", 1)
This expression bases the level on the first letter of the customer name, providing Rolodex-style
navigation. Bear in mind, however, that this is a SQL pass-through; the expression is passed to the
RDBMS, so the RDBMS dictates the syntax. That is, T-SQL has a LEFT() function, but another RDBMS
might not.
—Tom Chester
Understand the Role of MDX
Did you ever try to swim without getting wet? For all but the simplest of databases, that’s what it’s
like when you try to design an OLAP solution without using MDX. Because shrink-wrap client
software often negates the need to write MDX SELECT statements, many developers think they can
successfully avoid MDX. This is folly. Sure, not every project requires MDX SELECT statements;
commercial software is adequate for many situations. But MDX calculations should play an important
role in most Analysis Services solutions, even those that aren’t calculation-intensive on the surface.
Perhaps the most common example is a virtual cube that’s based on two or more source cubes.
Calculated members are usually required to “glue” the virtual cube together into a seamless whole.
Although the MDX isn’t necessarily complex, developers unaware of the role of MDX wind up
making costly mistakes. Either they avoid virtual cubes entirely, or they shift logic that’s easily
implemented in MDX into the extraction-transformation-load (ETL) process, where it’s more complicated to implement and far less flexible to change.
—Tom Chester
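As a small, hedged example of that glue (it assumes the FoodMart 2000 Warehouse and Sales virtual cube, which exposes measures from both source cubes), a single calculated member can relate a measure that lives in each source cube so users see one seamless cube:
CREATE MEMBER [Warehouse and Sales].[Measures].[Warehouse Cost Pct of Sales] AS
  '[Measures].[Warehouse Cost] / [Measures].[Store Sales]',
  FORMAT_STRING = '#.00%'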
Using NON EMPTY to Exclude Empty Cells
Many multidimensional cubes have empty cells, which occur because no data was loaded into the
cube for those members. For example, if you inspect the Sales cube in the FoodMart sample, you’ll see
that its creators didn’t load any data for 1998. You can use the NON EMPTY modifier to write an MDX
query that filters out the empty 1998 cells, as Listing 8 shows.
—Brian Moran and Russ Whitney
Listing 8: MDX Code That Uses NON EMPTY to Exclude the Empty 1998 Cells
SELECT NON EMPTY {[Time].[1997], [Time].[1998]} ON COLUMNS,
       [Promotion Media].[Media Type].Members ON ROWS
FROM Sales
Formatting Financial Reports
When creating a financial report such as an income statement, you need to display the subtotals
(which are parent members of a dimension) at the bottom of the statement—after the details (which
are children). You can use the MDX Hierarchize() function with the POST option to force parent
members to appear after their children. The following example shows the Hierarchize() function on
the FoodMart 2000 Sales cube:
WITH SET MySet AS '{CA, CA.Children, [OR], [OR].Children}'
SELECT Hierarchize(MySet, POST) ON Columns
FROM Sales
WHERE [Sales Count]
How can you change this query to sort the set MySet in ascending order while making sure the
parents appear after their children?
Thanks to Shahar Prish of Microsoft for providing the clever answer that Listing 9 shows. First, he
sorted the items in descending order while preserving peer groupings (i.e., keeping children of a
common parent together). Then, he used a Generate() function to reverse the order of the set. The
result maintains the peer groupings, keeps the items in ascending order, and places the parents after
the children. Notice that Shahar uses the AS keyword to name the sorted set MySetIterator. He also
uses the Count and Current properties on the named set.
—Russ Whitney
Listing 9: Making Parents Appear After Their Children
WITH SET MySet AS '{CA, CA.Children, [OR], [OR].Children}'
SELECT
  Generate( ORDER(MySet, ([Sales Count], [1998]), DESC) AS MySetIterator,
    { MySetIterator.Item( MySetIterator.Count -
        Rank(MySetIterator.Current.Item(0), MySetIterator) ) } ) ON 0
FROM Sales
WHERE [Sales Count]
Analyzing Store Revenue
Retail businesses sometimes evaluate their stores’ performance by analyzing revenue per square foot
or cost per square foot. Use the FoodMart 2000 Sales cube to determine what the total store square
footage is for each Store State (Store State is a level in the Store dimension). Note that what makes
this problem unique is that square footage is stored as a member property in the Store dimension.
Listing 10 shows a query that solves this problem. This query is interesting because it demonstrates how to aggregate (sum) a numeric member property. The query creates a new measure that
returns the value of the Store Sqft member property. If the selected store member is above the Store
Name level in the dimension, the query sums all store square footage values below the selected
member in the hierarchy to determine the square footage value. Because MDX treats member
properties as strings, the query uses the VBScript function VAL() to convert the member property
from a string to a number before summing all the store square footage values.
—Russ Whitney
Listing 10: Query That Returns the Value of the Store Sqft Member Property
WITH MEMBER [Measures].[StoreSqft] AS
  'IIF( [Store].CurrentMember.Level.Name = "Store Name",
        VAL([Store].CurrentMember.Properties("Store Sqft")),
        SUM( DESCENDANTS([Store].CurrentMember, [Store].[Store Name]),
             VAL([Store].CurrentMember.Properties("Store Sqft")) ) )'
SELECT
  {[Measures].[StoreSqft]} ON COLUMNS,
  FILTER( DESCENDANTS([Store].[All Stores], [Store].[Store State]),
          [Measures].[StoreSqft] > 0 ) ON ROWS
FROM Sales
Use Counts to Analyze Textual Information
You can analyze any database—with or without numeric information—by using counts. Count
measures can be easy or complex. In the FoodMart database, setting up the Sales Count measure is
simple. You just create a new measure based on the fact table’s primary key and set the measure’s
type to count. But let’s look at a more complex example. Say a table called fact contains line items
(entries) for invoices. An invoice can contain one or more entries. So you probably want a count
measure that counts invoices, not entries. To count invoices, you want to count only the groups of
entries that make up an invoice.
One way to solve this problem is to create a distinct count measure based on the fact table’s
invoice number column. This measure will give you the invoice count values you want, but distinct
count measures have two serious limitations. First, each cube can contain only one distinct count
measure. Second, Analysis Services doesn’t aggregate distinct counts through dimension levels as it
does other measures. The distinct count values are predetermined and stored at each cell with other
measures, so you can’t create new aggregations during an MDX query’s execution. In an MDX query
or a calculated member definition, using a function such as
Aggregate( {[USA], [Mexico]} )
won’t work with a distinct count measure selected; determining the result of the function would
require rescanning the fact table because the cube doesn’t contain enough information to determine
the function’s result. Analysis Services can’t rescan the source table, but even if it could, the process
would be prohibitively slow. The effect of this restriction is that distinct count measures don’t
generally work well with other calculated members or sets.
A second solution is to create an extra column in the source table to store an invoice count. Fill
one entry for each invoice with a value of 1; fill all other instances of the invoice count field with
values of 0. You can then create an Invoice Count measure that’s a sum of this invoice count column.
This solution works as long as you select in the cube a cell that encompasses a group of entries that
make up complete invoices. If your cell includes only some of the entries in an invoice, the invoice
count column might not include the entry that contains the 1 value and thus would produce a sum
of 0 instead of 1 for that invoice.
A third solution is to use a combination of the two approaches. Create an invoice distinct count
measure, an invoice sum count measure, and an invoice count calculated measure that helps you
determine which of the other two measures to use based on the cell that’s selected. The invoice
distinct count measure will return the correct answer when only some of the entries in an invoice are
selected, and the invoice sum count will work in all other situations. The invoice sum count also
gives you the benefit of working when custom combinations of members are selected. This invoice
count example shows that, in real-world situations, count measures can get complicated because the
count might depend on a distinct combination of a group of fact table columns.
—Russ Whitney
Consolidation Analysis
A common type of retail sales analysis is consolidation analysis. One example of consolidation analysis is customer consolidation analysis: If fewer customers are buying more products, your customers
are consolidating. If more customers are buying fewer products, your customers aren’t consolidating.
In the FoodMart 2000 Sales cube, you can use the Store Sales measure to determine the top 10 customers. Then, you can write an MDX query to determine whether the top 10 FoodMart customers are
consolidating throughout the four quarters of 1997. But first, you need to create an MDX query that
includes the four quarters of 1997 on the columns of the query’s result. Then, create two rows. The
first row should display the total number of store sales that the top 10 customers purchased. The
second row should display the percentage of total store sales that the top 10 customers purchased.
Listing 11 shows the code that produces this result.
Listing 11: Code That Displays Purchase Information for FoodMart’s Top 10 Customers
WITH SET [Top 10] AS
  'TOPCOUNT( [Customers].[Name].Members, 10,
             ([Customers].CURRENTMEMBER, [Store Sales]) )'
MEMBER [Measures].[Top 10 Amount] AS 'Sum([Top 10], [Store Sales])'
MEMBER [Measures].[Top 10 Percent] AS
  '[Top 10 Amount] / ([Customers].[All Customers], [Store Sales])',
  FORMAT_STRING = '#.00%'
SELECT [Time].[1997].CHILDREN ON COLUMNS,
       { [Top 10 Amount], [Top 10 Percent] } ON ROWS
FROM Sales
I made this query a little easier to read by first creating a set with the top 10 customers based on
Store Sales, then using this set in the other two calculated measure definitions. The first calculated
measure sums the store sales for the top 10 customers to determine the store sales that the top
customers are responsible for. Next, the Top 10 Percent measure determines what percentage of the
total store sales comes from the top 10 customers. The query then displays both the Top 10 Amount
and the Top 10 Percent for each quarter of 1997.
The query’s result shows that the top 10 customers are consolidating slightly. During first quarter
1997, the top 10 customers were responsible for 1.41 percent of all store sales; during fourth quarter
1997, that group accounted for 1.77 percent of store sales.
—Russ Whitney
Working with Analysis Services Programmatically
Analysis Services has three programmatic interfaces that you can use from your analysis application.
The first two interfaces are client-side interfaces: ADO MD and OLE DB for OLAP. Both of these
programming interfaces offer functionality for data exploration, metadata exploration, write-back
capabilities, and read-only analysis functions. Only the write-back capabilities affect the contents of
the cube that other users in the system share. If you want to make other types of changes to Analysis
Services data programmatically, you have to use the administrative programming interface Decision
Support Objects (DSO). DSO lets you create and alter cubes, dimensions, and calculated members
and use other functions that you can perform interactively through the Analysis Manager application.
—Russ Whitney
Filtering on Member Properties in SQL Server 7.0
Even if your OLAP client application doesn’t support Member Properties, you can still filter based on
their values by using the virtual dimensions feature of SQL Server 7.0 OLAP Services. Virtual dimensions expose Member Properties as another dimension in which the members of the dimension are
the individual values of the Member Property. After you’ve defined a Member Property in OLAP
Manager, you can use that property as the basis for a virtual dimension. For example, the Store Size
in SQFT dimension in the FoodMart sample database is a virtual dimension based on the Store
Sqft Member Property in the Store Name level of the Store dimension. By using OLAP Manager, you
can tell the difference between a real dimension and a virtual dimension by looking at the icon in the
cube editor. Figure 8 shows the three virtual dimensions based on Member Properties of the Store
Name member. Virtual dimension icons have a small calculator as part of the image. Virtual
dimensions include all the unique values of the underlying Member Property as dimension members,
and these members aggregate to an ALL member. Thus, virtual dimensions have only two hierarchical
levels. In the client application, the virtual dimensions show up as just another dimension and don’t
significantly increase the size of the cube. Unfortunately, in the current release of OLAP Services,
virtual dimensions are slow compared to real dimensions. Still, virtual dimensions are worth using
because they let you filter OLAP queries on Member Properties even when the client application
might not directly support that capability.
—Brian Moran and Russ Whitney
Figure 8
Three virtual dimensions based on Member Properties of the Store Name member
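For reference, filtering on such a virtual dimension from MDX looks just like filtering on any other dimension. This is only a sketch; the square-footage member shown ([20319]) is illustrative and might not match the values in your FoodMart installation:
SELECT {[Measures].[Unit Sales]} ON COLUMNS,
       NON EMPTY [Store].[Store State].Members ON ROWS
FROM Sales
WHERE ([Store Size in SQFT].[20319])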
Improving Query Performance
When members of our DBA team were preparing our data for graphing, we executed some
preliminary queries to pull data from the System Monitor-generated CounterData and CounterDetails
tables and received some interesting results. First, we found that pulling data from the default table
structures was slow. Then, we added a calculated field and index to CounterData and found that
queries performed significantly faster when CounterDateTime was an indexed datetime field rather
than a non-indexed char(24) field. (We appreciate the assistance the SQL Server Magazine technical
editors gave us in figuring this out.) But when we modified the structure of the CounterData table
with the appropriate indexes and calculated fields, System Monitor wouldn’t log the data at all,
although our queries performed somewhat better. It turns out that System Monitor tries to recreate the
tables when it finds structural changes in them. We also tried creating an INSTEAD OF trigger to
route the data entry into another table. However, when we did so, SQL Server bulk-loaded the data
and ignored triggers. We thought about modifying the tables, but you can’t expect assistance from
Microsoft if you change the system tables, so we recommend that you don’t alter them.
In the Microsoft Platform Software Development Kit (SDK) under the Performance Monitor
heading (at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/perfmon/base
/performance_data.asp), Microsoft describes the fields of the CounterData table as Table 1 shows.
Table 1: Microsoft Descriptions of CounterData Table Fields
GUID: GUID for this data set. Refer to the description of the DisplayToID table to correlate this value to a log file.
CounterID: The identifier of the counter being collected. This is the key to the CounterDetails table.
RecordIndex: The sample index for a specific counter identifier and collection GUID. Each successive sample in this log file is assigned a RecordIndex value, which increases by one throughout the time the data is logged.
CounterDateTime: The time the collection of the values of this counter was started, in UTC time.
CounterValue: The formatted value of the counter.
FirstValueA: The raw performance data value used to calculate the formatted data value. Some formatted counters require the calculation of four values in the following manner: (SecondValueA - FirstValueA) / (SecondValueB - FirstValueB). The remaining columns in this table provide the remaining values in this calculation.
FirstValueB: Refer to the description of FirstValueA.
SecondValueA: Refer to the description of FirstValueA.
SecondValueB: Refer to the description of FirstValueA.
However, the description of CounterDateTime is incorrect. If you investigate the System
Monitor tables CounterData and CounterDetails, you’ll find that the counter names are stored in
CounterDetails and the counter values are stored in CounterData, which is logged one row at a time
(one row per counter per sample interval). For example, if you logged 12 counters for 2 minutes,
CounterDetails would contain 12 records for the names of the counters, whereas CounterData would
contain 24 entries (12 for each minute the data was logged). One way to make pulling data from this
format more efficient and effective is to transform the data into a pivot-table format in which one
column exists for the date and time and additional columns exist for each counter whose data you
want to view. Interestingly, this is the same format that a System Monitor CSV file uses.
—Mark Solomon
Using SQL ALIAS for the AS/400
The AS/400 supports a file concept known as multiple-member files, in which one file (or table) can
possess several different members. Each member is a part of the same file or table and shares the same
schema, but the members are uniquely named and have unique data. ODBC and OLE DB have no
built-in mechanism for accessing multiple members. By default, ODBC always accesses the first member
in a multimember file. To enable ODBC-based applications such as Data Transformation Services (DTS)
to access multiple-member files, you need to use the AS/400’s SQL ALIAS statement. The ALIAS statement lets you create an alias for each member you need to access. Then, your ODBC application can
access the alias, which in turn connects to the appropriate member. These SQL aliases are persistent, so
you need to create them only once. The following statement shows how to create an alias:
CREATE ALIAS MYLIB.FILE1MBR2 FOR MYLIB.MYFILE(MBR2)
This statement creates a new alias named FILE1MBR2 for the multimember file MYFILE. The
ODBC or OLE DB application then connects to that specific member, using the alias name
FILE1MBR2 to access the second member in the file MYFILE.
—Michael Otey
Setting Up English Query
English Query works best with normalized databases; however, your circumstances might mandate
a structure that isn’t fully normalized. In this case, you can use views to solve problems that nonnormalized databases cause. The English Query domain editor doesn’t automatically import views.
To add a view as an entity, select Table from the Insert menu and enter the name of the view. The
English Query Help file provides examples of how to use views with non-normalized data.
Another tip is to define a primary key for each table in your English Query application. English
Query requires primary keys to perform joins between tables to satisfy user requests. If you haven’t
defined keys in your database, you need to define them in the domain editor. English Query can’t
build your application correctly without primary keys.
When you develop English Query applications, remember that case is significant. For example,
English Query knows that you’re asking about a proper noun because you capitalize the words in the
query. Finally, if you’re running Internet Information Server (IIS) 4.0 with the Windows Scripting Host
(WSH), the fastest way to build and deploy a test application is to run the setupasp.vbs macro from
C:\Program Files\Microsoft English Query\samples\asp2. This macro automatically installs and configures
your data, so you can start testing immediately.
—Ken Miller
When Do You Use Web Services?
Let’s say that your company uses a supply-chain application that stores your customers’ orders in a
SQL Server database and keeps track of each order’s status. Currently, when customers want to know
which of their orders are pending, they contact your customer-service representative, who queries the
database for that information. Customers then update their ordering systems. But suppose a customer
wants to streamline the process by using an application to request order status directly from your
system. To enable this type of access to your system, you and the customer need to agree on the
interface the customer will use to make the request and the format in which you will return the
requested data.
This scenario is an ideal application for Web services because you can use SOAP to build a
single standards-based interface that works for many different customers with varying needs,
regardless of the software applications and computing platform their enterprises use. Additionally,
SOAP lets you build a loosely coupled interface that incorporates XML as the data format. (A loosely
coupled application lets you reconfigure, redeploy, or relocate the implementation without affecting
dependent applications.) By using XML, you gain extensibility that lets you expand the scope of the
data you can provide to your customers in the future. Simply put, supplying a Web service lets you
leverage the full value of XML’s standard format, extensibility, and platform independence.
—Rich Rollman
The Security Connection
Here’s a summary of steps you can take to optimize SQL Server security and connectivity.
• Use Windows-only authentication with SQL Server.
• Use trusted connections instead of strings that pass SQL Server usernames and passwords.
• Put the connection objects in DLLs and put them in Microsoft Transaction Server (MTS).
• Set your code to use OLE DB instead of ODBC if you’re using ADO. With ADO, ODBC connections
go through the OLE DB Provider for ODBC Drivers, so by using the SQL Server OLE DB provider
directly, you improve performance by eliminating a processing layer.
• Use TCP/IP between your IIS and SQL Server machines, not the default Named Pipes, if IIS and
SQL Server are on separate servers. As Microsoft article “PRB: 80004005 ConnectionOpen
(CreateFile()) Error Accessing SQL” at
http://support.microsoft.com/support/kb/articles/q175/6/71.asp states, “When Named Pipes are
used to access the SQL Server, IIS tries to impersonate the authenticated user, but it does not
have the ability to prove its identity.”
• Put your connections and stored procedure calls into Visual Basic (VB) code DLLs, install them in
MTS (which will automatically pool connections for you), and create server objects in VBScript to
use the connections.
• Ask for the correct Microsoft department if you need help using ADO-based code to talk to SQL
Server. Microsoft Technical Support not only has IIS and SQL Server experts; it also has ADO-to-SQL Server experts.
—John D. Lambert
Section III
New BI Features in SQL Server 2005
Chapter 1: Building Better BI in SQL Server 2005
Since its inception, Microsoft’s SQL Server Business Intelligence (BI) team has been guided by the
overriding goal of making business data usable and accessible to as many people as possible. As the
team’s general manager, Bill Baker works with the people who design and develop BI tools such as
Integration Services (formerly Data Transformation Services—DTS), Analysis Services, and Reporting
Services. Baker recently talked with SQL Server Magazine about SQL Server 2005’s new BI tools and
how they work together to streamline delivery of business-critical information.
How are SQL Server 2005’s BI enhancements meeting Microsoft’s goals for serving the BI
community? And how long has your team been working on these enhancements?
Our goal since we started the SQL Server BI team has been to give as many people as possible in
every organization greater insight into their business and the market. We call this “BI for the Masses,”
and with every version of SQL Server and Microsoft Office, we take further steps to make BI available
to every person at every level of a company. For example, in the integration space, SQL Server 2005
Integration Services delivers far greater throughput and more data warehousing and BI functionality
out of the box. In addition, customers need to analyze far more data from many more sources and
with more immediacy than ever before. In the analysis space, our investments in the Unified
Dimensional Model (UDM) and proactive caching move SQL Server 2005 Analysis Services beyond
the niche-OLAP market and into the mainstream. Our new Report Builder in SQL Server 2005
Reporting Services opens up report authoring way beyond the Visual Studio audience we support
well today. Our vision is about getting the right information, in the right format, to the people who
need it—when they need it. Every BI investment we make supports that goal.
Initial planning for SQL Server 2005 started several years ago, but we are now starting to see the
fruits of our labor and are definitely in “ship mode.” Through our beta releases and Community
Technology Previews (CTPs)—advance previews into the upcoming beta—we are receiving incredible
customer feedback on our features and implementations.
What kind of feedback have you been getting from beta testers, and which features are they
most enthusiastic about?
Our customers tell us they really appreciate how comprehensive our BI solution is. Our solution not
only enables seamless integration of components, but it’s cost-effective, which is essential. We are
getting great feedback on the BI Development Studio—formerly called the BI Workbench—which
provides one development environment for Integration Services, Reporting Services, data mining, and
Analysis Services. Beta testers have also praised the integration of the BI engines into SQL Server
Management Studio—formerly called the SQL Server Workbench—which combines the functionality
of Enterprise Manager, Query Analyzer, and Analysis Manager into one integrated tool. Beta testers
also appreciated the overall ability SQL Server 2005 gives them to deploy and manage BI
applications.
According to news reports, Microsoft and some large customers have deployed SQL Server
2005 Beta 2 in production environments. What is your recommendation for deploying Beta 2
and running it in production? What caveats do you have for businesses eager to move to the
new version now?
We’re amazed at how many customers ask us to support their Beta 2 implementation in production.
Honestly, we don’t recommend it since there is no Service Level Agreement (SLA) for Beta 2, but that
has not stopped several customers. So far, they are having good experiences, but our recommendation is to get experience with the beta bits, start developing your applications, and plan to roll out
your applications with the final version of SQL Server 2005.
How compatible are SQL Server 2000’s BI tools (OLAP, DTS, data mining) and SQL Server
2005’s new BI tools? Because some of SQL Server 2005’s BI tools—such as Integration
Services—are completely rewritten, will they still work with SQL Server 2000 data and
packages?
This is an area where we need to be very, very clear with our customers because the choice to
upgrade or migrate varies depending on the situation. Our commitment is to be transparent about
what will upgrade automatically and what will require migration, and we have migration aids for any
objects that don’t come over automatically.
For example, we will continue to support SQL Server 2000 DTS packages running beside SQL
Server 2005 Integration Services. However, if you want to use some of the new SQL Server 2005
Integration Services features or performance, you will need to migrate your packages. We do not
automatically migrate DTS packages because they usually contain code, very often in script, and the
new SQL Server 2005 Integration Services has newer and better ways to do what that code used to
do. In some cases, the benefits of the new technology will be worth rewriting the packages.
SQL Server 2000 Analysis Services supports only clustering and decision-tree data-mining
algorithms. Does SQL Server 2005 add support for other algorithms?
Yes. The next version of SQL Server Analysis Services will include five new algorithms in the
extensible data-mining solution. We have a great partnership with Microsoft Research that lets us
cooperate on new data-mining algorithms, so we’ve identified the most popular requests and added
algorithms for association sets, neural nets, text mining, and other needs.
We also made enhancements to data mining, including a set of rich, graphical model editors and
viewers in the BI Development Studio. We added support for training and querying data-mining
models inside the extraction, transformation, and loading (ETL) pipeline. Developers will benefit from
easy integration of data mining into their applications, and analysts will receive finer-grain control
over their models. We’re excited about these enhancements because they address making data
mining and data quality operational.
Microsoft relies on an integrated technology stack—from OS to database to user interface.
How does that integration help Microsoft’s BI offerings better serve your customers’ needs?
Our belief in the Windows platform is long-standing and probably well understood by now. It’s
important to note that while we have an integrated offering from top to bottom, it is also an open
environment. This openness is critical for BI, where much of the opportunity for our customers is in
gaining additional insight and value from the operational systems they already have. All of our BI
platform components can read data from a huge variety of databases and applications, and they
provide Web services for embedding and integrating with other applications—even on other
platforms. We get strength from the integration and consistency of the elements we provide, but lose
nothing in terms of openness. Our customers benefit from the flexibility of our interoperability. By
using our integrated solution, customers also witness a reduction in training time, management staff,
and total cost of ownership. It’s a win-win situation.
SQL Server 2005 will be the first release in which database tools converge with Visual Studio
development tools. Can you tell us what it took to align these two releases and what benefits
customers will realize from the change?
Databases and applications used to be two separate worlds, but more and more, people are
recognizing the similarities between application development and database development. For
instance, what interesting business application doesn’t store and access data in a database? With
Visual Studio 2005 (codenamed Whidbey) and SQL Server 2005, we’ve taken the next step in melding
the database- and application-development experiences. We based our BI Development Studio on
Visual Studio, and all the Visual Studio features that support team and enterprise development,
including source-code control and deployment, also work for the data warehouse and BI developer.
We built a single environment where people can develop all of the components of a data
warehousing or BI application, including relational design, ETL, cubes, reports, data mining, and even
code if desired. There is no other end-to-end, professional-grade environment for BI.
The introduction of the UDM is said to blur the line between relational and multidimensional
database architectures. This approach is new for the Microsoft BI platform. What are the
most interesting features the UDM offers? And based on your experience, what features do
you think will surface as the most valuable for customers and ISVs?
Ultimately, OLAP is cool because it brings together navigation and query. Pivoting and drilling down
are really just queries. But the OLAP world has never been attribute-rich; OLAP engines have never
had good ways to express attributes, and adding something as simple as a phone number to a
dimension would have caused size and performance issues in earlier SQL Server releases. With the
UDM, we bridge the hierarchical drill-down world and the attribute-reporting world to present a
dimensional view of data without losing the rich attributes present in the data.
The UDM is also the place where we express business logic, since MDX calculated members and
cells are expressions of business logic. The UDM adds time intelligence, account intelligence, and key
performance indicators (KPIs). You might think KPIs are only calculations, but they are much more.
A SQL Server 2005 KPI includes the calculation, an expression for the goal, an expression for the
trend, and a means of visualizing the results. KPIs are first-class elements of the UDM.
What tools will Microsoft add to the Visual Studio 2005 IDE to help developers create and
manage SQL Server (and other database platforms’) users, groups, and permissions to better
insulate private data from those who shouldn’t have access?
The Visual Studio and SQL Server development teams work together on integration and new methods
of managing data. Our team supplies components to Visual Studio, and they supply components to
SQL Server. In SQL Server 2005, we’ve added Data Insulation features to the core SQL Server engine.
The end result is that developers using Visual Studio can easily create the database elements they
need for their application. For enterprise-management activities, we anticipate that people will use
SQL Server Management Studio.
In one of your past conference keynote addresses, you mentioned that Microsoft is adding a
new set of controls to Visual Studio 2005 to permit reporting without Reporting Services.
Could you describe what those controls will do, when we’ll see the controls appear in Visual
Studio 2005, and where you expect them to be documented?
The reporting controls will ship with Visual Studio 2005 and SQL Server 2005, and they will enable
programmers to embed reporting in their applications. We support both WinForms and WebForms.
Programmers will either provide Report Definition Language (RDL) and a data set to the reporting
control or point to an existing Reporting Services server. We think every application of any
sophistication can use at least a little reporting against data contained in the application. These
controls just make it easier.
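To make the idea concrete, here is a minimal sketch of what embedding one of these controls in a WinForms application might look like. It assumes the Microsoft.Reporting.WinForms ReportViewer control described above; the report path, dataset name, and server URL are hypothetical placeholders.

// Illustrative sketch only: embeds a locally processed report in a WinForms application.
// The report path, dataset name, and server URL are hypothetical.
using System.Data;
using System.Windows.Forms;
using Microsoft.Reporting.WinForms;

public class SalesReportForm : Form
{
    public SalesReportForm(DataTable salesData)
    {
        ReportViewer viewer = new ReportViewer();
        viewer.Dock = DockStyle.Fill;

        // Option 1: provide Report Definition Language (RDL) and a data set directly.
        viewer.ProcessingMode = ProcessingMode.Local;
        viewer.LocalReport.ReportPath = @"Reports\SalesSummary.rdlc";
        viewer.LocalReport.DataSources.Add(
            new ReportDataSource("SalesDataSet", salesData));

        // Option 2: point the control at an existing Reporting Services server instead.
        // viewer.ProcessingMode = ProcessingMode.Remote;
        // viewer.ServerReport.ReportServerUrl = new System.Uri("http://myserver/reportserver");
        // viewer.ServerReport.ReportPath = "/Sales/SalesSummary";

        Controls.Add(viewer);
        viewer.RefreshReport();
    }
}

The two configuration paths in the sketch correspond to the two usage patterns described in the answer: local processing of RDL plus a data set, or rendering against a Reporting Services server.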
What benefit does 64-bit bring to SQL Server BI, and do you think 64-bit can really help the
Microsoft BI platform scale to the levels that UNIX-based BI platforms scale to today?
In a word: memory. The 64-bit architecture lets customers break out of the 3GB memory limit that
they have with 32-bit SQL Server, which allows for far larger dimensions in OLAP. It also enables the
new ETL engine in SQL Server Integration Services to hold more data and process rows that much
faster. And yes, we absolutely think we will reach into the upper ranges of scale with 64-bit.
Who are some BI vendors you’re working closely with to develop 64-bit BI computing?
What’s important to recognize is not which vendors support 64-bit, but that SQL Server 2005 supports
both 32-bit and 64-bit on Intel and AMD platforms. Our customers and partners can start with
32-bit and easily move to 64-bit later or take existing 32-bit applications to 64-bit with near-total
transparency. This support means our customers and partners don’t have to worry about the
differences because they are quite small and well documented.
Did you leave out any BI features that you planned to add to SQL Server 2005 because of
deadlines or other issues?
We are confident that SQL Server 2005 will offer a comprehensive BI solution to address our
customers’ business problems. We’ve worked closely with our customers for several years to
determine their pain points and create BI tools that provide relief. We started delivering those tools
with SQL Server 7.0 and OLAP and continued with SQL Server 2000, DTS, and Reporting Services.
With SQL Server 2005, customers will have the complete package to integrate, analyze, and report
data. Even after all that, we still have a million ideas! We’re already dreaming of what we can do
beyond Yukon, so you can bet we’ll be charged up for the next round—right after we ship SQL
Server 2005. It’s too early to discuss specifics, but as always, we’ll work with our customers to
determine new features and technologies.
Your team puts a lot of long hours into your work on SQL Server BI. What drives you and
your BI developers to invest so much personally in the product?
Even when I started with the BI team 8 years ago, we said the same thing we say now: Companies
improve when more of their employees use BI. “BI for the Masses” is very motivating. Unlike some
of our competitors, our team is not working to provide a response to competitive offerings. The team
works hard for the purpose of improving our product to best meet our customers’ needs. It might
sound corny, but it truly is as much a journey as it is a destination. The personal investment across
the board is impressive and humbling, and I’m awed by the effort our team contributes every single
day. I hope it shows in our product.
Chapter 2:
UDM: The Best of Both Worlds
By Paul Sanders
The next release of Analysis Services, coming in SQL Server Yukon, will combine the best aspects of
traditional OLAP-based analysis and relational reporting into one dimensional model—the Unified
Dimensional Model (UDM)—that covers both sets of needs. Compared to direct access of a relational
database, OLAP technology provides many benefits to analytic users. OLAP’s dimensional data model
makes it easy to understand, navigate, and explore the data. And OLAP’s precalculation of aggregate
data enables fast response time to ad hoc queries, even over large data volumes. An analytic engine,
supporting the Multidimensional Expression (MDX) query language, lets you perform analytic calculations. And OLAP’s data model includes rich metadata that lets you employ user-friendly, business-oriented names, for example.
However, reporting directly from the underlying relational database still has its advantages. OLAP,
traditionally oriented around star or snowflake schemas, doesn’t handle the arbitrary, complex relationships that can exist between tables. Reporting on the underlying database lets you handle flexible
schemas. OLAP cubes also expose data in predetermined hierarchies, making it infeasible to provide
true ad hoc query capability over tables that have hundreds of columns. Directly accessing the relational store means that results are realtime, immediately reflecting changes as they’re made, and you
can drill down to the full level of detail. In addition, by not introducing a separate OLAP store, you
have less management and lower total cost of ownership (TCO). Table 1 compares the advantages of
relational versus OLAP-based reporting.
Table 1: Advantages of relational vs. OLAP-based reporting
Many relational-based reporting tools try to gain some of OLAP’s advantages by providing a
user-oriented data model on top of the relational database and routing reporting access through that
model. So, the many enterprises that need to provide both OLAP and relational reporting commonly
end up with multiple reporting tools, each with different proprietary models, APIs, and end-user
tools. This duplication of models results in a complex, disjointed architecture. Analysis Services’ new
UDM, however, combines the best of OLAP and relational approaches to enhance reporting functionality and flexibility.
The UDM Architecture
You define a UDM over a set of data sources, providing an integrated view of data that end users
access. Client tools—including OLAP, reporting, and custom business intelligence (BI) applications—
access the data through the UDM’s industry-standard APIs, as the diagram in Figure 1 shows. A UDM
has four key elements: heterogeneous data access, a rich end-user model, advanced analytics, and
proactive caching. In tandem, these elements transform sometimes difficult-to-understand data into a
coherent, integrated model. Although the UDM enables a range of new data-access scenarios, it
builds on SQL Server 2000 Analysis Services, allowing easy migration from Analysis Services 2000 and
backward compatibility for clients. Let’s look at the UDM’s key components in more detail.
Figure 1
The UDM provides a bridge between end users and their data
Heterogeneous data access. You can build a UDM over a diverse range of data sources, not
just star or snowflake data warehouses. By default, you can expose every column in a table as a
separate attribute of a dimension, enabling exposure of potentially hundreds of dimension-table
columns that users can drill down on. In addition, a cube can contain measures drawn from multiple
fact tables, letting one cube encompass an entire relational database. The model also lets different
kinds of relationships exist between measures and their dimensions, enabling complex relational
schemas. This structure supports degenerate dimensions, letting users drill down to the lowest level
of transaction data. You can also build a UDM over multiple heterogeneous data sources, using
information integrated from different back-end data sources to answer a single end-user query. These
capabilities, combined with unlimited dimension size, let the UDM act as a data-access layer over
heterogeneous sources, providing full access to the underlying data.
Rich end-user model. The UDM lets you define an end-user model over this base data-access
layer, adding the semantics commonly lacking in the underlying sources and providing a comprehensible view of the data that lets users quickly understand, analyze, and act on business information.
The core of a UDM is a set of cubes containing measures (e.g., sales amount, inventory level, order
count) that users can analyze by the details of one or more dimensions (e.g., customer, product).
The UDM builds on Analysis Services 2000’s end-user model, providing significant extensions. For
example, the UDM lets you define Key Performance Indicators (KPIs), important metrics for
measuring your business’s health. Figure 2 shows how a client tool might display three sample KPIs,
organized into display folders.
Figure 2
Three sample KPIs organized into display folders
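Because KPIs are part of the UDM itself, client code can query them like any other cube metadata. The following sketch assumes ADOMD.NET, a hypothetical [Sales] cube, and a KPI named Revenue Growth; the MDX KPIValue, KPIGoal, and KPIStatus functions surface the KPI's value, goal, and status.

// Minimal sketch: reads one KPI's value, goal, and status from a UDM through ADOMD.NET.
// The server, catalog, cube, and KPI names are hypothetical.
using System;
using Microsoft.AnalysisServices.AdomdClient;

class KpiReader
{
    static void Main()
    {
        using (AdomdConnection conn = new AdomdConnection(
            "Data Source=localhost;Catalog=SalesWarehouse"))
        {
            conn.Open();
            AdomdCommand cmd = conn.CreateCommand();
            cmd.CommandText =
                "WITH " +
                "  MEMBER [Measures].[KPI Value]  AS KPIValue(\"Revenue Growth\") " +
                "  MEMBER [Measures].[KPI Goal]   AS KPIGoal(\"Revenue Growth\") " +
                "  MEMBER [Measures].[KPI Status] AS KPIStatus(\"Revenue Growth\") " +
                "SELECT { [Measures].[KPI Value], [Measures].[KPI Goal], " +
                "         [Measures].[KPI Status] } ON COLUMNS " +
                "FROM [Sales]";

            CellSet cs = cmd.ExecuteCellSet();
            Console.WriteLine("Value: {0}  Goal: {1}  Status: {2}",
                cs.Cells[0].Value, cs.Cells[1].Value, cs.Cells[2].Value);
        }
    }
}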
Advanced analytics. You can augment the end-user model by using a comprehensive,
script-based calculation model to incorporate complex business logic into UDM cubes. The UDM’s
model for defining calculations provides something akin to a multidimensional spreadsheet. For
example, the UDM can calculate the value of a cell—say, AverageSales for the category Bike in the
year 2003—based on the values in other cells. In addition, the UDM might calculate a cell’s value
based not only on the current value of another cell but also on the previous value of that cell. Thus,
the UDM supports simultaneous equations. For example, the UDM might derive profit from revenue
minus expense but derive bonuses (included in expenses) from profit. In addition to providing the
powerful MDX language for authoring such calculations, the UDM integrates with Microsoft .NET,
letting you write stored procedures and functions in a .NET language, such as C# .NET or Visual
Basic .NET, then invoke those objects from MDX for use in calculations.
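As a rough illustration of this .NET integration, the sketch below shows the kind of static function you might compile into an assembly, register with Analysis Services, and call from an MDX calculation. The class, method, and assembly names are hypothetical, and the exact MDX invocation syntax depends on how the assembly is registered.

// Sketch of a .NET function intended to be registered with Analysis Services as an
// assembly and invoked from MDX. All names here are illustrative assumptions.
public class FinanceFunctions
{
    // Compound annual growth rate between two values over a number of periods.
    public static double CompoundGrowthRate(double startValue, double endValue, int periods)
    {
        if (startValue <= 0.0 || periods <= 0)
        {
            return 0.0; // keep the sketch simple; a real function would raise an error
        }
        return System.Math.Pow(endValue / startValue, 1.0 / periods) - 1.0;
    }
}

// After registering the assembly (here under the hypothetical name FinanceLib), an MDX
// calculated member might invoke it roughly as:
//   FinanceLib.CompoundGrowthRate(
//       ([Measures].[Sales Amount], [Time].[Year].[1999]),
//       ([Measures].[Sales Amount], [Time].[Year].[2003]), 4)
// The exact invocation syntax depends on how the assembly is registered on the server.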
Proactive caching. The UDM provides caching services that you can configure to reflect
business and technical requirements—including realtime, or near realtime, access to data while
maintaining high performance. The goal of proactive caching is to provide the performance of
traditional OLAP stores while retaining the immediacy and ease of management of direct access to
underlying data sources. Various UDM policy settings control the caching behavior, balancing the
business needs for performance with an acceptable degree of latency. Examples of possible caching
policies might be
• “Answer all queries by using the latest, realtime data.”
• “A 20-minute latency in the data is acceptable. Where possible, use a cache that’s automatically
maintained based on change notifications received from underlying data sources. If at any point
the cache is more than 20 minutes out-of-date, answer all further queries directly from the
underlying source until the cache is refreshed.”
• “Always use a cache. Periodically refresh the cache, avoiding peak-load times on the underlying
sources.”
The UDM also provides a flexible, role-based security model, letting you secure data down to a
fine level of granularity. And Yukon will include a full set of enterprise-class tools for developing and
managing UDMs. The development tools, including an MDX query editor and an MDX debugger, are
integrated with other SQL Server tools for building reports and Data Transformation Services (DTS)
packages as well as with Visual Studio .NET.
One Model for Reporting and Analysis
The UDM combines the best of traditional OLAP and relational reporting, providing a single model
that you can use as the basis for all your reporting and analysis needs. This flexible model allows
data access across multiple heterogeneous data sources, including OLTP databases and data
warehouses. And through the UDM, users can access all data, including the lowest level of transaction detail. With the UDM’s proactive caching, you can define policies to balance performance versus
the need for realtime, or near realtime, information—without having to explicitly manage a separate
Multidimensional OLAP (MOLAP) store. In addition, you can define a rich end-user model, including
complex analytic calculations, to support interactive and managed reporting.
Chapter 3:
Data Mining Reloaded
By Alexei Bocharov, Jesper Lind
The two main functions of data mining are classification and prediction (or forecasting). Data mining
helps you make sense of those countless gigabytes of raw data stored in databases by finding
important patterns and rules present in the data or derived from it. Analysts then use this knowledge
to make predictions and recommendations about new or future data. The main business applications
of data mining are learning who your customers are and what they need, understanding where the
sales are coming from and what factors affect them, fashioning marketing strategies, and predicting
future business indicators.
With the release of SQL Server 2000, Microsoft rebranded OLAP Services as Analysis Services to
reflect the addition of newly developed data-mining capabilities. The data-mining toolset in SQL
Server 2000 included only two classical analysis algorithms (Clustering and Decision Trees), a
special-purpose data-mining management and query-expression language named DMX, and limited
client-side controls, viewers, and development tools.
SQL Server 2005 Analysis Services comes with a greatly expanded set of data-mining methods
and an array of completely new client-side analysis and development tools designed to cover most
common business intelligence (BI) needs. The Business Intelligence Framework in SQL Server 2005
offers a new data-mining experience for analysts and developers alike.
Let’s quickly review the data-mining process. Then we’ll explore the seven data-mining
algorithms available in the SQL Server 2005 Analysis Services framework and briefly look at the
“plug-in” technology that can help you add new and custom algorithms to that framework. Although
we couldn’t specifically address the data-mining UI design here, the snapshots included in several
examples will give you a good first look at the power and usability of the new client-side tools.
Mining the Data
The design and deployment of a data-mining application consists of seven logical steps. First, you
prepare the data sources: Identify the databases and connection protocols you want to use. Next, you
describe the data-source views—that is, list tables that contain data for analysis. Third, define the
mining structure by describing which columns you want to use in the models. The fourth step is to
build mining models. SQL Server 2005 gives you seven data-mining algorithms to choose from—you
can even apply several methods in parallel to each mining structure, as Figure 1 shows. The fifth step
is called processing—that’s where you get the mining models to “extract knowledge” from the data
arriving from the data sources. Sixth, you evaluate the results. Using client-side viewers and accuracy
charts, you can present the patterns and predictions to analysts and decision makers, then make
necessary adjustments. Finally, incorporate data mining into your overall data-management routine—
having identified the methods that work best, you’ll have to reprocess the models periodically in
order to track new data patterns. For instance, if your data source is email and your models predict
spam, you’ll need to retrain the models often to keep up with evolving spammer tactics.
Figure 1
A choice of data-mining algorithms
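Once a model is processed, client applications can query it. Here is a minimal sketch of a singleton DMX prediction query issued through ADOMD.NET; the connection string, the [Spam Filter] model, and its columns are hypothetical, but the prediction-join pattern itself is standard DMX.

// Sketch of querying a processed mining model from client code. The connection string,
// model name, and column names are hypothetical.
using System;
using Microsoft.AnalysisServices.AdomdClient;

class PredictionClient
{
    static void Main()
    {
        using (AdomdConnection conn = new AdomdConnection(
            "Data Source=localhost;Catalog=MiningProject"))
        {
            conn.Open();
            AdomdCommand cmd = conn.CreateCommand();
            cmd.CommandText =
                "SELECT Predict([Is Spam]), PredictProbability([Is Spam]) " +
                "FROM [Spam Filter] " +
                "NATURAL PREDICTION JOIN " +
                "(SELECT 'limited time free offer' AS [Subject]) AS t";

            using (AdomdDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("Prediction: {0} (probability {1})",
                        reader.GetValue(0), reader.GetValue(1));
                }
            }
        }
    }
}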
Here’s a quick example of a useful mining model. Let’s say you’re interested in identifying major
groups of potential customers based on census data that includes occupational, demographic, and
income profiles of the population. A good method for identifying large, characteristic census groups is
to use the Clustering algorithm. This algorithm segments the population into clusters so that people in
one cluster are similar and people in different clusters are dissimilar in one or more ways. To
examine those clusters, you can use a tool called Microsoft Cluster Viewer (a standard Analysis Services 2005 component). Figure 2 shows one of the four views, giving you a side-by-side comparison
of all the clusters. For instance, Clusters 6 and 7 correspond to persons not on active military duty.
But Cluster 7 represents people who work longer hours for more income; the top row also suggests
that people in Cluster 7 are mostly married.
Figure 2
One of four views using Microsoft Cluster Viewer
Prediction and Mutual Prediction
Suppose you’ve selected just one column (e.g., Income) in a huge data table, designated that column
as Prediction target, and now you’re trying to make some predictions. But you probably won’t get far
by looking at just one column. You can compute the statistical mean and the variance range, but
that’s about it.
Instead, select specific values for one or more other columns (e.g., Age, Years of Experience,
Education, Workload in census data tables) and focus only on those data rows that correspond to the
selected values. You’ll likely find within this subset of rows that the values of the target column fall
into a relatively narrow range—now you can predict the values in the target column with some
degree of certainty. In data-mining terms, we say that those other columns predict the target column.
Figure 3 shows a snapshot of the Dependency Network (DepNet) data-mining control. This
DepNet is a diagram where arrows show which of the census columns predict which others. Some of
the edges between nodes have arrows pointing both ways; this is called mutual prediction. Mutual prediction
between A and B means that setting values of A reduces the uncertainty in column B, but also the
other way around—picking a value of B would reduce the uncertainty of A.
Figure 3
Snapshot of the Dependency Network data-mining control
All Microsoft data-mining techniques can track prediction, but different algorithms make
predictions in different ways. As we examine the other data-mining methods, we point out the
prediction specifics of each method.
Decision Trees
Prediction is the main idea behind the Microsoft Decision Trees (DT) algorithm. The knowledge that
a DT model contains can be represented graphically in tree form, but it could also appear in the form
of “node rules.” For example, in a census decision tree for Income, a rule such as (Gender = Male
and 1 < YearsWorked < 2) could describe a tree node containing the income statistics for males in
their second year on the job. This node corresponds to a well-defined subpopulation of workers, and
you should be able to make fairly specific predictions with regard to their income. Indeed, one of
the census models gave the following formula under the condition of (Gender = Male and 1 < YearsWorked < 2):
INCOME = 24804.38 + 425.99 * (YRSSRV - 1.2)
                  + 392.8 * (HOURS - 40.2)
                  + 4165.82 * (WORKLWK - 1.022)
                  ± 24897
According to this formula, INCOME is defined mostly by YRSSRV and weekly overtime. (Note
that this is just an example and not based on representative census data.) To obtain this equation in a
visually simple way, you could use the Decision Tree viewer to view the Income tree and zoom in
on a node corresponding to the Gender and YearsWorked values of interest, as the snapshot in
Figure 4 shows.
Figure 4
A typical snapshot of the Decision Tree viewer, zoomed in on an Income tree node
The rule and the formula we’ve discovered identify gender, years of service, years worked,
weekly hours, and workload as predictors for income. Because YRSSRV, HOURS, and WORKLWK
appear in the above formula for INCOME, they’re also called regressors. A decision tree that hosts
such predictive formulas is called a regression tree.
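To see how such a node formula turns into a prediction, the short sketch below simply plugs sample values into the equation above. The input values are invented for illustration, and the ± term is carried along as the uncertainty band reported with the node.

// Plugs made-up values into the node formula above to show how a regression-tree node
// yields a point prediction plus an uncertainty band.
class RegressionNodeExample
{
    static void Main()
    {
        double yrsSrv = 1.5;    // years of service
        double hours = 45.0;    // weekly hours
        double workLwk = 1.0;   // worked-last-week indicator

        double income = 24804.38
                      + 425.99 * (yrsSrv - 1.2)
                      + 392.8 * (hours - 40.2)
                      + 4165.82 * (workLwk - 1.022);
        double band = 24897.0;  // the +/- term reported with the node formula

        System.Console.WriteLine("Predicted income: {0:F2} +/- {1:F0}", income, band);
    }
}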
Time Series
The Time Series algorithm introduces the concept of past, present, and future into the prediction
business. This algorithm not only selects the best predictors for a prediction target but also identifies
the most likely time periods during which you can expect to notice the effect of each predicting
factor. For example, having built a model involving monthly primary economic indices, you might
learn that the expected Yen-to-USD currency conversion rate today depends most strongly on the
mortgage rate of 2 months ago and is related to the industrial production index of 7 months ago and
per capita income of 6 to 7 months ago.
Figure 5 shows a data-mining control called Node Legend that gives a graphical view of these
dependencies. The long left-side blue bar next to Mort30 Yr (-2) indicates a negative correlation
between Yen to USD and the mortgage rate 2 months ago—meaning that with time, as one value
goes up, the other value goes down.
Figure 5
A data-mining control called Node Legend
The purple curve (for Yen to USD) and the yellow curve (for the mortgage rate) in Figure 6 offer
a nice graphical representation of this opposing movement of rates. Smaller blue bars in Figure 5
indicate that the exchange rate is to some extent self-sustaining; indeed, they highlight the fact that
the rate today correlates well with the Yen-to-USD rate a month ago (coefficient 0.656) and somewhat
with the rate 2 months ago (coefficient -0.117). So, when refinancing to a lower rate, you might
consider cashing out and investing in Yen-backed securities—but first, you need to look at the
prediction variances (and of course keep mum about the entire scheme).
Figure 6
Graphical representation of rates
Clustering and Sequence Clustering
A new feature of Microsoft Clustering algorithms is their ability to find a good cluster count for your
model based on the properties of the training data. The number of clusters should be manageably
small, but a cluster model should have a reasonably high predictive power. You can request either of
the clustering algorithms to pick a suitable cluster count based on a balance between these two
objectives.
Microsoft Sequence Clustering is a new algorithm that you can think of as order-sensitive
clustering. Often, the order of items in a data record doesn’t matter (think of a shopping basket), but
sometimes it’s crucial (think of flights on an itinerary or letters in a DNA code). When data contains
ordered sequences of items, the overall frequencies of these items don’t matter as much as what each
sequence starts and ends with, as well as all the transitions in between.
Our favorite example that shows the benefits of Sequence Clustering is the analysis of Web
click-stream data. Figure 7 shows an example of a browsing graph of a certain group of visitors to a
Web site. An arrow into a Web page node is labeled with the probability of a viewer transitioning to
that Web page from the arrow’s starting point. In the example cluster, news and home are the
viewer’s most likely starting pages (note the incoming arrow with a probability of 0.40 into the news
node and the probability 0.37 arrow into the home node). There’s a 62 percent probability that a
news browser will still be browsing news at the next click (note the 0.62 probability arrow from the
news node into itself), but the browsers starting at home are likely to jump to either local, sport, or
weather. A transition graph such as the one in Figure 7 is the main component of each sequence
cluster, plus a sequence cluster can contain everything an ordinary cluster would.
Figure 7
Example browsing graph of a group of visitors to a Web site
Naive Bayes Models and Neural Networks
These algorithms build two kinds of predictive models. The Microsoft Naïve Bayes (NB) algorithm is
the quickest, although somewhat limited, method of sorting out relationships between data columns.
It’s based on the simplifying hypothesis that, when you evaluate column A as a predictor for target
columns B1, B2, and so on, you can disregard dependencies between those target columns. Thus, in
order to build an NB model, you only need to learn dependencies in each (predictor, target) pair. To
do so, the Naïve Bayes algorithm computes a set of conditional probabilities, such as this one, drawn
from census data:
Probability( Marital = "Single" | Military = "On Active Duty" ) = 0.921
This formula shows that the probability of a person being single while on active duty is quite
different from the overall, population-wide probability of being single (which is approximately 0.4),
so you can conclude that military status is a good predictor of marital status.
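The counting behind a conditional probability like the one above is easy to illustrate. The sketch below is not the Microsoft Naïve Bayes implementation, just the underlying idea, and the handful of records in it are made up.

// Toy illustration of the counting behind a conditional probability such as
// P(Marital = "Single" | Military = "On Active Duty"). The records are made up.
class ConditionalProbabilityDemo
{
    static void Main()
    {
        // Each row: marital status, military status.
        string[,] people =
        {
            { "Single",  "On Active Duty" },
            { "Single",  "On Active Duty" },
            { "Married", "On Active Duty" },
            { "Married", "Civilian" },
            { "Single",  "Civilian" },
            { "Married", "Civilian" },
        };

        int activeDuty = 0, singleAndActive = 0, singleTotal = 0;
        for (int i = 0; i < people.GetLength(0); i++)
        {
            if (people[i, 0] == "Single") singleTotal++;
            if (people[i, 1] == "On Active Duty")
            {
                activeDuty++;
                if (people[i, 0] == "Single") singleAndActive++;
            }
        }

        double pSingleGivenActive = (double)singleAndActive / activeDuty;
        double pSingleOverall = (double)singleTotal / people.GetLength(0);

        // A large gap between the two numbers suggests Military predicts Marital.
        System.Console.WriteLine("P(Single | Active Duty) = {0:F2}", pSingleGivenActive);
        System.Console.WriteLine("P(Single)               = {0:F2}", pSingleOverall);
    }
}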
The Neural Networks (NN) methodology is probably the oldest kind of prediction modeling and
possibly the hardest to describe in a few words. Imagine that the values in the data columns you
want to predict are outputs of a “black box” and the values in the potential predictor data columns
are inputs to the same black box. Inside the box are several layers of virtual “neurons” that are
connected to each other as well as to input and output wires.
The NN algorithm is designed to figure out what’s inside the box, given the inputs and the
corresponding outputs that are already recorded in your data tables. Once you’ve learned the internal
structure from the data, you can predict the output values (i.e., values in target columns) when you
have the input values.
Association Rules
The Association Rules algorithm is geared toward analyzing transactional data, also known as market-basket data. Its main use is for high-performance prediction in cross-sell data-mining applications.
This algorithm operates in terms of itemsets. It takes in raw transaction records, such as the one that
Figure 8 shows, and builds a sophisticated data structure for keeping track of counts of items (e.g.,
products) in the dataset.
Figure 8
Raw transaction records
Transaction ID    Item
——————————————    ————
1                 Bread
1                 Milk
2                 Bread
2                 Milk
2                 Juice
The algorithm creates groups of items (the itemsets) and gathers statistical counts for them. For
Figure 8’s tiny sample record, the statistics would look like Figure 9.
Figure 9
Statistics for Figure 8’s records
Itemset                  Count
———————                  —————
<Bread, Milk>            2
<Bread, Juice>           1
<Milk, Juice>            1
<Bread, Milk, Juice>     1
One of the most important parameters of a model is a threshold for excluding unpopular items
and itemsets. This parameter is called the minimum support. In the preceding example, if you set the
minimum support to 2, the only itemsets retained will be <Bread>, <Milk>, and <Bread, Milk>.
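The sketch below reproduces this counting step for the transactions in Figure 8: it tallies single items and item pairs per transaction and then applies the minimum-support filter. It illustrates only the counting idea, not the full Association Rules algorithm.

// Sketch of the itemset counting shown in Figures 8 and 9, with a minimum-support
// filter. This is only the counting idea, not the Association Rules algorithm itself.
using System;
using System.Collections.Generic;

class ItemsetCounter
{
    static void Main()
    {
        // Transaction ID -> items bought, taken from Figure 8.
        Dictionary<int, string[]> transactions = new Dictionary<int, string[]>();
        transactions[1] = new string[] { "Bread", "Milk" };
        transactions[2] = new string[] { "Bread", "Milk", "Juice" };

        int minimumSupport = 2;
        Dictionary<string, int> counts = new Dictionary<string, int>();

        foreach (string[] items in transactions.Values)
        {
            // Count single items and item pairs within each transaction.
            for (int i = 0; i < items.Length; i++)
            {
                Increment(counts, "<" + items[i] + ">");
                for (int j = i + 1; j < items.Length; j++)
                {
                    Increment(counts, "<" + items[i] + ", " + items[j] + ">");
                }
            }
        }

        // Only itemsets meeting the minimum support survive:
        // here <Bread>, <Milk>, and <Bread, Milk>, each with support 2.
        foreach (KeyValuePair<string, int> entry in counts)
        {
            if (entry.Value >= minimumSupport)
            {
                Console.WriteLine("{0}  support = {1}", entry.Key, entry.Value);
            }
        }
    }

    static void Increment(Dictionary<string, int> counts, string itemset)
    {
        if (counts.ContainsKey(itemset)) counts[itemset]++;
        else counts[itemset] = 1;
    }
}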
The result of the algorithm is the collection of itemsets and rules derived from the data. Each rule
comes with a lift score and a support value greater than or equal to the minimum support. The lift score measures how well the rule predicts the target item. Once the
algorithm finds the interesting rules, you can easily use them to get product recommendations for
your cross-sell Web sites or direct-mail materials.
Third-Party Algorithms (Plug-Ins)
The seven Microsoft algorithms pack a lot of power, but they might not give you the kind of
knowledge or prediction patterns you need. If this is the case, you can develop a custom algorithm
and host it on the Analysis Server. To fit into the data-mining framework, your algorithm needs to
implement five main COM interfaces:
1. The algorithm-factory interface is responsible for the creation and disposal of the algorithm
instances.
2. The metadata interface ensures access to the algorithm’s parameters.
3. The algorithm interface is responsible for learning the mining models and making predictions
based on these models.
4. The persistence interface supports the saving and loading of the mining models.
5. The navigation interface ensures access to the contents of these models.
Some of these interfaces are elaborate and take getting used to, but implementation templates
are available in the Tutorials and Samples part of the SQL Server 2005 documentation. After you
implement and register your algorithm as a COM object, hooking it up to the Analysis Server is as
easy as adding a few lines to the server configuration.
When the algorithm is ready and hooked up, its functionality immediately becomes available
through the tools in the Business Intelligence Development Studio and SQL Server Management
Studio. Analysis Server treats the new algorithm as its own and takes care of all object access and
query support.
Dig In
Analysis Services 2005 represents a complete redesign of Microsoft’s BI platform. Embracing .NET,
XML for Analysis, and ADOMD.NET, it offers an array of powerful new algorithms, full-featured
designers, and viewers. Even bigger news is how open and transparent the platform has become.
With Analysis Services 2005’s new client APIs, plug-in algorithm capabilities, server object model,
managed user-defined functions (UDFs), and complete Microsoft Visual Studio integration, there’s
virtually no limit to what a motivated BI developer can do.
Chapter 4:
What’s New in DTS
By Kirk Haselden
In early 2000, the Microsoft Data Transformation Services (DTS) development team I work on started
revising DTS with the goals of building on previous success and of improving the product to support
user requests and to provide a richer extraction, transformation, and loading (ETL) platform. We
evaluated every aspect of DTS and decided to totally rewrite it. DTS in the upcoming SQL Server
2005 release, formerly code-named Yukon, sports many brand-new features as well as enhanced
ones. Because so much of DTS is new in SQL Server 2005, I want to show you some of the most
important changes and the new look of the DTS Designer. When I wrote this chapter, I was working
with Beta 1 of SQL Server 2005 DTS, so some features might change in upcoming betas or in the
final release. But if you’re already familiar with SQL Server 2000 and 7.0 DTS releases, you’ll be able
to appreciate the coming improvements.
SQL Server 2005 DTS Design Goals
Because comprehending everything about DTS at a glance is difficult, let’s just take a quick look at
the most important goals and how the goals drove the design and feature decisions the DTS team
made in SQL Server 2005. Although these descriptions are brief, they should help you grasp the
magnitude of the changes.
Provide true ETL capabilities. Although the data pump in pre-SQL Server 2005 DTS is useful
and flexible, most users recognize that it has its limitations and needs to be revamped. For example,
the data pump supports only one source and one destination per pump. True enterprise ETL requires
fast, flexible, extensible, and dependable data movement. SQL Server 2005 DTS provides this
capability through the Data Flow Task—or, as our team calls it, the pipeline. The pipeline supports
multiple sources, multiple transforms, and multiple destinations in one fast, flexible data flow. As of
Beta 1, SQL Server 2005 DTS includes 26 transforms. The Conditional Split and Derived Column
transforms use an expression evaluator to support operations that provide virtually limitless combinations of functionality for processing data. Other transforms such as the Slowly Changing Dimension,
Fuzzy Match, Aggregate, File Extractor, File Inserter, Partition Processing, Data Mining Query,
Dimension Processing, Lookup, Sort, Unpivot, and Data Conversion transforms provide powerful
data-manipulation capabilities that don’t require scripting. This change is a real benefit because users
can develop transformation solutions faster and manage them more easily than hand-coded solutions.
Distinguish between data flow, control flow, and event handling. SQL Server 2005 DTS
emphasizes the differences between various kinds of data processing. In current DTS releases, users
are sometimes confused when they try to distinguish between data flow and control flow because
both appear on the DTS Designer surface. In SQL Server 2005 DTS, the concept of data flow includes
all the activities users perform to extract, transform, and load data. Control flow comprises all the
processes that set up a given environment to support ETL, including executing the data flow. SQL
Server 2005 DTS also has event handlers that allow nonsequential control flow execution based on
events that tasks and other objects generate inside a package. SQL Server 2005 DTS clearly
distinguishes between data flow, control flow, and event handling in the UI by showing them in
separate Designer surfaces.
Minimize disk usage. To make DTS into a screaming fast ETL tool, we needed to eliminate
unnecessary disk writes, disk reads, and memory movement. Because ETL solutions can be quite
complex, they typically involve some sort of disk caching and lots of memory movement and
allocations. In some cases, you can’t avoid disk usage—for example, during data extraction, data
loading, or aggregation or sorting of data sets that are larger than available memory. But in many
cases, moving memory and caching aren’t necessary. The pipeline helps eliminate the avoidable
cases by optimizing memory usage and being smart about moving memory only when absolutely
necessary.
Improve scalability. To be accepted as an enterprise ETL platform, SQL Server 2005 DTS
needed the ability to scale. Users in smaller shops might need to run DTS on less-powerful,
affordable commodity hardware, and users in enterprise environments want it to scale up to SMP
production machines. SQL Server 2005 DTS solves this scalability problem by using multiple threads
in one process. This approach is more efficient and uses less memory than using multiple processes.
SQL Server uses this scaling approach successfully, so we decided to use the same method for DTS.
Recognize the development-programming connection. Experienced DTS users know that
developing packages is much like writing code, but DTS in SQL Server 2000 doesn’t support that
connection very well. However, SQL Server 2005 DTS provides a professional development environment that includes projects, deployment, configuration, debugging, source control, and sample code.
Package writers will have the tools they need to effectively write, troubleshoot, maintain, deploy,
configure, and update packages in a fully supported development environment.
Improve package organization. As packages grow in size and complexity, they can sometimes become cluttered and unintelligible. To address users’ concerns about managing larger
packages, our team added more structure for packages and provided ways to better manage the
objects in each package. For example, the DTS runtime, which houses the DTS control flow, now has
containers that isolate parts of a package into smaller, easy-to-organize parts. Containers can hold
other containers and tasks, so users can create a hierarchy of parts within the package.
SQL Server 2005 DTS variables are now scoped, which means that variables in a container are
visible only to the container where the variable is defined and to the container’s children. Containers
also help users define transaction scope. In SQL Server 2005 DTS, users can define transaction scope
by configuring the transaction in a container. Because a package can have multiple containers, one
package can support the creation of multiple independent transactions. Users can also enable and
disable execution of a container and all its children, which is especially useful when you attempt to
isolate parts of the package for debugging or for developing new packages. On SQL Server 2005’s
DTS Designer surface, users can collapse containers to simplify the visible package and view a
package as a collection of constituent compound parts. Variables support namespaces, which simplify
identification and eliminate ambiguity in variable names. All these features let users simplify complex
packages.
Eliminate promiscuous package access. In SQL Server 2005 DTS, the package pointer is no
longer passed in to tasks, so tasks have no way to peruse the package and its contents. This design
change discourages promiscuous access and profoundly affects the way users create DTS packages in
SQL Server 2005 because it enforces declarative package creation, a process similar to coding. The
change also simplifies package maintenance, troubleshooting, debugging, upgrading, and editing
because the package logic is exposed in the Designer, not hidden inside a task.
In SQL Server 2000 DTS, tasks sometimes use the package pointer to promiscuously access the
internals of the package in which they’re running. This practice is a common use for the ActiveX
Script Task. Scripting is desirable, but using the ActiveX Script Task this way creates packages that are
difficult to understand and troubleshoot. It also makes updating packages difficult. For example,
automatically upgrading a package that uses an ActiveX Script Task to loop inside the package is
difficult because an upgrade utility would have to parse the script and modify it to work against the
new object model. Continued support for tasks accessing the package object model would make
upgrading packages to future DTS releases difficult as well. Also, this kind of promiscuous package
access isn’t advisable because in SQL Server 2005 DTS, tasks would interfere with all the services the
DTS runtime provides, causing unpredictable results.
Removing promiscuous access has affected the set of runtime features. Many of the new DTS
runtime features provide alternative ways of performing the functions that, in earlier DTS releases, the
ActiveX Script and Dynamic Properties tasks provided. SQL Server 2005 DTS includes new loop
containers, configurations, property mappings, and expressions that directly target the functional void
that this change creates. These new features are better supported and more consistent and manageable than solutions you code yourself.
So what’s happened to the ActiveX Script Task? Although it has a new, more powerful UI with
integrated debugging, integrated Help, autocomplete, and Intellisense, the ActiveX Script Task, like all
other tasks, is limited to providing only task behavior and can no longer modify the package during
execution.
Isolate tasks in control flow. The focus of control-flow functionality in DTS has shifted from
tasks to the runtime. Although tasks still define package behavior, in SQL Server 2005 DTS, the
runtime directs all aspects of control-flow execution order, looping, package termination, configuration, event handling, connections, and transactions. Tasks in SQL Server 2005 DTS are relatively
isolated objects that have no direct access to the package, other tasks, or external resources. Tasks
work only in their problem domain and only with the parameters the DTS runtime passes to them.
A special container called a Taskhost imposes most of these limits. Taskhosts are generally transparent
to the package writer and perform default behavior on behalf of tasks. Some of the Taskhost’s
benefits are subtle, but one important benefit is that it simplifies writing a task that supports the new
features such as breakpoints and logging.
Connection managers are another feature that extends the runtime’s control over the environment
in which tasks run. Connection managers are similar to connections in DTS in SQL Server 2000 but
more extensive and more important. In SQL Server 2005 DTS, tasks and other objects use connection
managers for accessing all external resources, including data from databases, flat files, Web pages,
and FTP servers. Using connection managers lets the DTS runtime validate, detect, and report when a
connection is using an invalid source or destination. The use of connection managers also lets users
more easily discover what resources a package is accessing. Because resource access is confined to
connection managers and not spread throughout the package in perhaps unknown or hard-to-find
properties on tasks, use of connection managers simplifies package configuration, maintenance, and
deployment.
Improve extensibility. The Microsoft DTS development team wrote SQL Server 2005 DTS with
the understanding that it was to be a true platform. By this, I mean that users can embed DTS in
their applications, write custom components and plug them into DTS, write management UIs for DTS,
or use it for its original purpose—as a utility for moving data. Extensibility is a big part of what
makes the new DTS a platform instead of a simple utility. Customers can still write custom tasks and
custom transforms in SQL Server 2005 DTS, but the product contains new options that let customers
write tasks and transforms by using managed code written in C#, Visual Basic .NET, and other .NET
languages. And SQL Server 2005 DTS still supports writing custom components with Visual Basic (VB)
6.0, C++, and other native development languages.
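As a taste of what embedding DTS might look like, the sketch below loads a saved package from disk and executes it through the managed runtime object model. The namespace, class names, and file path reflect the Beta 1 bits described in this chapter and are assumptions that could change by the final release.

// Sketch of embedding DTS in an application: load a saved package and execute it
// through the managed runtime object model. Names reflect the Beta 1 bits and may
// change; the package path is hypothetical.
using System;
using Microsoft.SqlServer.Dts.Runtime;

class PackageRunner
{
    static void Main()
    {
        Application app = new Application();

        // Load a package saved to the file system.
        Package package = app.LoadPackage(@"C:\Packages\LoadWarehouse.dtsx", null);

        DTSExecResult result = package.Execute();
        Console.WriteLine("Package finished with result: {0}", result);
    }
}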
The new release also includes more types of extensible components. Previous DTS releases
provide connectivity only through OLE DB connections. SQL Server 2005 DTS includes HTTP, FTP,
OLE DB, Windows Management Interface (WMI), Flat File, File, and other connections, and users can
write their own connections if the ones they want aren’t available. If users want to support new
protocols or even new data-access technologies, they can create new connection types to support
them without modifying other DTS components. The extensible connection feature benefits Microsoft
and customers; it makes adding new connections simpler for Microsoft, and customers aren’t limited
to what Microsoft provides.
In SQL Server 2005 DTS, the runtime intrinsically supports looping through two new looping
constructs in the form of containers. The Forloop container evaluates a user-defined expression at the
beginning of each iteration, and the Foreachloop container iterates once for each item in a user-provided collection by using a new type of object called a Foreachenumerator. Because these loop
constructs are containers, users can place tasks and other containers inside them and execute their
contents multiple times. SQL Server 2005 DTS ships with several Foreachenumerators including SQL
Server Management Objects (SMO), generic collection, XML nodelist, ADO, file, and multi-element
enumerators. Foreachenumerators are also extensible, so if users want to support custom collections,
they can write their own enumerators.
SQL Server 2005 DTS supports another new type of object called a log provider. Log providers
handle all the details of creating log entries for a given destination and format. SQL Server 2005 DTS
lets users easily write their own log providers if the ones that ship in the box don’t meet their needs.
The new product will ship with several log provider types, including Text, XML, event log, SQL
Server Profiler, and SQL Server.
The pipeline is also extensible. Users can write custom data adapters and transformations that
plug into the pipeline. Users can also write pipeline data-source adapters to support a particular
source’s format, parse the data, and put it into the pipeline. Likewise, pipeline data-destination
adapters support removing data from the pipeline and loading it to the destination. Pipeline
transforms are components that modify data as it flows through the pipeline. SQL Server 2005 DTS
provides several options for writing pipeline data adapters and transforms, including using native
code, managed code, or the Managed Script Transform.
Redesigning the Designer
SQL Server 2005’s DTS Designer is more capable and powerful than those in earlier DTS releases.
The new DTS Designer is hosted in the Visual Studio shell to take advantage of all the features Visual
Studio provides such as integrated debugging, Intellisense, source control, deployment utilities,
property grids, solution management, and editing support. These features simplify building, managing,
and updating packages. As I mentioned, data flow, control flow, and event handling are separated
into dedicated panes in the DTS Designer. This separation makes it easier for users to see what the
package is doing and to isolate parts of the package. SQL Server 2005 DTS supports debugging with
features such as breakpoints, watches, errors, warnings, informational messages, and progress notifications. Packages now return more targeted and informative error messages that are visible in various
locations in the Designer. Improvements to many UI features make the entire workspace better.
Experienced Visual Studio users will quickly feel at home in the new DTS Designer because it’s so
similar to other Visual Studio applications. But regardless of whether users are familiar with the
environment, it’s intuitive enough that they’ll be comfortable working with it in no time. Let’s look at
a few important new features of the SQL Server 2005 DTS Designer.
Designer control flow. Figure 1 shows the Control Flow view in the SQL Server 2005 DTS
Designer’s Business Intelligence Workbench. In the left pane of the window is a toolbox containing
all the available tasks. Double-clicking a control-flow item or dragging it onto the Designer surface
adds a new instance of the selected task or container to the control flow in the package.
Figure 1
The Control Flow view
On the Control Flow tab, you can see a model of the sequence container. In the model, the
Send Mail Task has an arrow beneath it. To create a precedence constraint, a user needs only to drag
the arrow to another task. The Connections tab, which lists a package’s data source connections, is in
the pane below the design surface. The information in the Connections tab makes connections easier
to find and clarifies the control flow. In SQL Server 2000 DTS, connections and tasks are combined
on one Designer page and are easy to confuse. Our team eliminated this confusion by visually separating connections from tasks. The list in the Variables pane at the bottom of the window includes
each variable’s scope and data type.
The right two panes in Figure 1 are the Solution Explorer and the Properties grid. SQL Server
2005’s DTS Designer supports Visual Studio projects, which keep track of files and settings related to
the environment and the project files. The Solution Explorer provides a central location for managing
projects. In this DTS Designer pane, you can manage Analysis Services and Reporting Services projects so that you can work with your cubes, reports, and packages in one solution. The Properties
grid is a powerful tool for modifying packages. With it, you can view and edit the properties of any
object visible within the DTS Designer, including tasks, precedence constraints, variables, breakpoints,
and connections.
The sample package on the Control Flow tab in Figure 1 shows how you can embed containers
inside each other. The package has a Foreachloop container that holds an XML Task and a Sequence
Container that holds a set of tasks. In SQL Server 2005 DTS, when you delete a container, you also
delete all the tasks and containers it holds; variables and transactions are likewise created on, and scoped to, containers. So, a transaction on the Sequence Container would be scoped only to that container's tasks
and containers and wouldn’t be visible outside the Sequence Container to tasks or containers such as
the Foreachloop or XML Task. This change makes SQL Server 2005 DTS more flexible than DTS in
SQL Server 2000, in which users can create transactions only at the package level.
Designer data flow. Figure 2 shows the DTS Designer’s Data Flow tab, which you can access
by clicking the tab or double-clicking a Data Flow Task. This view is similar to the Control Flow
view, with a few differences. When the Data Flow view is active, the toolbox in the left pane shows
Data Flow Items, including data-source adapters, transforms, and data-destination adapters. To use
these tools, users double-click them or drag them to the Designer surface. Figure 2 also shows the
Output pane. DTS requires validation, which means that a component must confirm that it can successfully run when the package calls its Execute() function. If a component can’t run, it must explain
why—DTS components communicate warnings, errors, or other information by raising events during
package validation and execution. The SQL Server 2005 DTS Designer captures such events in the
output window.
Figure 2
DTS Designer’s Data Flow tab
The Properties grid in Figure 2 shows a couple of interesting links. The Show Editor link, like the
link of the same name on the Control Flow view’s Properties grid, opens the editor for the currently
selected transform. The Show Advanced Editor link shows a generic editor that lets users edit
transforms that have no custom UI. Because transforms in a Data Flow Task don’t execute in
sequence, the Data Flow view instead provides Data Viewers, UI elements that let users view data while
it’s passing between transforms. Data Viewers are a powerful debugging feature that helps package
writers understand what’s happening inside the pipeline.
Migration Pain
After reading about all the improvements, changes, and new features in DTS, you might wonder how
the new product will work with legacy DTS packages. You might even anticipate problems with
upgrading pre-SQL Server 2005 packages—and you’d be right. Early in the redesign of DTS, when we
realized that we had to change the object model drastically, we also realized that the upgrade path
from SQL Server 2000 DTS to SQL Server 2005 DTS would be difficult. After a lot of sometimes-heated discussion, we decided that our customers would benefit most if the product was free from
the limitation of strict backward compatibility so that the next generation of DTS would be based on
a more flexible design. Customers we spoke to told us that this choice was acceptable as long as we
didn’t break their existing DTS packages.
By now you’ve probably guessed that some of your packages won’t upgrade completely.
However, we’ve provided some upgrade options that you can use to help ensure a smooth migration
to SQL Server 2005 DTS. The first option is to run your existing packages as you always have. The
SQL Server 2000 DTS bits will ship with SQL Server 2005, so you’ll still be able to execute your SQL
Server 2000 DTS packages. The second option is to run SQL Server 2000 DTS packages inside SQL
Server 2005 packages. You can do this by using the new ExecuteDTS2000Package Task, which wraps
the SQL Server 2000 package in a SQL Server 2000 environment inside the SQL Server 2005 package.
The ExecuteDTS2000Package Task will successfully execute your legacy packages and is useful in
partial-migration scenarios while you’re transitioning between SQL Server 2000 and SQL Server 2005
DTS.
If you want to upgrade your packages, you have a third option. SQL Server 2005 DTS will ship
with a “best effort” upgrade wizard called the Migration Wizard that will move most of the packages
that you generated by using the SQL Server 2000 DTS Import/Export wizard. If you have an ActiveX
Script Task or a Dynamic Properties Task in your package, it probably does something that SQL
Server 2005 DTS no longer allows, such as modifying other tasks or modifying the package. The
migration wizard won’t be able to migrate those parts of the package. However, you can migrate
packages a little at a time because SQL Server 2005 DTS will support SQL Server 2000 DTS side-by-side execution.
Fresh Faces, SDK, and Other Support
Except for some new names, the Import/Export Wizard and command-line utilities remain largely
unchanged. DTSRun.exe is called DTExec.exe in SQL Server 2005 DTS. DTSRunUI.exe is called
DTExecUI in SQL Server 2005 DTS, and it features a face-lift. We added a new command-line utility
called DTUtil.exe that you can use for performing common administrative tasks such as moving,
deleting, and copying packages. The utility also performs other tasks such as checking for the
existence of packages. In addition, we included a new configuration wizard called the Package
Configurations Organizer for creating package configurations, project-development capabilities that
bundle a package with its configuration, and a self-installing executable for deploying packages to
other machines.
SQL Server 2005 DTS might ship with a software development kit (SDK). However, as of Beta 1,
the plan for the SDK is still undefined. Some features you might expect are a task wizard, a transform
wizard, and other component-creation wizards. More information about the SDK should be available
as SQL Server 2005 gets closer to shipping.
That’s the whirlwind tour of SQL Server 2005 DTS. As you can see, most of the concepts remain
the same, but the product is brand-new. Indeed, by the time you read this, DTS might even have a
new name that reflects that fact.