A Jump Start to SQL Server BI
ITPro™ Series Books

Don Awalt, Larry Barnes, Alexei Bocharov, Herts Chen, Rick Dobson, Rob Ericsson, Kirk Haselden, Brian Lawton, Jesper Lind, Tim Ramey, Paul Sanders, Mark D. Scott, David Walls, Russ Whitney

Contents

Section I: Essential BI Concepts

Chapter 1: Data Warehousing: Back to Basics
By Don Awalt, Brian Lawton
Common Terms • Establishing a Vision • Defining Scope • The Essence of Warehousing • The Rest Is Up to You

Chapter 2: 7 Steps to Data Warehousing
By Mark D. Scott, David Walls
Step 1: Determine Business Objectives • Step 2: Collect and Analyze Information • Step 3: Identify Core Business Processes • Step 4: Construct a Conceptual Data Model • Step 5: Locate Data Sources and Plan Data Transformations • Step 6: Set Tracking Duration • Step 7: Implement the Plan

Chapter 3: The Art of Cube Design
By Russ Whitney, Tim Ramey
Designing a Sales-Forecasting Cube • Providing Valid Data

Chapter 4: DTS 2000 in Action
By Larry Barnes
Introducing the Create FoodMart 2000 Package • Initializing Global Variables and the Package State • Preparing the Existing Environment • Creating the FoodMart Database and Tables • Change Is Good

Chapter 5: Rock-Solid MDX
By Russ Whitney

Chapter 6: XML for Analysis: Marrying OLAP and Web Services
By Rob Ericsson
Installing XMLA • Using XMLA: Discover and Execute • Getting Results • A Convenient Marriage

Chapter 7: Improving Analysis Services Query Performance
By Herts Chen
Traffic-Accident Data Warehouse • Queries and Bottlenecks • Usage-Based Partitioning • Partition Testing • Guidelines for Partitioning

Chapter 8: Reporting Services 101
By Rick Dobson
Installing Reporting Services • Creating Your First Report • Creating a Drilldown Report • Deploying a Solution • Viewing Deployed Solution Items • Beyond the Basics

Section II – BI Tips and Techniques
Improve Performance at the Aggregation Level • Using Children to Automatically Update Products • Saving DTS Information to a Repository • Intelligent Business • Techniques for Creating Custom Aggregations • Using Loaded Measures to Customize Aggregations • Caution: Large Dimensions Ahead • Decoding MDX Secrets • Improve Cube Processing by Creating a Time Dimension Table • Transforming Data with DTS • Supporting Disconnected Users • Dependency Risk Analysis • Choosing the Right Client for the Task • Using Access as a Data Source • Calculating Utilization • Use Member Properties Judiciously • Get Level Names Right from the Get-Go • Aggregating a Selected Group of Members • Determining the Percentage of a Product's Contribution • Avoid Crippled Client Software • Setting OLAP Cube Aggregation Options • Use Views as the Data Source • Enter Count Estimates • Using Dynamic Properties to Stabilize DTS • Leave Snowflakes Alone • Create Grouping Levels Manually • Understand the Role of MDX • Using NON EMPTY to Include Empty Cells • Formatting Financial Reports • Analyzing Store Revenue • Use Counts to Analyze Textual Information • Consolidation Analysis • Working with Analysis Services Programmatically • Filtering on Member Properties in SQL Server 7.0 • Improving Query Performance • Using SQL ALIAS for the AS/400 • Setting Up English Query • When Do You Use Web Services? • The Security Connection

Section III – New BI Features in SQL Server 2005

Chapter 1: Building Better BI in SQL Server 2005
• How are SQL Server 2005's BI enhancements meeting Microsoft's goals for serving the BI community? And how long has your team been working on these enhancements?
• What kind of feedback have you been getting from beta testers, and which features are they most enthusiastic about?
• According to news reports, Microsoft and some large customers have deployed SQL Server 2005 Beta 2 in production environments. What is your recommendation for deploying Beta 2 and running it in production? What caveats do you have for businesses eager to move to the new version now?
• How compatible are SQL Server 2000's BI tools (OLAP, DTS, data mining) and SQL Server 2005's new BI tools? Because some of SQL Server 2005's BI tools—such as Integration Services—are completely rewritten, will they still work with SQL Server 2000 data and packages?
• SQL Server 2000 Analysis Services supports only clustering and decision-tree data-mining algorithms. Does SQL Server 2005 add support for other algorithms?
• Microsoft relies on an integrated technology stack—from OS to database to user interface. How does that integration help Microsoft's BI offerings better serve your customers' needs?
• SQL Server 2005 will be the first release in which database tools converge with Visual Studio development tools. Can you tell us what it took to align these two releases and what benefits customers will realize from the change?
• The introduction of the UDM is said to blur the line between relational and multidimensional database architectures. This approach is new for the Microsoft BI platform. What are the most interesting features the UDM offers? And based on your experience, what features do you think will surface as the most valuable for customers and ISVs?
• What tools will Microsoft add to the Visual Studio 2005 IDE to help developers create and manage SQL Server (and other database platforms') users, groups, and permissions to better insulate private data from those who shouldn't have access?
• In one of your past conference keynote addresses, you mentioned that Microsoft is adding a new set of controls to Visual Studio 2005 to permit reporting without Reporting Services. Could you describe what those controls will do, when we'll see the controls appear in Visual Studio 2005, and where you expect them to be documented?
• What benefit does 64-bit bring to SQL Server BI, and do you think 64-bit can really help the Microsoft BI platform scale to the levels that UNIX-based BI platforms scale to today?
• Who are some BI vendors you're working closely with to develop 64-bit BI computing?
• Did you leave out any BI features that you planned to add to SQL Server 2005 because of deadlines or other issues?
• Your team puts a lot of long hours into your work on SQL Server BI. What drives you and your BI developers to invest so much personally in the product?

Chapter 2: UDM: The Best of Both Worlds
By Paul Sanders
The UDM Architecture • One Model for Reporting and Analysis

Chapter 3: Data Mining Reloaded
By Alexei Bocharov, Jesper Lind
Mining the Data • Prediction and Mutual Prediction • Decision Trees • Time Series
Clustering and Sequence Clustering • Naive Bayes Models and Neural Networks • Association Rules • Third-Party Algorithms (Plug-Ins) • Dig In

Chapter 4: What's New in DTS
By Kirk Haselden
SQL Server 2005 DTS Design Goals • Redesigning the Designer • Migration Pain • Fresh Faces, SDK, and Other Support


Section I: Essential BI Concepts


Chapter 1: Data Warehousing: Back to Basics
By Don Awalt, Brian Lawton

So, you're about to undertake your first data-warehousing project. Where will you begin? Or maybe you're already implementing a warehouse, but the project is going awry and you're trying to get it back on track. What do you need to know to make it successful? Let's step back from the implementation details and examine some analysis and design roadblocks you need to overcome on your road to a successful data warehouse deployment. Along the way, we'll review the common terminology you need to understand and discuss some challenges you'll face on your quest. Following these guidelines can boost your chances for a successful data warehouse deployment.

Common Terms
First, let's define the crucial pieces of the project: a data warehouse, a data mart, and data warehousing. Although they're often used interchangeably, each has a distinct meaning and impact on the project.

A data warehouse is the cohesive data model that defines the central data repository for an organization. An important point is that we don't define a warehouse in terms of the number of databases. Instead, we consider it a complete, integrated data model of the enterprise, regardless of how or where the data is stored.

A data mart is a repository containing data specific to a particular business group in an enterprise. All data in a data mart derives from the data warehouse, and all data relates directly to the enterprisewide data model. Often, data marts contain summarized or aggregated data that the user community can easily consume.

Another way to differentiate a data warehouse from a data mart is to look at the data's consumers and format. IT analysts and canned reporting utilities consume warehouse data, whose storage is usually coded and cryptic. The user community consumes data mart data, whose storage is usually in a more readable format. For example, to reduce the need for complex queries and assist business users who might be uncomfortable with the SQL language, data tables could contain the denormalized code table values.
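To make the contrast concrete, here is a minimal T-SQL sketch of that idea. The tables, columns, and code values are hypothetical, not taken from any particular warehouse; the point is only that the warehouse stores compact codes, while the mart load decodes them once so business users never have to join code tables themselves.

    -- Warehouse storage: compact, coded, normalized (all names and codes are hypothetical)
    CREATE TABLE WarehouseVisit (
        VisitID      int     NOT NULL PRIMARY KEY,
        FacilityCode char(4) NOT NULL,    -- e.g., 'F017'
        PayerCode    tinyint NOT NULL,    -- e.g., 1 = Medicare, 2 = HMO
        VisitCharge  money   NOT NULL
    )
    CREATE TABLE FacilityCodes (FacilityCode char(4) NOT NULL PRIMARY KEY, FacilityName varchar(60) NOT NULL)
    CREATE TABLE PayerCodes    (PayerCode tinyint NOT NULL PRIMARY KEY, PayerName varchar(30) NOT NULL)

    -- Data-mart storage: the code values are denormalized into readable columns
    CREATE TABLE MartVisit (
        VisitID      int         NOT NULL PRIMARY KEY,
        FacilityName varchar(60) NOT NULL,
        PayerName    varchar(30) NOT NULL,
        VisitCharge  money       NOT NULL
    )
    GO
    -- The mart load decodes the cryptic values once, on the way in
    INSERT INTO MartVisit (VisitID, FacilityName, PayerName, VisitCharge)
    SELECT v.VisitID, f.FacilityName, p.PayerName, v.VisitCharge
    FROM WarehouseVisit AS v
    JOIN FacilityCodes  AS f ON f.FacilityCode = v.FacilityCode
    JOIN PayerCodes     AS p ON p.PayerCode    = v.PayerCode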
Finally, data warehousing is the process of managing the data warehouse and data marts. This process includes all the ongoing support needs of the refresh cycle, database maintenance, and continual refinements to the underlying data model.

One important aspect of developing a warehouse is having a data dictionary that both the project team and the user community use to derive definitions and understand data elements within the warehouse. This statement seems simple, but when you're pulling data together from many source systems, a common implementation problem (which people usually don't identify until after deployment) involves reconciling similarly named data elements that come from different systems and have subtle differences in meaning. An example of this problem in the health care community is the attribute attending physician. One system, which tracks daily patient activity, defines this term as the physician currently responsible for the patient's care. At the same facility, a system that focuses on patient billing defines it as the physician most affiliated with the visit. Both definitions are correct in their contexts, but the difference illustrates a challenge in trying to combine the two systems' data.

The health care example illustrates a symptom of what we consider the biggest challenge in a data-warehousing project: bringing together the right people from the user and development communities to create a project team. The right people have the business knowledge to explain how the data is used, the systems knowledge to extract it, and the analytical and design skills to bring it together into a warehouse. The difference between other projects and building a warehouse is that individual projects usually focus on one business area, whereas building a data warehouse focuses on combining the data and subsequent knowledge from many projects. The team must have the depth and breadth to cover all the systems involved.

Establishing a Vision
Now that we've identified the largest risk area, let's look at some steps you can take to minimize the risk. To put together the right project team, first define the project's vision and begin to establish its scope. After you do so, you'll see more clearly which users and IT staff members you need to involve.

The vision's purpose is to define the project's ultimate mission from a business perspective. In theory, all work on the project directly or indirectly supports the objectives outlined in the vision. Defining a clear, tangible mission for the project is crucial. When articulated properly, the vision defines relative priorities for the team—schedule, features, and cost. You use it to resolve requirement and implementation decisions throughout the development lifecycle: Tailor your decisions to support the mission and priorities of the vision; omit or defer others to later iterations. The vision creates a theme for the project that serves the entire project development cycle. At the highest level, you require all project activities to achieve the vision's objectives. For example, let's look at a growing health care organization in which each facility maintains a separate information system.
The vision for its warehousing project might be to provide the ability to review, analyze, and track historical data across all facilities in an appropriate and meaningful context. This vision describes an objective that implementing a data warehouse can accomplish.

Defining Scope
After you've established the project's vision, you can set its scope. Next to fielding the wrong team, the inability to define the right scope puts a project at most risk for failure. Scope refers to the potential size of the undertaking—what will be delivered successfully in a meaningful time frame. Often a warehousing project tries to deliver too much, which can result in the project falling dramatically behind schedule or even being canceled. The other extreme, building stovepipes, happens when an organization decides to use many small databases to focus on discrete business areas. Although these combined databases might look like a data warehouse, they're really data-access enhancements (or reporting enhancements) to the operational systems. This implementation isn't a true data warehouse because stovepipes are independent units with no cohesive data model tying them together. In the context of data warehousing, stovepipes achieve no enterprise-level business objectives.

Understanding the definitions we gave earlier is important for arriving at the right scope for the project. Although, by definition, the data warehouse takes into consideration the entire business, you don't need to implement it all at once. When you focus on individual business units within the overall model, design and development proceed iteratively, and you implement one or two areas at a time. Iterative development results in a faster return on investment when you prioritize the business area development, rather than waiting to roll out one massive warehouse at the end. From a scope perspective, you control the size, timing, and cost of each iteration without compromising the integrity of the overall data warehouse.

An often-overlooked aspect of the project is building the infrastructure to sustain the process of data warehousing. All too many warehousing projects break down after deployment because people fail to recognize the ongoing support costs (resources, time, and coordination) of refreshing the data. You might have designed the world's best data model and implemented a great database, but if users don't receive data in a reliable and timely manner, they'll consider the project a failure. Depending on the volatility of the source system data, warehouse data can quickly go stale. To determine the warehouse's refresh intervals, you must have project requirements that identify the rate of change in the source system data and how often the user community needs to see those changes reflected. Our experience shows that building the appropriate infrastructure to support the data warehousing aspect of the project is as important as designing the data model. So factor the ongoing support needs and the corresponding infrastructure development costs (e.g., to sustain the timely refresh of the data) into the project's scope.

The Essence of Warehousing
So far, we've focused on some of the project-planning issues and high-level design considerations involved in building a warehouse. Now it's time to examine the essence of data warehousing: data acquisition, data transformation, and data presentation.
These areas constitute the ongoing process of data warehousing and require a full understanding to avoid data refresh problems.

Data acquisition is the task of bringing data from everywhere to the data warehouse. Most businesses have several operational systems that handle the organization's day-to-day processing. These systems serve as the data source for the warehouse. The systems might reside on a mainframe, in a client/server database application, in a third-party application with a proprietary data store, within desktop applications such as spreadsheets and database applications, or any combination of these. The challenge is to identify the data sources and develop a solution for extracting and delivering the data to the warehouse in a timely, scheduled manner.

After collecting the data, you need to transform it. In an ideal organization, all systems would use the same set of codes and definitions for all data elements. In the real world, as we showed earlier, different codes and definitions exist for what appear to be the same data element. Data transformation is the cleansing and validation of data for accuracy, and ensuring that all values conform to a standard definition. After these data transformation tasks are complete, you can add the data to the warehouse.

Finally, you're ready for data presentation. At this point, the warehouse contains a large, normalized data store containing all (or part) of the organization's data. Great! Unfortunately, the users who need this data can't make sense of it because of its cryptic coding schemes and normalized storage. Data presentation involves taking the data from the data warehouse and getting it into the hands of users in a usable, easy-to-understand format. One way to present the data is to deploy a data mart containing summarized, aggregated data. Or you can put an OLAP engine between the warehouse and the user. Another option is to custom-build a reporting tool or deploy third-party solutions. Identify the most effective way to present the data, and implement it.

In completing these tasks, keep in mind that the data that users receive needs to be consistent, accurate, and timely. Failure to ensure quality data delivery could jeopardize the project's success because users won't work with inaccurate or old data. One way to minimize the risks of bad data is to involve users in the cleansing, validation, and transformation steps of the data transformation task. The more input and familiarity users have with the data validations and transformations, the more confident they'll be about the accuracy of the resulting warehouse data. Also, emphasize to the users the importance of their input into the data validation process. Explain to them that their experience and knowledge make them a necessary part of the project team and ensure the data's validity and integrity.

The Rest Is Up to You
So as you embark on your data warehousing adventure, remember these basic ideas. Carefully define the project's vision and the scope of the first iteration. Inform and involve your users. Know and understand the three major tasks of implementation—data acquisition, data transformation, and data presentation. Finally, during design always keep in mind the consistency, accuracy, and timeliness of the ongoing data delivery.
Although we can't guarantee that your warehousing project won't fail, following the basics discussed here will give you a much better chance of success.


Chapter 2: 7 Steps to Data Warehousing
By Mark D. Scott, David Walls

Data warehousing is a business analyst's dream—all the information about the organization's activities gathered in one place, open to a single set of analytical tools. But how do you make the dream a reality? First, you have to plan your data warehouse system. You must understand what questions users will ask it (e.g., how many registrations did the company receive in each quarter, or what industries are purchasing custom software development in the Northeast) because the purpose of a data warehouse system is to provide decision-makers the accurate, timely information they need to make the right choices.

To illustrate the process, we'll use a data warehouse we designed for a custom software development, consulting, staffing, and training company. The company's market is rapidly changing, and its leaders need to know what adjustments in their business model and sales practices will help the company continue to grow. To assist the company, we worked with the senior management staff to design a solution. First, we determined the business objectives for the system. Then we collected and analyzed information about the enterprise. We identified the core business processes that the company needed to track, and constructed a conceptual model of the data. Then we located the data sources and planned data transformations. Finally, we set the tracking duration.

Step 1: Determine Business Objectives
The company is in a phase of rapid growth and will need the proper mix of administrative, sales, production, and support personnel. Key decision-makers want to know whether increasing overhead staffing is returning value to the organization. As the company enhances the sales force and employs different sales modes, the leaders need to know whether these modes are effective. External market forces are changing the balance between a national and regional focus, and the leaders need to understand this change's effects on the business.

To answer the decision-makers' questions, we needed to understand what defines success for this business. The owner, the president, and four key managers oversee the company. These managers oversee profit centers and are responsible for making their areas successful. They also share resources, contacts, sales opportunities, and personnel. The managers examine different factors to measure the health and growth of their segments. Gross profit interests everyone in the group, but to make decisions about what generates that profit, the system must correlate more details. For instance, a small contract requires almost the same amount of administrative overhead as a large contract. Thus, many smaller contracts generate revenue at less profit than a few large contracts. Tracking contract size becomes important for identifying the factors that lead to larger contracts. As we worked with the management team, we learned the quantitative measurements of business activity that decision-makers use to guide the organization.
These measurements are the key performance indicators, a numeric measure of the company's activities, such as units sold, gross profit, net profit, hours spent, students taught, and repeat student registrations. We collected the key performance indicators into a table called a fact table.

Step 2: Collect and Analyze Information
The only way to gather this performance information is to ask questions. The leaders have sources of information they use to make decisions. Start with these data sources. Many are simple. You can get reports from the accounting package, the customer relationship management (CRM) application, the time reporting system, etc. You'll need copies of all these reports and you'll need to know where they come from.

Often, analysts, supervisors, administrative assistants, and others create analytical and summary reports. These reports can be simple correlations of existing reports, or they can include information that people overlook with the existing software or information stored in spreadsheets and memos. Such overlooked information can include logs of telephone calls someone keeps by hand, a small desktop database that tracks shipping dates, or a daily report a supervisor emails to a manager. A big challenge for data warehouse designers is finding ways to collect this information. People often write off this type of serendipitous information as unimportant or inaccurate. But remember that nothing develops without a reason. Before you disregard any source of information, you need to understand why it exists.

Another part of this collection and analysis phase is understanding how people gather and process the information. A data warehouse can automate many reporting tasks, but you can't automate what you haven't identified and don't understand. The process requires extensive interaction with the individuals involved. Listen carefully and repeat back what you think you heard. You need to clearly understand the process and its reason for existence. Then you're ready to begin designing the warehouse.

Step 3: Identify Core Business Processes
By this point, you must have a clear idea of what business processes you need to correlate. You've identified the key performance indicators, such as unit sales, units produced, and gross revenue. Now you need to identify the entities that interrelate to create the key performance indicators. For instance, at our example company, creating a training sale involves many people and business factors. The customer might not have a relationship with the company. The client might have to travel to attend classes or might need a trainer for an on-site class. New products such as Windows 2000 (Win2K) might be released often, prompting the need for training. The company might run a promotion or might hire a new salesperson.

The data warehouse is a collection of interrelated data structures. Each structure stores key performance indicators for a specific business process and correlates those indicators to the factors that generated them. To design a structure to track a business process, you need to identify the entities that work together to create the key performance indicator. Each key performance indicator is related to the entities that generated it. This relationship forms a dimensional model.
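As a sketch of what such a dimensional (star) model looks like in SQL Server, the following T-SQL creates one fact table surrounded by the dimension tables that generate its rows. The table and column names are hypothetical; they simply mirror the sales example used throughout this chapter.

    -- Dimension tables: the entities that generate the facts (hypothetical names)
    CREATE TABLE DimCustomer    (CustomerKey    int NOT NULL PRIMARY KEY, CustomerName    varchar(60) NOT NULL)
    CREATE TABLE DimSalesperson (SalespersonKey int NOT NULL PRIMARY KEY, SalespersonName varchar(60) NOT NULL)
    CREATE TABLE DimProduct     (ProductKey     int NOT NULL PRIMARY KEY, ProductName     varchar(60) NOT NULL)
    CREATE TABLE DimDate        (DateKey        int NOT NULL PRIMARY KEY, CalendarDate    datetime    NOT NULL)

    -- Fact table: one row per sale; the numeric columns are the key performance indicators.
    -- Its primary key is a composite of the foreign keys to the dimension tables.
    CREATE TABLE FactSales (
        CustomerKey    int   NOT NULL REFERENCES DimCustomer(CustomerKey),
        SalespersonKey int   NOT NULL REFERENCES DimSalesperson(SalespersonKey),
        ProductKey     int   NOT NULL REFERENCES DimProduct(ProductKey),
        DateKey        int   NOT NULL REFERENCES DimDate(DateKey),
        UnitsSold      int   NOT NULL,
        GrossRevenue   money NOT NULL,
        CONSTRAINT PK_FactSales PRIMARY KEY (CustomerKey, SalespersonKey, ProductKey, DateKey)
    )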
If a salesperson sells 60 units, the dimensional structure relates that fact to the salesperson, the customer, the product, the sale date, etc. Then you need to gather the key performance indicators into fact tables. You gather the entities that generate the facts into dimension tables. To include a set of facts, you must relate them to the dimensions (customers, salespeople, products, promotions, time, etc.) that created them. For the fact table to work, the attributes in a row in the fact table must be different expressions of the same event or condition. You can express training sales by number of seats, gross revenue, and hours of instruction because these are different expressions of the same sale. An instructor taught one class in a certain room on a certain date. If you need to break the fact down into individual students and individual salespeople, however, you'd need to create another table because the detail level of the fact table in this example doesn't support individual students or salespeople.

A data warehouse consists of groups of fact tables, with each fact table concentrating on a specific subject. Fact tables can share dimension tables (e.g., the same customer can buy products, generate shipping costs, and return times). This sharing lets you relate the facts of one fact table to another fact table. After the data structures are processed as OLAP cubes, you can combine facts with related dimensions into virtual cubes.

Step 4: Construct a Conceptual Data Model
After identifying the business processes, you can create a conceptual model of the data. You determine the subjects that will be expressed as fact tables and the dimensions that will relate to the facts. Clearly identify the key performance indicators for each business process, and decide the format to store the facts in. Because the facts will ultimately be aggregated together to form OLAP cubes, the data needs to be in a consistent unit of measure. The process might seem simple, but it isn't. For example, if the organization is international and stores monetary sums, you need to choose a currency. Then you need to determine when you'll convert other currencies to the chosen currency and what rate of exchange you'll use. You might even need to track currency-exchange rates as a separate factor.

Now you need to relate the dimensions to the key performance indicators. Each row in the fact table is generated by the interaction of specific entities. To add a fact, you need to populate all the dimensions and correlate their activities. Many data systems, particularly older legacy data systems, have incomplete data. You need to correct this deficiency before you can use the facts in the warehouse. After making the corrections, you can construct the dimension and fact tables. The fact table's primary key is a composite key made from a foreign key of each of the dimension tables. Data warehouse structures are difficult to populate and maintain, and they take a long time to construct. Careful planning in the beginning can save you hours or days of restructuring.

Step 5: Locate Data Sources and Plan Data Transformations
Now that you know what you need, you have to get it. You need to identify where the critical information is and how to move it into the data warehouse structure. For example, most of our example company's data comes from three sources. The company has a custom in-house application for tracking training sales.
A CRM package tracks the sales-force activities, and a custom time-reporting system keeps track of time. You need to move the data into a consolidated, consistent data structure. A difficult task is correlating information between the in-house CRM and time-reporting databases. The systems don't share information such as employee numbers, customer numbers, or project numbers. In this phase of the design, you need to plan how to reconcile data in the separate databases so that information can be correlated as it is copied into the data warehouse tables.

You'll also need to scrub the data. In online transaction processing (OLTP) systems, data-entry personnel often leave fields blank. The information missing from these fields, however, is often crucial for providing an accurate data analysis. Make sure the source data is complete before you use it. You can sometimes complete the information programmatically at the source. You can extract ZIP codes from city and state data, or get special pricing considerations from another data source. Sometimes, though, completion requires pulling files and entering missing data by hand. The cost of fixing bad data can make the system cost-prohibitive, so you need to determine the most cost-effective means of correcting the data and then forecast those costs as part of the system cost. Make corrections to the data at the source so that reports generated from the data warehouse agree with any corresponding reports generated at the source.

You'll need to transform the data as you move it from one data structure to another. Some transformations are simple mappings to database columns with different names. Some might involve converting the data storage type. Some transformations are unit-of-measure conversions (pounds to kilograms, centimeters to inches), and some are summarizations of data (e.g., how many total seats sold in a class per company, rather than each student's name). And some transformations require complex programs that apply sophisticated algorithms to determine the values. So you need to select the right tools (e.g., Data Transformation Services—DTS—running ActiveX scripts, or third-party tools) to perform these transformations. Base your decision mainly on cost, including the cost of training or hiring people to use the tools, and the cost of maintaining the tools.

You also need to plan when data movement will occur. While the system is accessing the data sources, the performance of those databases will decline precipitously. Schedule the data extraction to minimize its impact on system users (e.g., over a weekend).

Step 6: Set Tracking Duration
Data warehouse structures consume a large amount of storage space, so you need to determine how to archive the data as time goes on. But because data warehouses track performance over time, the data should be available virtually forever. So, how do you reconcile these goals? The data warehouse is set to retain data at various levels of detail, or granularity. This granularity must be consistent throughout one data structure, but different data structures with different grains can be related through shared dimensions. As data ages, you can summarize and store it with less detail in another structure. You could store the data at the day grain for the first 2 years, then move it to another structure. The second structure might use a week grain to save space.
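A summarization of that kind is a straightforward set-based operation. The following T-SQL sketch rolls day-grain rows that have aged past the two-year window into a week-grain table; the table names, the column list, and the assumed day-grain table (FactSalesDaily with a SaleDate column) are all hypothetical.

    -- Week-grain structure that receives the aged, summarized data (hypothetical layout)
    CREATE TABLE FactSalesWeekly (
        SaleYear     int   NOT NULL,
        SaleWeek     int   NOT NULL,
        ProductKey   int   NOT NULL,
        UnitsSold    int   NOT NULL,
        GrossRevenue money NOT NULL
    )
    GO
    -- Summarize day-grain rows older than the 2-year window into the week grain
    INSERT INTO FactSalesWeekly (SaleYear, SaleWeek, ProductKey, UnitsSold, GrossRevenue)
    SELECT DATEPART(yy, SaleDate),
           DATEPART(wk, SaleDate),
           ProductKey,
           SUM(UnitsSold),
           SUM(GrossRevenue)
    FROM   FactSalesDaily
    WHERE  SaleDate < DATEADD(yy, -2, GETDATE())
    GROUP BY DATEPART(yy, SaleDate), DATEPART(wk, SaleDate), ProductKey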
Data might stay there for another 3 to 5 years, then move to a third structure where the grain is monthly. By planning these stages in advance, you can design analysis tools to work with the changing grains based on the age of the data. Then if older historical data is imported, it can be transformed directly into the proper format.

Step 7: Implement the Plan
After you've developed the plan, it provides a viable basis for estimating work and scheduling the project. The scope of data warehouse projects is large, so phased delivery schedules are important for keeping the project on track. We've found that an effective strategy is to plan the entire warehouse, then implement a part as a data mart to demonstrate what the system is capable of doing. As you complete the parts, they fit together like pieces of a jigsaw puzzle. Each new set of data structures adds to the capabilities of the previous structures, bringing value to the system.

Data warehouse systems provide decision-makers consolidated, consistent historical data about their organization's activities. With careful planning, the system can provide vital information on how factors interrelate to help or harm the organization. A solid plan can contain costs and make this powerful tool a reality.


Chapter 3: The Art of Cube Design
By Russ Whitney, Tim Ramey

Cube design is more an art than a science. Third-party applications provide many templates and patterns to help a cube designer create cubes that are appropriate for different kinds of analysis (e.g., sales or budgeting). But in the end, the cube design depends on business rules and constraints specific to your organization. What quirks of your data keep you up at night? In dimensions such as Customers or Organization, you might have a hierarchy of values that change as often as you update the cube. Or you might have members that you want to include in multiple places in a hierarchy, but you don't want to double-count the values when you aggregate those members. You can handle each of these situations in multiple ways, but which way is best? In our business intelligence (BI) development work, we see lots of problems in designing sales-forecasting cubes. Let's look at a few common cube-design problems and learn how to solve them by using some techniques that you can apply to many types of cubes.

Designing a Sales-Forecasting Cube
When creating a sales-forecasting cube, a cube designer at our company typically gets the cube dimensions from the customer relationship management (CRM) system that our sales team uses for ongoing tracking of sales deals. In our CRM system, the pipeline (the list of sales contracts that representatives are working on) puts sales deals into one of three categories: Forecast (the sales representative expects to close the deal in the current quarter), Upside (the sales representative thinks the deal will be difficult to close in the current quarter), and Other (the sales representative expects to close the deal in a future quarter). Additionally, the projection defines who has agreed that a given pipeline deal should be included in the current quarter's forecast.
Each deal in the pipeline falls into one of four projection categories: Sales Rep Only (only the sales representative thinks the deal should be in the forecast and the manager has overridden the sales representative to remove the deal from the forecast), Manager Only (the manager has overridden the sales representative to include the deal in the forecast), Sales Rep & Mgr (manager and representative agree that the deal should be in the forecast), and Neither (nobody thinks the deal should be in the current quarter's forecast).

A straightforward cube design might include a dimension called Projection that has a member for each deal's status and a dimension called Pipeline that has a member for each deal's category, as Figure 1 shows.

Figure 1: A straightforward cube design

By choosing different combinations of Pipeline and Projection, you can quickly answer questions such as "Which deals in the representative's forecast did the sales manager and the sales representative both agree to?" or "Which deals in the representative's Upside category did the manager override for the current quarter?" This dimension structure also lets users view the deals if you've enabled drillthrough, so sales managers can quickly see which deals make up the forecast number they're committing to.

The problem with this dimension structure is that the Projection dimension is relevant only when the user has selected the Pipeline dimension's Forecast member. The sales representative is the only person who puts deals in the Upside and Other categories. The sales manager is responsible for agreeing or disagreeing with the sales representative's deal categorization, but the manager's input affects only the Forecast member. If users choose one of the invalid combinations, they will see no data—or even wrong data. For example, if the manager selects the deals in the current quarter's forecast but doesn't select both the Sales Rep & Mgr and Manager Only projections, the projected sales number that the cube reports for the current quarter's forecast will be too low.

Providing Valid Data
One technique that would solve the wrong-data problem is the use of calculated members. You could create a calculated member on the Pipeline dimension for each valid combination of Pipeline and Projection, then hide the Projection dimension so that the manager needs to deal with only one dimension. This technique would let sales managers easily see the target that they'd committed to for the current quarter. The problem with this solution is that Analysis Services doesn't support drillthrough operations on calculated members. In a sales-forecasting application, drillthrough is a mandatory feature because you need to be able to view the individual deals in the pipeline. Without drillthrough, you lose the ability to see individual sales deals in the pipeline.

A better solution to this problem is one that you won't find documented in SQL Server Books Online (BOL). In this approach, you create one dimension that contains all the valid combinations. Figure 2 shows the new dimension (labeled Entire Pipeline), which combines the original Pipeline and Projection dimensions into one dimension. You'll notice two things in the new dimension that weren't in the original dimensions. First, deals can appear in multiple locations in the hierarchy.
For example, the deals comprising the Rep Commit member are also in the Mgmt Commit member if the manager has also committed to them. Second, the aggregation of the members to calculate their parents' values needs to use a custom rollup formula so that the aggregation doesn't count duplicated records more than once. We can solve both problems without duplicating rows in the fact table by taking advantage of the way Analysis Services joins the dimension tables together to compute cell values.

Figure 2: Creating a new dimension

Let's look at the relationship between the fact-table entries and the Projection dimension table, which Figure 3 shows. The members of the Pipeline dimension (which we would have determined by using calculated members in the previous option) now have multiple rows in the dimension table. This structure might disprove two common assumptions about dimension tables: the assumption that the primary key in the dimension table must be unique and the assumption that a dimension member must correspond to only one row in a dimension table. Because of the way Analysis Services uses SQL to join the fact table to the dimension table when it builds the cube, neither of these assumptions is enforced. Using non-normalized tables lets us have one fact-table row that corresponds to multiple dimension members and one dimension member that corresponds to multiple categories of fact-table records. Multiple rows from the fact table have the same primary key, so those rows are included in the calculated value for that dimension member. We can calculate the correct values for every member in the dimension without increasing the size of the fact table. For very large fact tables, this technique can be a big time-saver, both when you're creating the fact table and when you're processing the cube.

Figure 3: The relationship between fact-table entries and the Projection dimension

Of course, when a fact-table record appears in more than one dimension member, the parents of those members won't necessarily contain the correct value. The default method of computing a member's value from its children is to sum the children's values. But summing won't work in this case because some fact-table records would be included more than once in the parent's total. The solution is to use unary operators that you associate with each member in a custom rollup calculation. The dimension table in Figure 2 shows the custom-rollup unary operators for each member in the dimension. The + unary operator means when the parent's value is calculated, the calculation should add the value to the parent member, and the ~ unary operator means the calculation should exclude the value from the parent's value. The Mgmt Commit member consists entirely of sales deals included in other dimension members, so Analysis Services ignores this member when calculating the value of its parent, Entire Pipeline. Analysis Services also needs to use a custom-rollup formula within the Mgmt Commit member because that member's value isn't the sum of its children. The Override-Excluded from Reps member is important for the manager to have available for analysis because it shows which deals the sales representative included in the forecast but the manager didn't commit to. However, these deals aren't part of the Mgmt Commit value, so Analysis Services needs to ignore them when aggregating the children of Mgmt Commit.
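To picture the dimension table this approach relies on, here is a rough T-SQL sketch. The column names and sample rows are assumptions (the chapter's Figure 2 and Figure 3 aren't reproduced here), but the sketch shows the two unusual properties the text describes: the same fact-table key can appear under more than one member, and each member row carries the unary operator that controls how it rolls up into its parent.

    -- Flattened pipeline dimension (column names and sample rows are illustrative only)
    CREATE TABLE EntirePipelineDim (
        DealKey       int         NOT NULL,   -- joins to the fact table; deliberately NOT unique
        MemberName    varchar(40) NOT NULL,
        ParentName    varchar(40) NOT NULL,
        UnaryOperator char(1)     NOT NULL    -- '+' add to parent, '~' exclude from parent
    )
    GO
    -- Deal 101 is committed by both the rep and the manager, so it appears twice;
    -- the '~' operators keep duplicated deals from being counted more than once
    -- when parent values are aggregated.
    INSERT INTO EntirePipelineDim VALUES (101, 'Rep Commit',  'Entire Pipeline', '+')
    INSERT INTO EntirePipelineDim VALUES (101, 'Mgmt Commit', 'Entire Pipeline', '~')
    INSERT INTO EntirePipelineDim VALUES (102, 'Override-Excluded from Reps', 'Mgmt Commit', '~')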
Now we have a cube structure that meets our needs. All the information associated with sales deals is in one dimension and the rollup formulas are computed, so the aggregate values in the dimension are correct. Because we used no calculated members, we can still enable drillthrough to see individual sales deals. When you deploy this cube to your sales team, you can be confident that the query results are accurate. You will be praised by your coworkers and be showered with gifts and money—or maybe you'll simply help your company's bottom line.

You can apply these techniques to many types of cubes. If you ever get into a situation in which you want to duplicate fact-table records in a dimension without duplicating them in the fact table, the combination of duplicating keys and using custom rollup formulas can be a great benefit.


Chapter 4: DTS 2000 in Action
By Larry Barnes

The first version of Data Transformation Services (DTS), which Microsoft introduced with SQL Server 7.0, gave database professionals an easy-to-use, low-cost alternative to more expensive products in the data extraction, transformation, and loading (ETL) market. The first versions of most products leave gaps in their coverage, however, and DTS was no exception. Microsoft provided several enhancements in SQL Server 2000 that significantly increase DTS's power and usability. Two new tasks, as well as upgrades to an existing task, are standout improvements. Let's walk through an ETL scenario that showcases these features as you create a SQL Server data mart from the FoodMart sample database that ships with SQL Server 2000.

Introducing the Create FoodMart 2000 Package
How many times have you wished that you could put SQL Server through its paces on a database larger than Northwind and Pubs? Actually, SQL Server ships with the larger FoodMart sample database, which is the source database for the FoodMart Analysis Services cube. The FoodMart database has just one drawback—it's a Microsoft Access database. I created a set of DTS packages that takes the Access database and moves it to SQL Server. This scenario provides a good framework for introducing DTS's key new features. Before diving into the details, let's look at Figure 1, which shows the Create Foodmart 2000 DTS package.

Figure 1: The Create Foodmart 2000 DTS package

You can break down this package into 15 tasks that you group into five main steps:
• initializing global variables and the package state (Tasks 1-2)
• deleting the FoodMart database if it exists (Tasks 3-6)
• creating the FoodMart database and tables (Tasks 7-10)
• moving data from Access to SQL Server (Task 11)
• cleansing the data, creating star indexes, and adding referential integrity (Tasks 12-15)

Before looking at these steps in detail, let's look at global variables—the glue that holds the package together.

Initializing Global Variables and the Package State
Global variables are the nerve center of a DTS package because they provide a central location for DTS to share information. To create, view, and set global variable values, go to the DTS Package Designer's toolbar, select Package Properties from the menu, then click the Global Variables tab, which Figure 2 shows.
SQL Server 2000's enhanced task support for global variables incorporates multiple task types—including ActiveX Script, Dynamic Properties, and Execute SQL tasks—which can set and retrieve global variable values. DTS 2000 and DTS 7.0 also support a wide range of data types, including COM components.

Figure 2: Global Variables tab

The ActiveFoodMartConnections global variable, which Figure 2 shows, is an example of a COM component. This global variable, which I created as an output parameter in Task 4, stores an ADO Recordset object that contains records describing all active FoodMart connections.

Task 1: Initializing global variables. To initialize the package global variables, you can write VBScript code into an ActiveX Script task, as Listing 1 shows.

Listing 1: Script That Initializes Package Global Variables

Function Main()
    ' Set the parameters required to initialize a package at runtime.
    DTSGlobalVariables("CopyFoodmartPackage").Value = "d:\demos\sql2000\foodmart\Foodmart Copy Tables.dts"
    DTSGlobalVariables("CopyFoodmartPackageName").Value = "Foodmart Copy Tables"
    DTSGlobalVariables("PackageGuid").Value = "{D0508D1B-6642-4DDD-8508-2F5DBA726C1A}"

    ' Set the Access Database filename and the SQL Server connection parameters.
    DTSGlobalVariables("AccessDbFileName").Value = "C:\Program Files\Microsoft Analysis Services\Samples\Foodmart 2000.mdb"
    DTSGlobalVariables("SQLServerName").Value = "(local)"
    DTSGlobalVariables("DatabaseName").Value = "FoodMart2000"
    DTSGlobalVariables("Username").Value = "sa"
    DTSGlobalVariables("Password").Value = ""

    ' Set the Directory that holds the SQL Server database files.
    Main = DTSTaskExecResult_Success
End Function

In VBScript, global variable assignments take the form

DTSGlobalVariables("name").Value = "Input-value"

where name is the global variable's name and Input-value is the value that you assign to the global variable. Note that although I use VBScript for all packages, you can also use any other installed ActiveX scripting language, such as JScript or Perl.

Task 2: Using .ini files to initialize global variables. Now, let's look at the way the new Dynamic Properties task removes one of DTS 7.0's major limitations—the inability to set key package, task, and connection values at runtime from outside the DTS environment. In DTS 7.0, developers had to manually configure packages as they moved through the package life cycle—from development to testing and finally to production. With DTS 2000, the package remains unchanged through the development life cycle; only the parameter settings made outside the package change. In this example, I use Windows .ini files to initialize the global variables. You can also initialize environment variables, database queries, DTS global variables, constants, and data files. Figure 3 shows the global variables that you can initialize. Note that the window also includes Connections, Tasks, and Steps properties.

Figure 3: Global variables you can initialize

Later in this chapter, I show you how to initialize both Connections and Tasks properties. Each global variable is linked to one entry within the specified .ini file. Figure 4 shows the Add/Edit Assignment dialog box, in which you initialize the SQLServerName global variable with the SQLServerName key from the C:\Create-foodmart.ini file.
Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 4 DTS 2000 in Action 19 Figure 4 Add/Edit Assignment dialogue box Listing 2 shows the Createfoodmart.ini file code. Note that this .ini file is the only parameter in this package that isn’t dynamic. You need to place it in the C directory or modify the task to point to the .ini file’s new location. Listing 2: Code for the Createfoodmart.ini File [Foodmart Parameters] CopyFoodmartPackage=d:\demos\sql2000\foodmart\Foodmart Bulk Copy Tables.dts CopyFoodmartPackageName=Foodmart Bulk Copy Tables PackageGuid={F4EE2316-97BE-43CA-9C2B-3371972435D3} AccessDbFileName=C:\Program Files\Microsoft Analysis Services\Samples\Foodmart 2000.mdb DatabaseDir=d:\demos\sql2000\foodmart\ SQLServerName=(local) DatabaseName=FoodMart2000 Username=sa The next two instances of the Dynamic Properties task use these initialized global variables to dynamically set important connection information, the SQL Server database files directory, and the CopyFoodMart DTS package filename, package name, and package GUID. The next four tasks delete active FoodMart database users and drop any existing FoodMart database to make sure that the system is ready for the database creation. Brought to you by Microsoft and Windows IT Pro eBooks 20 A Jump Start to SQL Server BI Preparing the Existing Environment Task 3: Setting the connection parameters. The power of the Dynamic Properties task becomes evident when you set the connection parameters. The Dynamic Properties task uses the global variables that the .ini files have already initialized to initialize SQL Server OLE DB connection properties. DTS in turn uses the connection properties to connect to SQL Server. On the General tab in the Dynamic Properties Task Properties window, which Figure 5 shows, you can see that global variables set three connection parameters and a constant value sets one parameter. Figure 5 Dynamic Properties Task Properties General tab Clicking Edit brings you to the Dynamic Properties Task: Package Properties window, which Figure 6 shows. The window displays the specific property (in this case the OLE DB Data Source property) that the global variable is initializing. Clicking Set takes you back to the Add/Edit Assignment dialog box. Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 4 DTS 2000 in Action 21 Figure 6 Dynamic Properties Task: Package Properties Window Task 4: Getting the FoodMart connection. After you set the connection parameters, you need to drop the existing FoodMart database. If users are logged in to the database, you have to terminate their sessions before you take that action. Figure 7 shows the General tab in the Execute SQL Task Properties window, which resembles the same tab in DTS 7.0. However, the Execute SQL Task Properties window in DTS 2000 incorporates the new Parameters button and the new “?” parameter marker in the SQL query. Brought to you by Microsoft and Windows IT Pro eBooks 22 A Jump Start to SQL Server BI Figure 7 Execute SQL Task Properties General tab Clicking the Parameters button takes you to the Input Parameters tab in the Parameter Mapping window, which Figure 8 shows. This window lets you pass input parameters into the Execute SQL task and place output parameters from the Execute SQL task in global variables—actions you can’t take in SQL Server 7.0. Let’s take a closer look. 
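A query that returns the active connections for a given database, like the one behind Figure 7, typically reads the sysprocesses system table, with the "?" marker standing in for the database name. A minimal sketch, assuming the standard SQL Server 2000 system tables (the query the package actually uses may differ):

-- The "?" marker is bound to the DatabaseName global variable at runtime.
SELECT spid
FROM master.dbo.sysprocesses
WHERE dbid = DB_ID(?)

The Execute SQL task then exposes the resulting rowset as an output parameter, which lands in the ActiveFoodMartConnections global variable described earlier.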
Figure 8 Parameter Mapping Input Parameters tab Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 4 DTS 2000 in Action 23 In the Parameter Mapping window, any global variable can set the SQL parameter marker, named Parameter 1. For this task, you pass the input FoodMart database name into the query by using the DatabaseName global variable. DTS 2000 packages give you the flexibility to specify the database name at runtime. In contrast, SQL Server 7.0 requires you to use additional SQL statements within the task to accomplish the same goal. Figure 9 shows how you cache the query’s output recordset for use in the next task. On the Output Parameters tab, you can store one value at a time by first choosing the Row Value option, then mapping the SELECT LIST values one-to-one with global variables. You can use all values or a subset. Figure 9 Parameter Mapping Output Parameters tab The ability to pass input parameters into the SQL task and place output parameters from the SQL task in global variables, as well as to store one value at a time, might seem minor at first. However, these features let you use the Execute SQL task in more places, providing a high-performance alternative to the DTS data pump transformation capability. As a general rule, set-based operations perform better than transformations. When I assembled DTS packages in SQL Server 7.0, I had to include additional SQL code within each task to set the correct input parameters and use temporary tables to store output parameters. In DTS 2000, you can eliminate from each SQL task the code you had to write in DTS 7.0 for passing input parameters and storing output parameters. In eliminating the code, you reduce the volume and complexity of code and therefore the time required to develop and test your DTS packages. Task 5: Killing the FoodMart connections. To terminate processes that are accessing the FoodMart database, apply the SQL Server KILL command. Task 5’s ActiveX script code loops through the rowset that is stored in the ActiveFoodMartConnections global variable, calling the code that Listing 3 shows. First, the ActiveX script builds the database connection string from DTS global variables, then saves the connection as a DTS global variable that future ActiveX Scripting tasks can use Brought to you by Microsoft and Windows IT Pro eBooks 24 A Jump Start to SQL Server BI without first having to define it. You can use this connection to build and execute one KILL command for every server process ID (SPID) in the output rowset. After you kill all connections, you’re ready to drop the existing FoodMart database. Listing 3: Code That Kills the FoodMart Connections Function Main() ‘ Get SQL Server Connection parameters. srvName = DTSGlobalVariables(“ServerName”).Value dbName = DTSGlobalVariables(“DatabaseName”).Value strUserName = DTSGlobalVariables(“UserName”).Value strPassword = DTSGlobalVariables(“Password”).Value ‘ Build the ADO Connection string and connect to SQL Server. Set cn = CreateObject(“ADODB.Connection”) strCn = “Provider=SQLOLEDB;Server=” & srvName & “;User Id=” & strUserName & “;Password=” & strPassword & “;” cn.Open strCn ‘ Cache this database connection. Set DTSGlobalVariables(“DatabaseConnection”).Value = cn ‘ Loop through the recordset that the previous Execute SQL task returned ‘ Kill each connection accessing FoodMart before dropping the database. 
Set rs = DTSGlobalVariables(“ActiveFoodMartConnections”).Value while rs.EOF <> True strSQL = “Kill “ & cstr(rs(0)) cn.Execute strSQL rs.MoveNext wend ‘ Clean up. rs.Close Set DTSGlobalVariables(“ActiveFoodMartConnections”).Value = Nothing Main = DTSTaskExecResult_Success End Function Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 4 DTS 2000 in Action 25 Task 6: Dropping FoodMart. The ActiveX script that you run for Task 6 retrieves the ADO connection that you cached in the previous task, as Listing 4 shows. Then, you build the DROP DATABASE statement and execute it. Note that you have to build the statement explicitly each time for both the KILL and DROP DATABASE commands because the SQL Data Definition Language (DDL) doesn’t support the “?” parameter marker. For that reason, you can’t pass the database or SPID as an input parameter at the same time you pass the FoodMart database name. Now that you’ve finished cleaning up the environment, you’re ready to build the new FoodMart database. Note that you designate the workflow from Task 6 to Task 7 as On Completion not On Success. You want the package to continue executing if the DROP DATABASE command failed because the database didn’t exist. To change the workflow precedence, highlight the workflow arrow that connects Task 6 to Task 7, right-click, select Properties, then select Completion Success or Failure from the Precedence drop-down combo box. Listing 4: Code That Drops the FoodMart Database Function Main() on error resume next ‘ Get the connection and drop the named database. Set cn = DTSGlobalVariables(“DatabaseConnection”).Value dbName = DTSGlobalVariables(“DatabaseName”).Value cn.Execute “DROP DATABASE “ & dbName Main = DTSTaskExecResult_Success End Function Creating the FoodMart Database and Tables You might wonder why I haven’t recommended using the Access Upsizing Wizard to move the FoodMart database to SQL Server. Although the Upsizing Wizard, which became available in Access 95, is a helpful tool that easily migrates Access databases to SQL Server, the wizard doesn’t work as well for large Access databases such as FoodMart. For these databases, you need to stage an Accessto-SQL Server migration in multiple steps similar to the steps in this example—creating the database, creating the database objects, loading the database, cleansing the data, and adding referential integrity. In deciding which utility to use, you have to take into account such factors as the underlying physical database design, table design, data type selection, and how much flexibility you have in determining when to move and cleanse data. Task 7: Creating FoodMart. The script that Listing 5 shows creates the FoodMart2000_Master database that appears in the Data Files tab on the FoodMart2000 Properties window in Figure 10. Note that the database’s size is 25MB, expandable by 10MB. Although the database can grow to 35MB, it reclaims this space when you issue a DBCC ShrinkDatabase operation from the cleanup task. Again, I used an ActiveX Script task rather than an Execute SQL task to specify at runtime the database name and the directory in which I wanted to create the new database files. I used the scripting task because DDL statements don’t support parameter markers. 
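Here is roughly the CREATE DATABASE statement the script assembles when the global variables hold the values from Listing 2; the names, sizes, and paths all come from those variables, so treat the literals below as an illustration rather than a fixed statement. Listing 5, which follows, shows the VBScript that builds it.

CREATE DATABASE [FoodMart2000]
ON (NAME = N'FoodMart2000_Master',
    FILENAME = N'd:\demos\sql2000\foodmart\FoodMart2000_Master.MDF',
    SIZE = 25, FILEGROWTH = 10)
LOG ON (NAME = N'FoodMart2000_LOG',
    FILENAME = N'd:\demos\sql2000\foodmart\FoodMart2000_LOG.LDF',
    SIZE = 20, FILEGROWTH = 10)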
Listing 5: Code That Creates the New FoodMart Database

Function Main()
    ' Get the FoodMart database name and the directory where you want to
    ' create the database files.
    dbName = DTSGlobalVariables("DatabaseName").Value
    strDir = DTSGlobalVariables("DatabaseDir").Value

    ' Append the directory string with a delimiter if necessary.
    pos = InStr(Len(strDir), strDir, "\")
    If pos = 0 Then
        strDir = strDir & "\"
    End If

    ' Get the open connection.
    Set cn = DTSGlobalVariables("DatabaseConnection").Value

    ' Create the primary database, its filename, and its log file.
    strSQL = "CREATE DATABASE [" & dbName & "] "
    BuildDBFile strSQL, dbName, strDir, "_Master", ".MDF", "25", "10"
    strSQL = strSQL & " LOG "
    BuildDBFile strSQL, dbName, strDir, "_LOG", ".LDF", "20", "10"
    cn.Execute strSQL

    Main = DTSTaskExecResult_Success
End Function

Private Sub BuildDBFile(iSQL, iDBName, iDir, iAppend, iExt, iSize, iFileGrowth)
    iSQL = iSQL & " ON (NAME = N'" & iDBName & iAppend & "',"
    iSQL = iSQL & " FileName = N'" & iDir & iDBName & iAppend & iExt & "',"
    AddSize iSQL, iSize, iFileGrowth
End Sub

Private Sub AddSize(iSQL, iSize, iFileGrowth)
    iSQL = iSQL & " SIZE=" & iSize & ", FILEGROWTH = " & iFileGrowth & ")"
End Sub

Figure 10: FoodMart 2000 Properties Data Files tab

Task 8: Setting the database properties. Set Database Properties is an ActiveX Script task that initializes database-level settings by calling the sp_dboption stored procedure. One of the database option settings worth noting here is bulkcopy, which the ActiveX script code sets to true. Bulkcopy's true setting lets the data load faster because it means that SQL Server doesn't log row-insert operations. However, be aware that for nonlogged bulk load to work, your database settings must meet additional conditions. These conditions are well documented in SQL Server Books Online (BOL).

Task 9: Initializing FoodMart's connections. The Initialize FoodMart Connections task initializes FoodMart's SQL Server OLE DB connection and the parameters required for the Execute Package task. Figure 11 shows the General tab in the Dynamic Properties Task Properties window. You've already set the OLE DB properties, so let's set a task parameter. Clicking Edit on the General tab and highlighting the PackageGuid destination property opens the Package Properties window, which Figure 12 shows. In this window, you can select the task, the PackageID, and the PackageID's default value. Once again, the Dynamic Properties task gives you maximum flexibility for configuring a property at runtime, a capability that's vital when you move a package between environments.

Figure 11: Dynamic Properties Task Properties General tab
Figure 12: Dynamic Properties Task: Package Properties

Task 10: Creating tables. After you choose the package properties, you can create the 24 database tables and populate them. Note the size of the FoodMart database: it's too large to use Access's Upsizing Wizard, and it holds enough data to warrant the explicit creation of the database schema to optimize the final database size. Your next step is to run the initial load.

Task 11: Moving data from Access to SQL Server.
Many ETL projects are complex enough to warrant the separation of logic into multiple packages. When you use SQL Server 7.0, linking these multiple packages together in a workflow is a challenge. The technique commonly used, creating an Execute Process task that shells out to the dtsrun command-line interface, is a cumbersome solution. In addition, in SQL Server 7.0 you can't set runtime parameters. SQL Server 2000 addresses both shortcomings with a new task, the Execute Package task. You use this task to invoke the DTS package that moves data from Access to SQL Server. I examine the package in more detail later in this chapter. First, let's look at the General tab in the Execute Package Task Properties window, which Figure 13 shows. Task 9 sets the key values for this package at runtime. The window in Figure 12 displays the available properties. Be aware that for all tasks, the minimum properties you need to set are the PackageName, the package FileName, and the PackageGuid so that the dynamically set package properties work correctly at runtime.

Figure 13: Execute Package Task Properties General tab

The Execute Package task incorporates another valuable feature: You can initialize the called package from the task in the Execute Package Task Properties window. To initialize the called package, you can choose either the Inner Package Global Variables tab or the Outer Package Global Variables tab, which Figure 14 shows. For this example, I used Outer Package Global Variables to initialize global variables of the same name within the called package. Figure 15 shows the called package that you use to copy the data from Access to SQL Server. This package uses a technique similar to the initialization technique that the main package uses. After the initialization task completes, each of the 24 transformation tasks fires and completes independently of the others.

Figure 14: Outer Package Global Variables tab
Figure 15: The called package that copies the data from Access to SQL Server

In each transformation, you map the source to the destination column. DTS refers to this action as the data pump. Figure 16 shows the transformations for the account table in the Transform Data Task Properties window, Transformations tab. You can set one transformation for the entire row, as Figure 16 shows, or map the table column-to-column, as Figure 17 shows. You might expect that minimizing the number of transformations would significantly speed up the copy task's performance. However, my SQL Server Profiler tests showed that the timing results are similar for both packages. One of the test runs revealed that both techniques use the BULK INSERT command to transfer information to SQL Server. BULK INSERT used as a default command is another new SQL Server 2000 feature. When you use BULK INSERT capabilities, you can greatly improve execution time for your transformation tasks. However, this performance gain comes at a cost: Inserting data in bulk mode doesn't work with the new SQL Server 2000 logging features.

Figure 16: Transform Data Task Properties Transformations tab
Figure 17: Mapping the table column-to-column

To understand the problem, let's look at Figure 18, which shows the Options tab for one of the transformations.
Note that the Use fast load option is enabled by default for a copy transformation. Disabling this feature changes the method of loading the destination data rows from a nonlogged, bulk-load interface to a logged interface. The quick Profiler timing tests I ran on my machine show that the task runtime is more than 10 times longer when you disable Use fast load. However, when you run a transformation with Use fast load enabled, you can't take advantage of one of the new SQL Server 2000 logging features, which lets you save copies of all rows that fail during the transformation. This logging feature is valuable because it lets you log and later process all failing rows for a particular transformation. ETL processing often requires you to make choices, and a trade-off accompanies every choice. Here, you must decide between set-based processing and row-based processing when you build your transformations. Set-based processing usually provides better performance, whereas row-based processing gives you more flexibility. I use set-based processing in the next two tasks, in which I cleanse the data and create primary keys and referential integrity.

Figure 18: Transform Data Task Properties Options tab

Tasks 12 and 13: Cleansing the data. The FoodMart Access database suffers from data-quality problems. For this exercise, let's look at the snowflake schema for the sales subject area, whose key values and structure Figure 19 shows.

Figure 19: Snowflake schema for the sales subject area

The sales_fact_1997 table holds foreign key references to FoodMart's key dimensions: products, time, customers, promotions, and store geography. I structured the products and store dimensions in a snowflake pattern to reflect the hierarchies for each dimension; for example, each product has a product family, department, category, subcategory, and product brand. Both fact tables contain duplicate rows: sales_fact_1997 has 8 and sales_fact_1998 has 29. If you want to apply star indexes and referential integrity to the star schema, you have to purge the duplicated data. This challenge is nothing new to developers with data warehouse experience; typically 80 percent of total project time is spent on data cleansing. The ETL developer has to decide whether to use set-based processing or row-based processing for the data-cleansing phase of the project. For this example, I used set-based processing. To cleanse the sales_fact_1997 table, you can run the SQL code that Listing 6 shows.

Listing 6: Code That Cleanses the sales_fact_1997 Table

SELECT time_id, product_id, store_id, promotion_id, customer_id,
    COUNT(*) AS dup_count
INTO #tmp_sales_fact_1997
FROM dbo.sales_fact_1997
GROUP BY time_id, product_id, store_id, promotion_id, customer_id
HAVING COUNT(*) > 1

BEGIN TRANSACTION

DELETE s
FROM dbo.sales_fact_1997 s
INNER JOIN #tmp_sales_fact_1997 t
    ON s.time_id = t.time_id
    AND s.product_id = t.product_id
    AND s.store_id = t.store_id
    AND s.promotion_id = t.promotion_id
    AND s.customer_id = t.customer_id

DROP TABLE #tmp_sales_fact_1997

COMMIT

The first step in cleansing the data is to find all the rows that contain duplicate entries and create a spot to store them; in this example, the code stores the results in a temporary table.
Next, it deletes the duplicate entries from the fact table. Then, the code deletes the table that it used to store the duplicates. Note that in using set-based processing to cleanse information before inserting it into the star or snowflake schema, you introduce data loss because you don’t re-insert the distinct duplicate rows that contain identical key values into the table. I decided to use set-based processing in this example because I don’t know enough about the underlying data to determine which of the duplicate rows is the correct one. In a real project, you place these duplicate rows in a permanent table in a data warehouse or data mart metadata database that you establish to store rows that fail the data-cleansing process. You can then examine these failed rows to determine what exception processing should occur. The data mart database also stores additional information about package execution, source data, and other key information that describes and documents the ETL processes over time. After cleansing the fact tables, you can create star indexes and add referential integrity. Task 14: Creating star indexes. Task 14 creates a primary key for both the sales_fact_1997 and sales_fact_1998 tables. The primary key, which is also called a star index, is a clustered index that includes each of the fact tables’ foreign keys that reference a dimension’s primary key. You can realize several benefits from creating primary keys; one significant benefit is that the query optimizer can use this primary key for a clustered index seek rather than a table scan when it builds its access plan. The query optimizer takes advantage of the star index in the code example that Listing 7 shows. Note that the query-access patterns demonstrate how much the star index can speed up your queries; for example, the execution time in the query that Listing 7 shows plummeted by two-thirds when I added the star index. Using a star index in queries for very large databases (VLDBs) carries another important benefit: The query optimizer might decide to implement a “star join,” which unions the smaller dimensions together before going against the fact table. Usually, you want to avoid unions Brought to you by Microsoft and Windows IT Pro eBooks 36 A Jump Start to SQL Server BI for the sake of efficient database optimization. However, a star join is a valid and clever optimization technique when you consider that the fact table might be orders of magnitude larger than its dimensions. Listing 7: Code That Creates a Primary Key (Star Index) for the sales_fact Tables SELECT t.the_year, t.quarter, p.brand_name, c.state_province, c.city, s.store_name, SUM(sf.store_sales) AS Sales, SUM(sf.store_cost) AS Cost, SUM(sf.unit_sales) AS “Unit Sales” FROM sales_fact_1998 sf INNER JOIN customer c ON sf.customer_id = c.customer_id INNER JOIN product p ON sf.product_id = p.product_id INNER JOIN time_by_day t ON sf.time_id = t.time_id INNER JOIN store s ON sf.store_id = s.store_id WHERE t.the_year = 1998 AND t.quarter = ‘Q3’ AND p.brand_name = ‘Plato’ GROUP BY t.the_year, t.quarter, p.brand_name, c.state_province, c.city, s.store_name ORDER BY t.the_year, t.quarter, p.brand_name, c.state_province, c.city, s.store_name Task 15: Adding referential integrity. The last major task in this package adds referential integrity, which links all the star schema’s foreign keys to their associated dimensions’ primary keys. 
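The DDL behind Tasks 14 and 15 isn't reproduced here, but based on the fact-table keys in Listing 6 and the dimension tables joined in Listing 7, the statements look roughly like the following sketch; the constraint names and the column order are assumptions, and you would repeat the foreign-key statement once per dimension and again for sales_fact_1998:

-- Task 14: the star index is a clustered primary key over the fact table's foreign key columns.
ALTER TABLE dbo.sales_fact_1997
    ADD CONSTRAINT PK_sales_fact_1997
    PRIMARY KEY CLUSTERED (time_id, product_id, store_id, promotion_id, customer_id)

-- Task 15: a foreign key ties each fact-table column back to its dimension's primary key.
ALTER TABLE dbo.sales_fact_1997
    ADD CONSTRAINT FK_sales_fact_1997_customer
    FOREIGN KEY (customer_id) REFERENCES dbo.customer (customer_id)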
As a general rule, adding referential integrity is beneficial because it ensures that the integrity of the data mart is uncompromised during the load phase. Administrators for large data warehouses might choose not to implement this step because of the extra overhead of enforcing referential integrity within the database engine. Cleanup, the final task, uses an ActiveX script to invoke DBCC ShrinkDatabase and to clean up the connection that the global variables are storing. A production-quality DTS package includes additional tasks, such as a mail task that sends the status of the package execution to the DBA team.

Change Is Good
Sometimes little things make a big difference. This maxim is certainly true for SQL Server 2000's DTS enhancements. The Create Foodmart 2000 package showcases two new tasks in particular: the Dynamic Properties and Execute Package tasks, which help DTS programmers implement production-quality packages. And when Microsoft added I/O capabilities to the Execute SQL task, the company established global variables as the hub of activity within a DTS package.

Chapter 5: Rock-Solid MDX
By Russ Whitney

The MDX language is powerful but not easy to use. On the surface, MDX looks like SQL, but it can quickly become more complex because of the multidimensional nature of the underlying cube data. After more than 3 years of using MDX, I've found I'm more productive when I apply design and debugging techniques that help me better understand MDX and create more accurate MDX statements. The techniques I use for developing MDX are similar to those I use for developing software in other languages: for complex problems, I use pseudo coding and debug the code by displaying intermediate results. Let's look at an example of how I use these techniques, and along the way, I'll show you how to use a .NET language to develop MDX user-defined functions (UDFs).

If you have any formal software-development education, you know that to solve a complex problem, you first break the problem into parts and solve each part independently. Then, it's always a good idea to step through each line of your code, using a debugger to verify that the code works as intended. Most software developers know that these practices are good habits, but not enough programmers apply them. These good programming habits can help you effectively deal with MDX's complexity. For example, say you need to answer a typical business question such as, "Based on unit sales, what are the top three brand names for each product department?" The MDX query that Listing 1 shows answers the question; Figure 1 shows the results.

Listing 1: Query That Returns the Top Three Brand Names Based on Unit Sales

SELECT {[Unit Sales]} ON COLUMNS,
GENERATE( [Product Department].MEMBERS,
    { [Product].CURRENTMEMBER,
      TOPCOUNT( DESCENDANTS( [Product].CURRENTMEMBER, [Brand Name] ),
          3, [Unit Sales] ) } ) ON ROWS
FROM Sales

Figure 1: The results generated by the query in Listing 1

I used the familiar FoodMart 2000 Sales cube that comes with Analysis Services as the basis for my example. I have enough experience with MDX that when I wrote this query, it ran the first time (thus I skipped the good habit of breaking the code into parts).
But the query is complicated because it performs ranking (TOPCOUNT) inside an iterative loop (GENERATE), and I wasn’t sure I was getting the answer I really wanted. Let’s see how I work through the problem in a way that emphasizes modularity (i.e., addressing each part of the problem separately) and accuracy. First, I use a design methodology called pseudo coding. Pseudo coding is a process of writing in plain language the steps for how you plan to implement your solution. For this problem, I want my code to follow the process that the pseudo code below describes. For each product department, 1. find the set of all brand names for this product department 2. return the product department name 3. return the three brand names that have the most unit sales When I start to translate this pseudo code into MDX, I get the following: <<ANSWER>> = GENERATE( [Product Department].MEMBERS, <<Dept and Top Brands>> ) Here, the GENERATE() function steps through a set of items and evaluates an MDX expression for each item in the set. This statement shows that to get the answer, I need to determine the product department name and the top brand names within it for each product department. Next, I expand the <<Dept and Top Brands>> item in the previous statement to call out the current product department. The following expression shows that I need another expression to determine the top brands within this department: Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 5 Rock-Solid MDX 39 <<Dept and Top Brands>> = { [Product].CURRENTMEMBER, <<Top Brands Within Dept>> } To determine the top brands within the product department, I use the TOPCOUNT() function and specify that I want the top three brands based on unit sales: <<Top Brands Within Dept>> = TOPCOUNT( <<Brands Within Dept>>, 3, [Unit Sales] ) Finally, I determine the brands within the product department by using the DESCENDANTS() function with the selected product department: <<Brands Within Dept>> = DESCENDANTS( [Product] .CURRENTMEMBER, [Brand Name] ) Remember, the GENERATE() function steps through the product departments and sets the product dimension’s CURRENTMEMBER to the name of the current product department while evaluating the inner MDX expression. If I take the MDX code fragments I created above and use the WITH statement to turn the code into a modular MDX statement, I get the MDX statement that Listing 2 shows. In Listing 2, I’ve used WITH statements to separate two of the three pseudo code steps from the main body of the query (SELECT ...FROM) to improve readability and make the overall query use a more modular approach to solve the problem. If I execute this new MDX statement in the MDX Sample Application, I get the answer that Figure 2 shows. Notice that Figure 2’s results aren’t the same as Figure 1’s even though I used the same MDX functions to develop the queries. Which answer is correct? Listing 2: Modular Version of Listing 1’s MDX Statement WITH SET [Brands Within Dept] AS ‘DESCENDANTS( [Product].CURRENTMEMBER, [Brand Name])’ SET [Top Brands Within Dept] AS ‘TOPCOUNT( [Brands Within Dept], 3, [Unit Sales] )’ SELECT {[Unit Sales]} ON COLUMNS, GENERATE( [Product Department].MEMBERS,{ [Product].CURRENTMEMBER, [Top Brands Within Dept] } ) ON ROWS FROM Sales Brought to you by Microsoft and Windows IT Pro eBooks 40 A Jump Start to SQL Server BI Figure 2 The results generated by the MDX statement Close examination reveals that Figure 2 definitely doesn’t show the right answer. 
For one thing, Hermanos isn’t a brand in the Alcoholic Beverages department. But even if you didn’t know that Hermanos belongs in the Produce department, you’d likely notice that the Unit Sales values of the three brands listed as the top brands in the Alcoholic Beverages department (Hermanos, Tell Tale, and Ebony) total more than the amount for the whole Alcoholic Beverages department ($6838.00). These two incongruities prove that Figure 2 shows the wrong answer, but how can I find out whether Figure 1 shows the correct answer? To answer this question and to understand how MDX executes this query and other complex queries, I developed a simple MDX debugging tool. This tool is an MDX UDF that uses the Windows MessageBox() function to display any string. The UDF lets you display on the screen intermediate results inside an MDX query while the query is executing. Listing 3 shows the UDF’s source code, which I wrote in C#. Listing 3: MDX UDF Written in C# using System; using System.Windows.Forms; using System.Runtime.InteropServices; namespace dotNETUDFs { /// <summary> /// Functions for use in MDX /// </summary> [ClassInterface(ClassInterfaceType.AutoDual)] public class MDXFuncs { Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 5 Rock-Solid MDX 41 private int counter = 10; // Constructor public MDXFuncs() { MessageBox.Show(“MDXFuncs constructed”); } // This controls how many // message boxes will display // before the // the next Reset() must be called. public int Reset(int Count) { counter = Count; return Count; } // Displays a message box with // the specified caption and contents // and returns the // contents. Once the counter // goes to zero, you must call // Reset() again for messages to appear. public string MsgBox(string caption, string sList) { if (counter > 0) { MessageBox.Show(sList, caption); counter -= 1; } return sList; } } } Brought to you by Microsoft and Windows IT Pro eBooks 42 A Jump Start to SQL Server BI It took me a while to figure out the steps for developing a UDF with C#. So if you haven’t already developed an MDX UDF with a .NET language, here are the steps you need to follow: 1. Create a .NET-project type of class library. 2. Edit the line in the AssemblyInfo.cs file that contains the AssemblyVersion information so that it contains a hard-coded version number rather than an auto-generated version number. In my UDF, I used the following line: [assembly: AssemblyVersion(“1.0.0.0”)] .NET is picky about assembly version numbers, and without a constant version number, I couldn’t get MDX to recognize my UDFs. 3. Open the Project-Properties dialog box and change the Register for COM Interop flag in the Build properties to TRUE. This change registers your .NET class library as a COM DLL, which is required for MDX UDFs. 4. Place a ClassInterface statement just before the start of the class definition, as Listing 3 shows. This statement tells Visual Studio how to expose the class to the COM interoperability layer. 5. Add a using System.Runtime.Interop Services statement at the start of your C# source file, as Listing 3 shows. The ClassInterface statement in Step 4 requires InteropServices. When these steps are complete, you’re ready to add methods to your class definition, compile them, and use them from MDX. For my UDF, I created a method called MsgBox() that displays on the screen a box containing a message and caption that I specified as the method’s parameters. 
The method returns the message that it displays so that you can embed the method in the middle of an MDX query without altering the query results. Compiling a C# project creates a DLL and a TLB file in the project's bin\Debug subdirectory. The TLB file is the COM type library that you need to register with Analysis Services to make your C# methods available for use. I used the following statement in the MDX Sample Application to register my type library. Note that dotNETUDFs is the name I chose for my C# project.

USE LIBRARY "C:\Documents and Settings\rwhitney\My Documents\Visual Studio Projects\dotNETUDFs\bin\Debug\dotNETUDFs.tlb"

After the library is registered, you can immediately start using the C# methods. Listing 4 shows the code I used to embed the C# MsgBox() method inside Listing 1's MDX query. MsgBox() requires and returns only string items, but the TOPCOUNT() function returns a set of members. To make the two functions compatible, I sandwiched the MsgBox() method between the MDX functions STRTOSET() and SETTOSTR() to convert the TOPCOUNT() set into a string and back to a set. Figure 3 shows the first message that the screen displays when you execute Listing 4's query.

Listing 4: Query That Contains the C# MsgBox() Method

SELECT {[Unit Sales]} ON COLUMNS,
GENERATE( [Product Department].MEMBERS,
    { [Product].CURRENTMEMBER,
      STRTOSET( MsgBox( "TOPCOUNT Results",
          SETTOSTR( TOPCOUNT( DESCENDANTS( [Product].CURRENTMEMBER, [Brand Name] ),
              3, [Unit Sales] ) ) ) ) } ) ON ROWS
FROM Sales

Figure 3: The first message box displayed by the query in Listing 4

In the C# MsgBox() method, notice that I use a counter variable to limit the number of times a message is displayed on the screen. This limit is helpful when the MsgBox() method is called hundreds or thousands of times in a query. I could also achieve the same result by using a Cancel button on the message box rather than a counter. When the counter in my example reaches its limit, I must call the Reset method to restore the counter to a nonzero value so that it once again displays messages. I used the following separate MDX query to call the Reset method:

WITH MEMBER Measures.Temp AS 'Reset(5)'
SELECT { Temp } ON COLUMNS
FROM Sales

Now I could use the MsgBox() method to figure out why the query in Listing 2 returned the wrong result. I altered Listing 2's query as Listing 5 shows. I used the MsgBox() method to display what the CURRENTMEMBER of the product dimension was when the [Brands Within Dept] set was evaluated. I learned that the [Brands Within Dept] set was evaluated only twice during the query execution instead of each time GENERATE() discovered a product department. Also, the CURRENTMEMBER was the All member (i.e., the topmost member) of the product dimension, not a product department. This means that Analysis Services evaluates and caches a WITH SET clause for the rest of the query execution. That's why Listing 2's query results were wrong.
Brought to you by Microsoft and Windows IT Pro eBooks 44 A Jump Start to SQL Server BI Listing 5: Query That Uses the MsgBox() Method to Discover the Problem WITH SET [Brands Within Dept] AS ‘DESCENDANTS( STRTOTUPLE( MsgBox(“Product CURRENTMEMBER”, TUPLETOSTR( (Product.CURRENTMEMBER) ))).item(0) , [Brand Name] )’ SET [Top Brands Within Dept] AS ‘TOPCOUNT( [Brands Within Dept], 3, [Unit Sales] )’ SELECT {[Unit Sales]} ON COLUMNS, GENERATE( [Product Department].MEMBERS,{ [Product].CURRENTMEMBER, [Top Brands Within Dept] } ) ON ROWS FROM Sales By designing your MDX queries one part at a time, as I demonstrated in this example with pseudo code, you can tackle complex problems. Then, you can make sure the queries are operating correctly by displaying the results one part at a time. I hope you find this powerful two-part process useful for creating your own MDX. Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 6 XML for Analysis: Marrying OLAP and Web Services 45 Chapter 6: XML for Analysis: Marrying OLAP and Web Services By Rob Ericsson XML for Analysis (XMLA)—a Web-service standard proposed and supported by Microsoft and leading OLAP companies—brings together Web services and OLAP technologies by providing an XML schema for OLAP and data-mining applications. Essentially, XMLA lets you explore and query multidimensional data through Web services, which means analytical applications can move away from their expensive and difficult-to-maintain client/ server roots toward a more flexible, Web-based architecture. XML Web services architectures connect applications and components by using standard Internet protocols such as HTTP, XML, and Simple Object Access Protocol (SOAP). These architectures offer the promise of interoperable distributed applications that can be shared between and within enterprises. Amazon.com, for example, uses Web services to support associate programs that let third parties sell from its catalog, and Microsoft’s MapPoint Web service integrates location-based services into a variety of applications. Web services are becoming crucial pieces of enterprise application architecture by letting you loosely couple services from disparate applications in a way that’s easy to maintain as business processes change. The XMLA specification, available at http://www.xmla.org/, describes the following design goals: • Provide to remote data-access providers a standard data-access API that application developers can use universally across the Internet or a corporate intranet to access multidimensional data. • Optimize a stateless architecture that requires no client components for the Web and minimal round-trips between client and server. • Support technologically independent implementations of XMLA providers that work with any tool, programming language, technology, hardware platform, or device. • Build on open Internet standards such as SOAP, XML, and HTTP. • Leverage and reuse successful OLE DB design concepts so that application developers can easily enable OLE DB for OLAP applications and OLE DB providers for XMLA. • Work efficiently with standard data sources such as relational OLAP databases and data-mining applications. By fulfilling these design goals, XMLA provides an open, industry-standard way to access multidimensional data from many different sources through Web services—with support from multiple vendors. 
Brought to you by Microsoft and Windows IT Pro eBooks 46 A Jump Start to SQL Server BI XMLA is based on SOAP, and you can use it from any application-programming language that can call SOAP methods, such as Visual Basic .NET, Perl, or Java. SOAP is a lightweight, XML-based protocol for exchanging structured and type information over the Web. Structured information contains content and an indication of what that content means. For example, a SOAP message might have an XML tag in it called CustomerName that contains customer name information. A SOAP message is an XML document that consists of a SOAP envelope (the root XML element that provides a container for the message), an optional SOAP header containing application-specific information (e.g., custom-authentication information), and a SOAP body, which contains the message you’re sending. Calling SOAP methods is simply a matter of wrapping the arguments for the SOAP method in XML and sending the request to the server. Because SOAP’s overall goal is simplicity, the protocol is modular and easy to extend to new types of applications that can benefit from Web services. You can use Internet standards to integrate SOAP with your existing systems. Most mainstream development platforms offer some support for calling SOAP-based Web services. Both Java 2 Enterprise Edition (J2EE) and the Microsoft .NET Framework have strong support for Web services, making the invocation of remote services almost transparent to the developer. Besides working with XMLA directly, you can use the Microsoft .NET-based ADO MD.NET library to build .NET applications that use XMLA. ADO MD.NET is the successor to the OLE DB for OLAP—based ADO MD. However, I don’t cover ADO MD.NET in this chapter. Instead, I show you how to use the underlying XMLA protocol to build an analytic application on any device or platform or in any language that supports XML. I assume you have some knowledge of OLAP fundamentals, at least a passing familiarity with MDX, and some exposure to XML. For an introduction to XML Web services, see Roger Wolter’s Microsoft article “XML Web Services Basics’’ at http://msdn.microsoft.com/library/en-us/dnwebsrv/html/webservbasics.asp. You’ll find an even more basic and technology-neutral introduction in Venu Vasudevan’s Web services article “A Web Service Primer” at http://webservices.xml.com/pub/a/ws/2001/04/04/webservices/index.html. Installing XMLA To use XMLA with SQL Server 2000, download the XML for Analysis Software Development Kit (SDK), available at http://www.microsoft.com/downloads/details.aspx?familyid=7564a3fd-4729 -4b09-9ee7-5e71140186ee&displaylang=en, and install it on a Web server that can access your Analysis Services data source through OLE DB for OLAP. (You can simply use the server that has Analysis Services installed on it.) SQL Server 2005 Analysis Services will support XMLA as a native protocol, so you won’t have to separately install XMLA. But for now, this step is necessary. Installing the SDK is straightforward, but to run the installer, you must be logged on as an Administrator to the machine on which you’re performing the installation. When you double-click the XMLADSK.msi installation package, the installer walks you through the process. Unless you have a Secure Sockets Layer (SSL) certificate configured on your Web server, you need to select Enable HTTP and HTTPS during the Connection Encryption Settings step to allow your SQL Server unsecured communication with the XMLA Provider through HTTP. 
Note that using the XMLA Provider in unsecured mode isn’t a good idea for a production system because the provider will pass your data across the network in plain text for anyone to intercept. But for just learning about XMLA in a non-production environment, you’re probably OK using unsecured communication. Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 6 XML for Analysis: Marrying OLAP and Web Services 47 After installing the SDK, you need to set up the data sources that you’re going to connect to through XMLA and make the server available to clients by creating a virtual directory for the XMLA Provider. To set up the data sources, you edit the datasources.xml file in the Config subfolder of the installation folder you selected when installing the provider. The default path for installation is C:\Program Files\Microsoft XML for Analysis SDK\. The datasources.xml file contains a preconfigured example connection for the Local Analysis Server that you can copy to set up your own data sources. Figure 1 shows part of the datasources.xml file. The most important parts of this file are the required elements that facilitate the connection to the OLAP data source: DataSourceName for naming the data source; DataSourceDescription for adding a text description of the data source; URL, which provides the URL for the XMLA Provider; DataSourceInfo, which describes the OLE DB for OLAP connection to the Analysis Servers; ProviderType, which enumerates the type or types of provider being referenced—tabular data provider (TDP), multidimensional data provider (MDP), data-mining provider (DMP); and AuthenticationMode (Unauthenticated, Authenticated, or Integrated), which describes how the Web service will authenticate connections to the provider. The XML for Analysis Help file (which you installed with the SDK at \Microsoft XML for Analysis SDK\Help\1033\smla11.chm) contains complete information about all these configuration options. Figure 1 Partial datasources.xml file Brought to you by Microsoft and Windows IT Pro eBooks 48 A Jump Start to SQL Server BI Once you’ve set up the data sources, you need to create in Microsoft IIS a virtual directory for the XMLA Provider. The virtual directory lets IIS access a specific folder on the server through HTTP, which is how we’ll connect to the XMLA Provider for this example. The easiest way to set up a virtual directory is to open the IIS Manager, select the server on which you want to create the virtual directory, right-click the Web site you want to use for the XMLA Provider, and select New, Virtual Directory. The Virtual Directory Creation Wizard then guides you through the rest of the process. The first step is to name the virtual directory; XMLA is usually a good choice. Next, you select the content directory, which lets IIS map files in that directory to HTTP requests. For the XMLA Provider, the content directory is the path to the Msxisapi.dll file installed in the C:\Program Files\Microsoft XML For Analysis SDK\Isapi folder (the default location) during setup. Then, set the access permissions for this folder by selecting the Read, Run Scripts, and Execute check boxes, and finish the wizard. After you configure the virtual directory, you set access permissions on it. In IIS Manager, right-click the virtual directory you just created and select Properties. In the Properties window, select the Directory Security tab and configure the security permissions. 
For learning about how XMLA works, the default permissions setting (anonymous access) is sufficient. If you’re configuring the XMLA Provider on Windows Server 2003, you must take some additional steps to enable the protocol on the server. The XMLA Help topic “Enable the XML for Analysis Web Service Extension on Windows Server 2003” tells you how to get the XMLA Provider to work on Windows Server 2003. Using XMLA: Discover and Execute One of XMLA’s greatest strengths is that it simplifies data retrieval compared to working directly with OLE DB for OLAP. The XMLA Provider has only two methods: Discover and Execute. You use the Discover method to retrieve metadata that describes the services a specific XMLA Provider supports. You use the Execute method to run queries against the Analysis Services database and return data from those queries. Discover. Discover is a flexible method that a client can use repeatedly to build a picture of the configuration and capabilities of the data provider. So, for example, a client might first request the list of data sources that are available on a particular server, then inquire about the properties and schemas those data sources support so that a developer can properly write queries against the data source. Let’s look at the arguments you send to Discover, then walk through some examples that show how to use the method. Listing 1’s XML code shows a SOAP call to retrieve a list of data sources from the server. The first parameter, RequestType, determines the type of information that Discover will return about the provider. Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 6 XML for Analysis: Marrying OLAP and Web Services 49 Listing 1 XML code using a SOAP call <?xml version=”1.0” encoding=”utf-8”?> <soap:Envelope xmlns:soap=”http://schemas .xmlsoap.org/soap/envelope/” xmlns:xsi=”http://www.w3.org/ 2001/XMLSchema-instance” xmlns:xsd=”http://www.w3.org/ 2001/XMLSchema”> <soap:Body> <Discover xmlns=”urn:schemas microsoft-com:xml-analysis”> <RequestType>DISCOVER_ DATASOURCES</RequestType> <Restrictions/> <Properties/> </Discover> </soap:Body> </soap:Envelope> The available types let you get a list of the data sources available on the server (DISCOVER_DATASOURCES), a list of properties about a specific data source on the server (DISCOVER_PROPERTIES), a list of supported request types (DISCOVER_SCHEMA_ROWSETS), a list of the keywords the provider supports (DISCOVER_KEYWORDS), and a schema rowset constant to retrieve the schema of a provider-defined data type. Table 1 lists the RequestType parameters. TABLE 1: RequestType Parameters Parameter Name DISCOVER_DATASOURCES DISCOVER_PROPERTIES DISCOVER_SCHEMA_ROWSETS DISCOVER_ENUMERATORS DISCOVER_KEYWORDS DISCOVER_LITERALS Description A list of data sources available on the server. A list of information and values about the requested properties that the specified data source supports. The names, values, and other information of all supported RequestTypes enumeration values and any additional provider-specific values. A list of names, data types, and enumeration values of enumerators that a specific data source’s provider supports. A rowset containing a list of keywords reserved by the provider. Information about literals the data source provider supports. Schema Rowset Constant The schema rowset that the constant defines. The second parameter, Restrictions, lets you put conditions on the data that Discover returns. 
The RequestType in the call to the Discover method determines the fields that the Restrictions parameter can filter on. Table 2 describes the fields that the various schema types in XMLA can use to restrict returned information. If you want to return all the data available for a given RequestType, leave the Restrictions parameter empty. Brought to you by Microsoft and Windows IT Pro eBooks 50 A Jump Start to SQL Server BI TABLE 2: Fields That XMLA Schema Types Can Use to Restrict Data the Discover Method Returns Request Type DISCOVER_DATASOURCES Field DataSourceName URL ProviderName ProviderType Description The name of the data source (e.g., FoodMart 2000). The path XMLA methods use to connect to the data source. The name of the provider behind the data source. An array of one or more of the provider-supported data types: MDP for multidimensional data provider, TDP for tabular data provider, and DMP for data mining provider. AuthenticationMode The type of security the data source uses. Unauthenticated means no UID or password is needed. Authenticated means that a UID and password must be included in the connection information. Integrated means that the data source uses a built-in facility for securing the data source. DISCOVER_PROPERTIES PropertyName An array of the property names. DISCOVER_SCHEMA_ROWSETS SchemaName The name of the schema. DISCOVER_ENUMERATORS EnumName An array of the enumerator’s names. DISCOVER_KEYWORDS Keyword An array of the keywords a provider reserves. DISCOVER_LITERALS LiteralName An array of the literals’ names. The Properties parameter provides additional information about the request that the other parameters don’t contain. For example, Timeout specifies the number of seconds the provider will wait for the Discover request to succeed before returning a timeout message. Table 3 lists some common XMLA Provider for Analysis Services properties you’re likely to use. You can specify properties in any order. If you don’t specify a Properties value, Discover uses the appropriate default value. TABLE 3: Common Properties Available in the XMLA Provider for Analysis Services Property AxisFormat BeginRange Catalog DataSourceInfo EndRange Password ProviderName Timeout UserName Default TupleFormat Description The format for the MDDataSet Axis element. The format can be either TupleFormat or ClusterFormat. -1 (all cells) An integer value that restricts the data set a command returns to start at a specific cell. Empty string The database on the Analysis Server to connect to. Empty string A string containing the information needed to connect to the data source. -1 (all data) An integer value that restricts the data set a command returns to end at a specific cell. Empty string A string containing password information for the connection. Empty string The XML for Analysis Provider name. Undefined A numeric timeout that specifies in seconds the amount of time to wait for a connection to be successful. Empty string A string containing username information for the connection. The Discover method call in Listing 1 returns results in XML. The settings you give the parameters RequestType, Restrictions, and Properties determine the contents of Result, which is an output parameter. In Listing 1, note that I set RequestType to DISCOVER_DATASOURCES and Restrictions and Properties to null so that Discover returns the entire list of data sources in the default format (tabular format in this case). 
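For example, to ask the provider about just one property rather than all of them, you can restrict a DISCOVER_PROPERTIES request by PropertyName. The sketch below follows the envelope structure of Listing 1 and shows only the Discover element; the RestrictionList and PropertyList wrapper elements come from the XMLA specification, and the particular property and connection string are illustrative:

<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
  <RequestType>DISCOVER_PROPERTIES</RequestType>
  <Restrictions>
    <RestrictionList>
      <PropertyName>Catalog</PropertyName>
    </RestrictionList>
  </Restrictions>
  <Properties>
    <PropertyList>
      <DataSourceInfo>Provider=MSOLAP;Data Source=local</DataSourceInfo>
    </PropertyList>
  </Properties>
</Discover>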
To call a SOAP method, you have to send the SOAP envelope to the Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 6 XML for Analysis: Marrying OLAP and Web Services 51 Web service through HTTP. I’ve provided a sample Web application, which you can download at InstantDoc ID 44006. The sample application shows exactly how you might send a SOAP envelope in JScript by using the Microsoft.XMLHTTP object in the SubmitForm() method. The sample also shows you more examples of how to use the Discover method and how to use the data-source information retrieved from the first call to Discover to populate the next call to Discover. Execute. After you use Discover to determine the metadata for the data source, you can use that metadata to retrieve data. For data retrieval, XMLA provides the Execute method. The method call for Execute looks like this: Execute (Command,Properties, Results) As Listing 2’s SOAP call to Execute shows, the Command parameter contains in a <Statement> tag the MDX statement you want to run against your OLAP server. Similar to the Properties parameter in the Discover method, the Properties parameter in Execute provides additional information that controls the data the method returns or the connection to the data source. You must include the Properties tag in your Execute method call, but the tag can be empty if you want to use the defaults for your request. The Results parameter represents the SOAP document the server returns. Results’ contents are determined by the other two parameters. Listing 2 SOAP call to Execute <?xml version=”1.0” encoding=”utf-8”?> <soap:Envelope> <soap:Body> <Execute xmlns=”urn:schemas-microsoft-com:xml-analysis”> <Command> <Statement>select {[Product].children} on rows, {[Store].children} on columns from Sales </Statement> </Command> <Properties> <PropertyList> <DataSourceInfo> Provider=MSOLAP;Data Source=local </DataSourceInfo> <Catalog>FoodMart 2000</Catalog> <Format>Multidimensional</Format> <AxisFormat>TupleFormat</AxisFormat> </PropertyList> </Properties> </Execute> </soap:Body> </soap:Envelope> Brought to you by Microsoft and Windows IT Pro eBooks 52 A Jump Start to SQL Server BI Listing 2’s code shows an example of a call to Execute that contains an MDX SELECT statement. You call the Execute method the same way you call the Discover method, by sending the SOAP envelope to the Web service through HTTP. As with any SOAP request, the entire message is contained in a SOAP envelope. Within the SOAP envelope, the SOAP body contains the guts of the Execute method call, starting with the Command parameter. The Command parameter contains the MDX query that will run on the server. The Properties parameter comes next, containing the PropertyList parameter that holds each of the properties the XML code will use for the Execute request. In this case, the Execute call specifies in the PropertyList parameter DataSourceInfo, Catalog, Format, and AxisFormat. You can retrieve all this information in a call to Discover like the one that Listing 1 shows. Finally, you close the body and envelope, and the request is ready to send via HTTP to the XMLA Provider. Getting Results When the XMLA Provider receives a request, it passes the request to the MDX query engine, which parses and executes it. After obtaining the MDX results, the XMLA Provider packages them into a SOAP reply and sends them back to the requesting client. An Execute response can be quite long depending on the amount of data returned and the format used. 
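The sample application's SubmitForm() method handles this step with the Microsoft.XMLHTTP object. Stripped to its essentials, the call looks roughly like the following JScript; the URL (pointing at the Msxisapi.dll file in whatever virtual directory you created for the provider) and the SOAPAction header value are assumptions you should adapt to your own setup:

// soapEnvelope holds the XML from Listing 2 as a string.
var xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
xmlhttp.open("POST", "http://localhost/XMLA/msxisapi.dll", false);
xmlhttp.setRequestHeader("Content-Type", "text/xml");
xmlhttp.setRequestHeader("SOAPAction", "\"urn:schemas-microsoft-com:xml-analysis:Execute\"");
xmlhttp.send(soapEnvelope);
// The responseText property now contains the SOAP reply described below.
var resultXml = xmlhttp.responseText;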
To see the results of an Execute query, load the sample application and run an MDX query. To load the sample application, simply open it in Internet Explorer (IE). You can either copy the file to a virtual directory and open it over HTTP or double-click the file to open it in the browser. You'll see all the XML that the query returned in the sample Web application; Figure 2 shows part of the results.

Figure 2 Sample Application

The SOAP response from a call to an Execute method looks similar to the results from a call to Discover. As Listing 2 shows, the calling code includes the usual SOAP Envelope and Body tags as the top-level wrappers, then shows the MDX query packaged for transmission in XML. You have two options for the format of an Execute request's results: Rowset and MDDataSet (which appears as Multidimensional in the listing). The Rowset format is a flattened tabular structure that contains rows and columns along with the data elements. MDDataSet is a multidimensional format that contains three sections: OLAPInfo, Axes, and CellData. You'll see these three sections if you scroll through the results of the sample application. The multidimensional format represents the multidimensional data in a hierarchical format that's more representative of the structure of the data than the flattened tabular format.

OLAPInfo defines the structure of the results. The first section of OLAPInfo, CubeInfo, lists the cubes where the data originated. Next, AxesInfo has an AxisInfo element for each axis in the data. Every AxisInfo element contains a HierarchyInfo element for each hierarchy on that axis, and each HierarchyInfo always lists the standard member properties UName (Unique Name), Caption, LName (Level Name), and LNum (Level Number). These property elements can also carry default values. If the query results include many repeating values, the default values can dramatically reduce the size of the returned data by returning only the data elements that are different from the default. Last, the CellInfo section of OLAPInfo describes the standard and custom properties that the query returns for each cell. The standard properties are Value, FmtValue (Format Value), ForeColor, and BackColor. Optional properties depend on the MDX query you use to retrieve the results.

Describing XMLA results in abstract terms is difficult because the exact data returned varies depending on the query you use. The easiest way to understand OLAPInfo is to walk through an example of the results from a specific query. Consider the following MDX query:

select {[Product].children} on rows,
  {[Store].children} on columns
from Sales

Running this query through the XMLA Provider by using the Execute method results in the AxesInfo section that Figure 3 shows. The query returns columns (Axis0) and rows (Axis1). Each axis contains only one hierarchy: The columns axis contains the Store hierarchy, and the rows axis contains the Product hierarchy. After defining the dimensional axes, Figure 3 shows the slicer dimension, which is an MDX dimension for filtering multidimensional data. Slicer dimensions appear in the WHERE clause of an MDX query and display every hierarchy in the cube that doesn't appear in the dimensional axes.
The repetition of this information is useful in XMLA because you can use the information to show which other hierarchies are available in a given cube and write further queries against those hierarchies.

Figure 3: AxesInfo section resulting from the Execute call

<AxesInfo>
  <AxisInfo name="Axis0">
    <HierarchyInfo name="Store">
      <UName name="[Store].[MEMBER_UNIQUE_NAME]" />
      <Caption name="[Store].[MEMBER_CAPTION]" />
      <LName name="[Store].[LEVEL_UNIQUE_NAME]" />
      <LNum name="[Store].[LEVEL_NUMBER]" />
      <DisplayInfo name="[Store].[DISPLAY_INFO]" />
    </HierarchyInfo>
  </AxisInfo>
  <AxisInfo name="Axis1">
    <HierarchyInfo name="Product">
      <UName name="[Product].[MEMBER_UNIQUE_NAME]" />
      <Caption name="[Product].[MEMBER_CAPTION]" />
      <LName name="[Product].[LEVEL_UNIQUE_NAME]" />
      <LNum name="[Product].[LEVEL_NUMBER]" />
      <DisplayInfo name="[Product].[DISPLAY_INFO]" />
    </HierarchyInfo>
  </AxisInfo>
  <AxisInfo name="SlicerAxis">
    <HierarchyInfo name="Measures">
      ...
    </HierarchyInfo>
    <HierarchyInfo name="Time">
      ...
    </HierarchyInfo>
  </AxisInfo>
</AxesInfo>

As I noted earlier, the last part of the OLAPInfo section of a multidimensional format, CellInfo, describes the properties the query will return for each cell in the result set. Because the query I use in this example doesn't specify any additional properties, the CellInfo section displays only the basic Value and FmtValue information:

<OlapInfo>
  <!-- the AxesInfo goes here -->
  <CellInfo>
    <Value name="VALUE" />
    <FmtValue name="FORMATTED_VALUE" />
  </CellInfo>
</OlapInfo>

The next section of the results in MDDataSet format is Axes, which contains the data the query returns organized in either TupleFormat, as Figure 4 shows, or ClusterFormat.

Figure 4 Query data organized in TupleFormat

Let's look at an example to see the differences between these two formats. Say you have three country categories (Canada, Mexico, and USA) and three product categories (Drink, Food, and Non-Consumable), which produce nine combinations of countries and products. Logically, you have several options for representing this set in a written notation. First, you can simply list the combinations:

{(Canada, Drink), (Canada, Food), (Canada, Non-Consumable),
 (Mexico, Drink), (Mexico, Food), (Mexico, Non-Consumable),
 (USA, Drink), (USA, Food), (USA, Non-Consumable)}

This is the kind of set representation that the TupleFormat uses. Each pair is a tuple, and each tuple contains a member from each dimension you included in the results. So if you had three dimensions in the query, the resulting tuple would have three members. Alternatively, you can use a mathematical representation of the combinations of the two sets. Using the concept of a Cartesian product, you can represent the set of data as:

{Canada, Mexico, USA} x {Drink, Food, Non-Consumable}

The Cartesian product operator (x) between the two sets represents the set of all possible combinations of the two sets. The ClusterFormat uses this representation. And although this is a much more compact representation, it requires more interpretation to understand and navigate.

The last section in MDDataSet is CellData, which contains values for each cell the MDX query returns. An ordinal number in a zero-based array refers to the cells.
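As a sketch of what that looks like, the CellData returned for a query like the preceding one might begin as follows; the ordinals map each cell back to a position on the axes, and the values shown here are placeholders rather than real FoodMart numbers:

<CellData>
  <Cell CellOrdinal="0">
    <Value>135215</Value>
    <FmtValue>135,215</FmtValue>
  </Cell>
  <Cell CellOrdinal="1">
    <Value>6324</Value>
    <FmtValue>6,324</FmtValue>
  </Cell>
  ...
</CellData>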
(To learn how to calculate ordinal numbers, see the Web sidebar “Mapping the Tuple Ordinals” at InstantDoc ID 44007.) If a cell isn’t present in the array, the default value from AxisInfo serves as the value for the cell. If no default value is specified, the value is null. A Convenient Marriage This chapter has introduced XMLA as a Web services layer that uses SOAP to tap into OLAP data. XMLA provides the basis for standards-based, Internet-ready analytic applications, which can be easily deployed and shared across and among enterprises. By using the XML for Analysis SDK, you can use XMLA today in SQL Server 2000 Analysis Services (or in other vendors’ platforms), and XMLA will be a core part of the SQL Server 2005 Analysis Services platform. With its flexibility and broad support, XMLA is an excellent tool for current or future analytic application projects. Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 7 Improving Analysis Services Query Performance 57 Chapter 7: Improving Analysis Services Query Performance By Herts Chen Analysis Services is a high-performance, multidimensional query engine for processing analytical and statistical queries, which a relational SQL engine doesn’t handle well. When such queries are simple or have pre-aggregations, Analysis Services can make your job easier. But when queries become complex, Analysis Services can bog down. For example, an SQL SELECT statement that includes a GROUP BY clause and aggregate functions can take as long as a few minutes—or more. You can retrieve the same result set in just a few seconds if you execute an MDX statement against an Analysis Services Multidimensional OLAP (MOLAP) cube. You perform this workaround by passing an MDX query from SQL Server to a linked Analysis Server by using the OPENQUERY function in an SQL SELECT statement, as SQL Server Books Online (BOL) describes. Analysis Services then precalculates the necessary aggregations during the processing and creation of the MOLAP cube so that the results are completely or partially available before a user asks for them. However, precalculating every imaginable aggregation is impossible; even a completely processed MOLAP cube can’t precalculate aggregations such as those in calculated cells, calculated members, custom rollup formulas, custom member formulas, FILTER statements, and ORDER statements. If you’re used to the performance you get when you retrieve only precalculated aggregations, the performance you get from an MDX query that involves these kinds of runtime calculations might seem unbearably slow. The problem might occur not because Analysis Services can’t handle runtime calculations efficiently but because your MOLAP cube’s design isn’t optimal. In my work building and maintaining a data warehouse for the city of Portland, Oregon, I optimize Analysis Services so that traffic engineers can quickly access a variety of statistics about traffic accidents in the city. Through many experiments, I’ve discovered that an important key to MOLAP optimization is cube partitioning. In this chapter, I explore and compare various MOLAP cube-partitioning strategies and their effects on query performance. Then, I suggest some simple guidelines for partition design. Traffic-Accident Data Warehouse My study of query performance is based on my work with a real dimensional data warehouse that maintains traffic-accident history. 
When I conducted this study, the traffic-accident data warehouse contained 17 years of data (1985 through 2001) and documented about 250,000 unique incidents. The complex part of this data warehouse is not its relatively small fact table but its many dimensions, which the snowflake schema in Figure 1 shows. Brought to you by Microsoft and Windows IT Pro eBooks 58 A Jump Start to SQL Server BI Figure 1 The data warehouse’s many dimensions Portland’s traffic engineers look for the street intersections that have the highest number of incidents. Then, they search for clues about which factors might cause a high number of crashes and what makes some accidents more severe than others. They look at a total of 14 factors (which are the data warehouse’s dimensions) including time, light, weather, traffic control, vehicle, and severity of occupant injuries. Among the dimensions, the Streets dimension (STREET_DIM) is the largest; it records roughly 30,000 street intersections in the Portland area. The total number of source records to build aggregations on is the result of a multi-way join of 14 one-to-many (1:M) or many-to-many (M:N) relationships from the fact table to the dimension tables. The Accident data warehouse contains only one measure: the distinct accident count (Incident_Count). A distinct count prevents the possibility of counting the same accident multiple times in a M:N relationship. Fortunately, the Streets dimension isn’t too large to use MOLAP cube storage, which provides the best query performance. Analysis Services defines a huge dimension as one that contains more than approximately 10 million members. Analysis Services supports huge dimensions only with Hybrid OLAP (HOLAP) or Relational OLAP (ROLAP) cubes. Queries and Bottlenecks Analysis Services responds to queries with varying performance, depending on the complexity of the query. For example, a MOLAP cube that you create by accepting the default partition in Analysis Manager would respond to a simple query like the one that Listing 1 shows by returning roughly 2000 records in only 5 seconds. Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 7 Improving Analysis Services Query Performance 59 Listing 1: Simple Query That Doesn’t Use Calculated Members SELECT { [Occupant_Severity].[All Occupant_Severity] } ON COLUMNS , {ORDER( { FILTER( [Street].[Street_List].Members, [Occupant_Severity].[All Occupant_Severity] >= 20 ) }, [Occupant_Severity].[All Occupant_Severity], BDESC ) } ON ROWS FROM [Default1] If your queries basically ask only for pre-aggregates in a few records or columns, any MOLAP cube with any percentage of aggregation—even as little as 5 percent—will perform well. However, for a query like the one that Listing 2 shows, which involves six calculated members, a 30 percentaggregated, single-partition MOLAP cube would take 52 seconds to return just 331 street intersections. These disparate results suggest that performance bottlenecks don’t depend on the size of the result set or on the percentage of aggregation in the cube. In fact, in my experience, any aggregations beyond 30 percent are a waste—you get no better performance for your effort. For simple queries, you don’t need high aggregation. For complex queries, high aggregation won’t help. Performance bottlenecks in Analysis Services typically come from calculated members that scan for multiple tuples and aggregate them on the fly. 
Listing 2: Complex Query Containing 6 Calculated Members -- Returns 7 columns in 331 records. WITH MEMBER Time.[Accident_Count] AS ‘Sum({Time.[1998], Time.[1999], Time.[2000]}, [Occupant_Severity].[All Occupant_Severity])’ MEMBER Time.[Fatal] AS ‘Sum({Time.[1998], Time.[1999], Time.[2000]}, [Occupant_Severity].&[Fatal])’ MEMBER Time.[Injury_A] AS ‘Sum({Time.[1998], Time.[1999], Time.[2000]}, [Occupant_Severity].&[Injury A, Incapacitating])’ MEMBER Time.[Injury_B] AS ‘Sum({Time.[1998], Time.[1999], Time.[2000]}, [Occupant_Severity].&[Injury B, Non-Incapacitating])’ MEMBER Time.[Injury_C] AS ‘Sum({Time.[1998], Time.[1999], Time.[2000]}, [Occupant_Severity].&[Injury C, Possible Injury])’ MEMBER Time.[PDO] AS ‘Sum({Time.[1998], Time.[1999], Time.[2000]}, [Occupant_Severity].[PDO])’ SELECT {Time.[Accident_Count], Time.[Fatal], Time.[Injury_A], Time.[Injury_B], Time.[Injury_C], Time.[PDO]} ON COLUMNS , {ORDER( { FILTER([Street].[Street_List].Members, (Time.[Accident_Count]) >= 20 ) }, (Time.[Accident_Count]), BDESC )} ON ROWS FROM [Default1] The City of Portland traffic engineers I work with typically ask ad hoc questions that are nonhierarchical along the Time dimension. For example, an engineer might ask me to calculate the total number of accidents during the past 3 years, the past 5 years, or any combination of years between 1985 and 2001. I can’t simply aggregate years by creating a new level above the Year level in the Time dimension; the new level would satisfy only one combination of years. This limitation Brought to you by Microsoft and Windows IT Pro eBooks 60 A Jump Start to SQL Server BI means all queries that involve a combination of years have to use calculated members to perform aggregations for the specified years. Listing 2’s query returns accident counts along the Time, Occupant_Severity, and Streets dimension members. Figure 2 shows the members of the Time and Occupant_Severity dimensions. Figure 2 Members of the Time and Occupant_Severity dimensions Listing 2’s query uses six calculated members—Accident_Count, Fatal, Injury_A, Injury_B, Injury_C, and PDO (Property Damage Only)—to sum the accidents in the years 1998, 1999, and 2000 for each of the five members of the Occupant_ Severity dimension. The query asks for a sorted and filtered result set of accident counts for each street intersection ([Street].[Street_List]) in each of these six calculated members. To contrast with the performance of such on-the-fly aggregation, I’ve included Listing 3, which accesses only pre-aggregations and doesn’t include calculated members. I used Listing 2 and Listing 3 as the benchmarks for my cube partitioning tests, which I discuss in a moment. Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 7 Improving Analysis Services Query Performance 61 Listing 3: Query That Returns the Same Columns as Listing 2 Without Using Calculated Members -- Returns 7 columns in 2239 records. SELECT { [Occupant_Severity].[All Occupant_Severity],[Occupant_Severity].&[Fatal], [Occupant_Severity].&[Injury A, Incapacitating], [Occupant_Severity].&[Injury B, Non-Incapacitating], [Occupant_Severity].&[Injury C, Possible Injury], [Occupant_Severity].&[PDO]} ON COLUMNS, {ORDER( { FILTER( [Street].[Street_List].Members, [Occupant_Severity].[All Occupant_Severity] >= 20 ) }, [Occupant_Severity].[All Occupant_Severity], BDESC ) } ON ROWS FROM [Default1] When you need to improve the performance of queries that involve calculated members, cube design is important. 
In my experience, the most important aspect of cube design isn’t how much memory you have, the number of disks or CPU threads you have, whether you use unique integers for member keys, or even whether you use the Usage-Based Optimization Wizard, but how you partition the cube. Partitioning is slicing a cube along a tuple such as ([Occupant_Severity].[Fatal], [Time].[2000]), which specifies a member from each dimension. For any dimension that you don’t specify in this tuple, the partition includes the entire dimension. Analysis Services keeps in the cube structure a direct pointer or index to the partition for that tuple. Whenever a query references that tuple or a subset of it, Analysis Services can get to the corresponding partition without scanning the entire cube. You can partition a cube in a nearly infinite number of ways, and Analysis Services supports as many partitions as you practically need for a cube. But without a clear rule for creating partitions, you could create too many cuts or wrong cuts on a cube and end up with worse performance than you’d get with one default partition. Usage-Based Partitioning You can partition a cube along any tuple of members at any level from any dimension. Analysis Services’ Partition Wizard calls such a tuple a data slice. Although Analysis Services can scan a small partition faster than a large one, a small partition contains fewer members. Analysis Services might have to perform more scans of multiple small partitions to cover the same number of members that one larger partition could contain. So the overhead of performing calculations on the results of multiple partitions might negate the advantage of the faster scan in each smaller partition. How you partition a cube depends on the queries you need to process. Logically, you might decide to partition along every tuple that a query specifies. For example, to improve Listing 2’s performance, you might be tempted to partition along each tuple of the cross join of {Time.[1998], Time.[1999], Time.[2000]}, [Occupant_Severity].Members (6 members), and [Street].[Street_List].Members (roughly 30,000 members). You’d create partitions along a total of 540,000 tuples (3 x 6 x 30,000 = 540,000). This seemingly simple plan creates two problems. First, scanning 540,000 partitions and summing the 3 years for each tuple of severity and street (a total of 180,000 tuples) would create significant performance overhead. Second, the amount of work and time to create and process Brought to you by Microsoft and Windows IT Pro eBooks 62 A Jump Start to SQL Server BI 540,000 partitions, manually or programmatically by using Decision Support Objects (DSO), is astronomical. The excessive performance overhead you create when you partition along every tuple in a query is a serious concern for a couple of reasons. First, the query in Listing 2 isn’t supposed to return each year individually. Instead, the query should return only the sum of incidents in 3 years. An efficient partition would include the three specified years so that Analysis Services could calculate the sum solely within the partition. Second, the query doesn’t need to access just one street intersection; it has to scan all the street intersections regardless of the partitions you create. Being able to get to a particular street partition directly doesn’t improve performance because you still have to walk through every partition. You’d be better off keeping all the street intersections in the same partition. 
The bottom line is that you should partition along a tuple only when the partition can save your query from doing a sequential scan for that tuple.

Partition Testing
To see what kinds of partitions avoid a sequential scan, I devised tests that use Listing 2 and Listing 3 as benchmarks. In the rest of this chapter, I summarize the tests and some important results. I created six cubes of identical structure with 30 percent aggregation and varying partition designs. I wanted to partition along the Time and Occupant_Severity dimension members (which Figure 2 shows) that the test queries in Listing 2 and Listing 3 are looking for so that they can get to those members with no scan or a faster scan. Table 1 describes the partitioning data slices of these six test cubes. I gave the cubes names that reflect their partitioning dimensions and total number of partitions.

TABLE 1: Test Cubes and Their Partitioning Data Slices

Default1: Entire cube in one default partition.
Severity6: Partition at each of the [Severity Header].Members tuples—for example, ([Fatal]), ([PDO]).
PartitionYear2: Partition at each of the [Partition Year].Members tuples—for example, ([1]), ([2]).
Year6: Partition at the [Partition Year].[1] tuple and each of the [Partition Year].[2].Children tuples—for example, ([1997]), ([1998]).
PartitionYear_Severity7: Partition at the [Partition Year].[1] tuple and each of the CrossJoin({[Partition Year].[2]}, [Severity Header].Members) tuples—for example, ([2], [Fatal]), ([2], [PDO]).
Year_Severity31: Partition at the [Partition Year].[1] tuple and each of the CrossJoin([Partition Year].[2].Children, [Severity Header].Members) tuples—for example, ([1997], [Fatal]), ([1997], [PDO]).

To study the effect of the number and speed of CPUs, disk I/O, and physical memory on partitioned cubes, I repeated the same tests on six different Dell servers. Table 2 shows the specifications for these servers, ranging from the highest end to the lowest end in hardware resources. High1, High2, and High3 are high-end production-scale servers; Low1 and Low2 are desktops; and Low3 is a laptop (which I used as a server for the sake of testing). Each test executes Listing 2 and Listing 3 from the client machine Low2 against every test cube on all six servers.

TABLE 2: Test Server Specifications

Server   CPU (MHz)   # of CPUs   Average disk speed (MB/sec)   RAM (GB)
High1    549         8           11                            4
High2    549         8           10                            4
High3    499         4           16                            1
Low1     1994        2           7                             4
Low2     1694        1           8                             0.5
Low3     1130        1           8                             0.5

All the tests measured the response times of Listing 2 and Listing 3. Figure 3 shows Listing 2's performance on all the servers.

Figure 3 Listing 2's performance

I drew the following conclusions for Listing 2:
• High-end servers (with multiple low-speed CPUs) performed worse than low-end servers (with one high-speed CPU) regardless of cube partitioning.
• CPU speed—rather than the number of CPUs, disk speed, or amount of memory—drives performance.
• Effective partitioning makes the query perform 5 to 10 times faster than on the default partition, especially on slower CPUs.
• Queries that have calculated members, such as the one in Listing 2, are CPU-bound.
• Partitioning along queried data slices, as I did in the Year_Severity31 and PartitionYear_Severity7 test cubes, gives the best performance.
Slicing along queried members (e.g., slicing along the six members of the Severity dimension and the three members of the Year level of the Time dimension) prevented sequential scans. Brought to you by Microsoft and Windows IT Pro eBooks 64 A Jump Start to SQL Server BI • Minimizing partition sizes by excluding members that you don’t query frequently (e.g., [Partition Year].[1], which includes the years 1985 through 1996) doesn’t prevent a sequential scan but does speed up the scan. • Test results show that an aggregation level higher than 5 percent has no effect on performance, which proves my hypothesis that high aggregation levels are a waste of effort. Guidelines for Partitioning Based on the results of my tests and the conclusions I’ve drawn, I offer these partition- design guidelines. For all queries: • Never overlap partitions. • Never specify the [All] member as the partition data slice because doing so creates overlapping partitions. For queries like the one in Listing 3, which accesses only pre-aggregations: • No partitioning is necessary because its effect is negligible or negative. • Apply Analysis Services’ Usage-Based Optimization. For queries like the one in Listing 2, which calculates many aggregations on the fly: • Partition along queried data slices—for example, ([Partition Year].[2] .[1997], [Fatal]). • No Usage-Based Optimization is necessary because it has no effect. • Five percent aggregation is the maximum aggregation level that provides performance improvements. If you have multiple slow queries that have different partitioning needs, consider creating different cubes for each query. For desktop ad hoc users who can retrieve just one screen of results at a time, using multiple cubes might be inconvenient. However, for custom applications (such as Web and reporting applications) that need complete results, you have the full control of accessing multiple cubes or even multiple databases behind the scenes for the best performance. The term “tuning” implies that you’ll have to experiment to achieve the optimal performance for your system. The techniques and guidelines that this chapter offers won’t necessarily create optimal performance right away, but if you take the time to examine your query usage and identify the slow queries, estimate which partitions might prevent sequential scans, and test those partitions, you’ll get closer to the performance you want. Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 8 Reporting Services 101 65 Chapter 8: Reporting Services 101 By Rick Dobson SQL Server 2000 Reporting Services, the SQL Server-based enterprise reporting solution Microsoft released in January 2004, is positioned to become one of the most popular SQL Server components. Nearly all organizations need to produce reports from their data, and with Reporting Services, Microsoft filled this large hole in SQL Server’s toolkit. You can install Reporting Services on any SQL Server 2000 computer at no additional cost, and you’ll be able to install it as part of SQL Server 2005. In spite of the solution’s benefits and the excitement surrounding its initial release, many SQL Server professionals have limited or no hands-on experience with Reporting Services. If you’re like many database professionals, you might have put off using Reporting Services because of its relative newness, the fact that it requires a separate SQL Server installation that works along with your production SQL Server, or maybe its list of prerequisites. 
But Reporting Services isn’t so new any more, and Microsoft has released Reporting Services Service Pack 1 (SP1), which fixes the bugs in the initial release. In addition, Microsoft is integrating Reporting Services with SQL Server 2005, so learning how to use Reporting Services now will give you a head start on SQL Server 2005. This chapter gives you the basics for getting started with Reporting Services and includes SP1 examples that you can reproduce in your test environment. I start by giving you the prerequisites for using Reporting Services and explaining where to get it. Then, I walk you through the steps for authoring two reports and for deploying those reports to the Report Server, Reporting Services’ main component. Finally, I teach you two ways to view deployed reports. Installing Reporting Services To properly install Reporting Services, your system needs four elements. First, you need Windows Server 2003, Windows XP, or Windows 2000 with the most recent service packs installed. Second, you need Microsoft IIS because Reporting Services runs as an XML Web service. Third, you need the standard, enterprise, or developer edition of SQL Server 2000. (Reporting Services isn’t compatible with earlier SQL Server releases.) Fourth, report designers need Visual Studio .NET 2003, which hosts Reporting Services’ Report Designer component. (For administrators who don’t design reports, Reporting Services provides a different UI that permits the creation of folders, data sources, and users and the assignment of permissions to users.) After you make sure your system meets the prerequisites, you can install Reporting Services, then install SP1 to update the initial release. You can download a trial version of Reporting Services at the URL in Related Reading. Creating Your First Report The only Report Designer Microsoft offers for authoring Reporting Services reports is in Visual Studio .NET 2003. When you install Reporting Services, the installation process automatically updates Visual Studio .NET by adding a new project type called Business Intelligence Projects. You don’t necessarily Brought to you by Microsoft and Windows IT Pro eBooks 66 A Jump Start to SQL Server BI need to have Visual Studio .NET installed on the same server as Reporting Services. As I explain in a moment, you can reference a target-server URL for Reporting Services, which can be different from the location of the workstation you use to run Visual Studio .NET. Within this project type are two templates named Report Project Wizard and Report Project. Both templates let you perform the steps to create a report: defining a report’s data source, specifying a report’s layout, previewing a report, and deploying a finished report to the Report Server. To create your first report, start a new Business Intelligence project in Visual Studio .NET, and choose the Report Wizard Project template. Name your project SSMRS-Intro. Read the wizard’s welcome screen, then click Next to go to the Select the Data Source screen and specify the report’s data source. Click Edit to open the familiar Data Link Properties dialog box that Figure 1 shows. On the dialog box’s Provider tab, select Microsoft OLE DB Provider for SQL Server as the type of data you want to connect to. Figure 1 Data Link Properties dialogue box As Figure 1 shows, the dialog box’s Connections tab lets you specify on the local SQL Server instance a Windows NT Integrated security-based connection to the Northwind database. 
Click Test Connection, then click OK to return to the Select the Data Source screen, which now shows a connection string that points to a data source named after the database. Note that unless you select the Make this a shared data source check box at the bottom of the screen, the wizard embeds the data source so that you can use it exclusively for this one report.

Clicking Next opens the wizard's Design the Query screen. You can either type an SQL query statement into the Query string text box or click Edit to open a graphical query designer that operates like the query builder in Enterprise Manager. For this example, you can use the following query:

SELECT CompanyName, ContactName, Phone, Country
FROM Customers
WHERE (Country = 'Canada') OR
  (Country = 'Mexico') OR
  (Country = 'USA')

Then, click Next to open the Select the Report Type screen. The wizard offers two report types: tabular and matrix. The matrix type is for a cross-tab report, which we won't create in this chapter's examples. For this demonstration, select Tabular. Figure 2 shows the next wizard screen, Design the Table, which lets you put the query fields where you want them in the report. Click Details to move the field names from the Available fields list box to the Details list box. These selections cause the fields to appear in a report's Details section. You can optionally create additional groupings around the Details section by adding fields to the Group list box. Clicking Next opens the Choose the Table Style screen. You can accept the default selection of Bold or highlight one of the other report styles. A preview window gives you a feel for how the different styles present your data.

Figure 2 Design the Table wizard screen

When you're running the Report Wizard for the first time in a project, the Choose the Deployment Location screen appears next. The wizard automatically populates the Report Server and Deployment folder text boxes. Because the Report Server for this chapter's examples runs from the local IIS Web server, the Report Server text box shows the path http://localhost/ReportServer. During installation, you specify the name of the Web server that hosts Reporting Services. By default, the wizard names the deployment folder after the project's name—in this case, SSMRSIntro. The final wizard screen assigns a default name to the report and shows a summary of the selections from the previous screens. The initial default report name in a project is Report1. When you're creating your own reports, you can change the default name to something more meaningful.

After you close the wizard, you're in the Visual Studio .NET report-design environment. Each report has three tabs: one to specify its data source, another for its layout, and a third to preview how it displays data. Figure 3 shows part of the Preview tab for Report1, which shows how the report will look after you deploy it. Report1 is for one specific data source, but Reporting Services lets you use parameters to vary the output in a report.

Figure 3 Partial Preview tab for Report1

Creating a Drilldown Report
For your second report, let's use a shared data source instead of an embedded one, as you did to create Report1. A shared data source is useful because you can reuse it in multiple reports.
Start by right-clicking Shared Data Sources in the Solution Explorer, which you see in Figure 3’s right pane, then choosing Add New Data Source to open a Data Link Properties dialog box like the one that Figure 1 shows. Complete the dialog box to specify Northwind as the data source, as you did for Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 8 Reporting Services 101 69 Report1. This process adds a new entry with the name Northwind.rds nested below Shared Data Sources in the Solution Explorer. Open the Report Wizard by right-clicking Reports in the Solution Explorer and choosing Add New Report. In the Select the Data Source screen, the wizard automatically selects Northwind as the database, referring to the Northwind.rds shared data source. If you had more than one shared data source, you could open the Shared Data Source drop-down box and select another shared data source. For the second report, enter the same query that you used for Report1 and select a tabular report style. In the Design the Table screen, add Country to the Group list box, and add CompanyName, ContactName, and Phone to the Details list box. Because you selected an item for the Group list box, a new screen called Choose the Table Layout appears before the Choose the Table Style screen. The table layout screen includes a check box called Enable drilldown. (You must select the Stepped button to make the Enable drilldown check box available.) Select Enable drilldown so that CompanyName, ContactName, and Phone column values will appear only after a user drills down to them by expanding a Country column value. Click Finish, and accept Report2 as the second report’s name. Figure 4 shows how Report2 looks in the Preview tab. Clicking the expand icon (+) next to a country name drills down to the fields nested within the group value and changes the + to a -. Notice that in Figure 4, you can view the CompanyName, ContactName, and Phone column value for the customers in Mexico, but not for either of the other two countries. Clicking the expanders for either of the other two countries will expose their hidden nested column values. Figure 4 Viewing Report2 in the Preview tab Brought to you by Microsoft and Windows IT Pro eBooks 70 A Jump Start to SQL Server BI Deploying a Solution In Reporting Services, deploying a solution is the process of publishing the reports, shared data sources, and related file items from a Visual Studio .NET project to a folder on a Report Server. Administrators can set permissions to restrict user access to reports and other solution items (e.g., shared data sources) on a Report Server. When you right-click a project in the Solution Explorer and invoke the Build, Deploy Solution command from a Visual Studio .NET project, you publish items from a solution to a folder on a Report Server. The first time you run the Report Wizard, the folder’s name and the Report Server URL appear on the Choose the Deployment Location screen. If the folder’s name doesn’t exist on a Report Server when a report author invokes the Build, Deploy Solution command, Report Server creates a new folder. You can view and update the deployment folder and Report Server URL settings from a project’s Property Pages. Right-click the project name in the Solution Explorer pane and choose Properties to open a project’s Property Pages dialog box. 
The TargetFolder setting corresponds to the deployment folder for a project, and the TargetServerURL setting contains the URL for the Report Server that hosts a solution’s target folder. Figure 5 shows the Property Pages dialog box for the SSMRSIntro example project. Alternatively, you can change a report’s deployment location by using the Reporting Services Report Manager application after you publish the report. Figure 5 Property Pages dialogue box Viewing Deployed Solution Items After you deploy reports and related items from a project to a Report Server, you can view them in one of two ways. First, you can use URL access to read the contents of reports with read-only permissions. Second, you can invoke Report Server for a richer mix of capabilities, including Reporting Services administration. Both approaches require a Windows account on the local Windows server or a Windows account from another trusted Windows server. Administrators have unlimited permissions, including assigning users to predefined and custom roles with permissions to perform tasks, such as reading a report. Brought to you by Microsoft and Windows IT Pro eBooks Section I: Essential BI Concepts — Chapter 8 Reporting Services 101 71 Connecting to Report Server through URL access. You can connect to a Report Server by navigating to its URL address from any user account that has permission to connect to it. For example, the IIS server hosting the Reporting Services Report Server in my office is called cab233a. Other computers in my office can connect to the Report Server at the URL http://cab233a/ReportServer. A user who has an authorized user account can navigate a browser to this URL and view a page showing links to folders on the Report Server. The link for the SSMRSIntro folder opens a Web page containing links for the two example reports in this chapter and the shared data source. The links are named after the item names in the SSMRSIntro project; the Report1 link opens Report1 in the browser. Figure 6 shows an excerpt from the URL-accessed view of Report1. Notice that the report appears the same as it does in Figure 3, but the Address box shows a URL that contains a command to render the report (rs:Command=Render). In addition, the Select a format drop-down box near the top of the pane lets users save the report in a variety of useful formats. For example, selecting Acrobat (PDF) file from the drop-down box lets users save a local copy of the report in PDF format for offline use. Figure 6 Excerpt from URL-accessed view of Report1 Brought to you by Microsoft and Windows IT Pro eBooks 72 A Jump Start to SQL Server BI Invoking Report Server. Users who have appropriate permissions can connect to the Report Server by navigating to http://servername/reports. For this chapter’s examples, the server name is cab233a. Figure 7 shows a connection to the cab233a Report Server and a folder list in the Home folder. Clicking any folder (e.g., SSMRSIntro) in a Home folder reveals the clicked folder’s contents. Users can use the Report Server folders to perform tasks according to the role assignments for their Windows account and any Windows groups they belong to. An administrator has all possible permissions. Report Server automatically adjusts its UI to expose permissions and items consistent with the role of each user. Figure 7 Invoking Report Server Beyond the Basics Reporting Services is Microsoft’s first entry into the enterprise reporting platform market. I like Reporting Services because it’s easy to install and use. 
Reporting Services will be even more tightly integrated in SQL Server 2005. Learning it now will help you later as you start learning SQL Server 2005. As you work with Reporting Services you'll discover that its capabilities go far beyond what I cover in this tutorial, but you can use the information in this chapter as a first step to expanding your enterprise reporting capabilities.

Section II: BI Tips and Techniques

Improve Performance at the Aggregation Level
You can improve OLAP performance when you set a cube's aggregation level. When you build a cube, you set the aggregation level according to the desired speedup in processing queries. (Speedup describes how much faster queries run with precreated aggregations than without aggregations.) The system estimates the speedup based on the I/O amount that the system requires to respond to queries. The total possible number of aggregations is the product of the number of members from each dimension. For example, if a cube has two dimensions and each dimension has three members, then the total possible number of aggregations is 9 (3 x 3). In a typical cube, the number of aggregations possible is extremely large, so calculating all of them in advance isn't desirable because of the required storage space and the time it takes to create the aggregations. Imagine a cube with four dimensions, each with 10,000 members. The total possible number of aggregations is 10,000 to the fourth power, or 10^16. When you tell SQL Server 7.0 OLAP Services to calculate aggregations for a 20 percent speedup, OLAP Services picks key aggregations (which are distributed across the cube) to minimize the time required to determine any other aggregations at query time.
—Russ Whitney

Using Children to Automatically Update Products
Let's say you want to write an MDX query that shows sales for all hot beverage products for each month of the year. That task sounds simple enough, but what if you add and remove products from your product list each month? How would you write the query so you don't have to update it every time you update your list of products? Here's a trick to help: Use the descendants or children function. The example query that Listing 1 shows uses both of these functions. Try running Listing 1's query in the MDX Sample program. The descendants and children functions are powerful.
—Brian Moran and Russ Whitney

Listing 1: Code That Uses the Descendants and Children Functions

SELECT Descendants([Time].[1998],[Time].[Month]) ON COLUMNS,
  [Product].[AllProducts].[Drink].[Beverages].[Hot Beverages].Children ON ROWS
FROM Warehouse

Saving DTS Information to a Repository
To save Data Transformation Services (DTS) information into the Microsoft Repository, choose SQL Server Repository as the location for saving the package. Then, use the Advanced tab on the Package Properties to set the scanning options, which Figure 1 shows. Doing so causes DTS to call the OLE DB scanner to load all source and target catalogs into the Repository. If you don't set the scanning options, DTS creates DTS Local Catalogs as the reference for all source and target catalogs, which can make locating the databases impossible. Each subsequent save replicates this reference, so you can't keep comments and other descriptive information updated.
Brought to you by Microsoft and Windows IT Pro eBooks Section II: BI Tips and Techniques 75 Figure 1 Package Properties Advanced tab You can run into problems when you try to save certain DTS transformations to a repository. If you use a script to perform a simple transformation and you choose the source columns explicitly (not from a query), all the transformation data is captured, as you can see in the transformation model in “The Open Information Model,” March 2000, InstantDoc ID 8060. If you choose a query as the transformation source, that source becomes objects that aren’t part of the OLE DB imported data. This choice makes following the connection back to the true source objects difficult. Also, the query isn’t parsed to create a connection between the query columns and the columns you select the data from. So in many cases, the connection between source and target is available, but in some, it isn’t. You can solve these problems by writing a program to resolve the references in a repository or by using a custom model along with the DTS model to store the source target mappings. —Patrick Cross and Saeed Rahimi Intelligent Business I knew nothing about business intelligence (BI) until I sat through a session about a new feature tentatively called the d-cube (for data cube) during the developer’s conference several years ago for the beta version of SQL Server 7.0 (code-named Sphinx). The d-cube feature appeared in SQL Server 7.0 as OLAP Services, which evolved into Analysis Services in SQL Server 2000. At the time, I was sure that OLAP Services would immediately revolutionize the database world. In a nutshell, Microsoft’s BI tools are all about letting the right people ask the right questions at the right time, then applying the answers to achieve competitive advantage. You’d think everyone would be using OLAP by now, but most organizations haven’t yet applied modern OLAP techniques to their decision making. In fact, many still have no idea what OLAP is. The adoption of BI as a mainstream approach to problem solving has been much slower than I originally anticipated. However, I believe that the Brought to you by Microsoft and Windows IT Pro eBooks 76 A Jump Start to SQL Server BI adoption rate is beginning to pick up and that more companies will embrace BI for competitive advantage. After all, who doesn’t want to make better decisions? I firmly believe that Analysis Services is an opportunity-packed specialty for SQL Server professionals, and I’m putting my money where my mouth is. I’m not going to let my core skills in SQL Server development rust away, but I do plan to spend most of my R&D time this year focusing on becoming a hard-core Analysis Services expert. Implementing successful OLAP solutions can have a tremendous impact on your client’s bottom line, which is fulfilling for a database professional. But most important, I think the demand for skilled Analysis Services engineers will far exceed the supply, which is great for my wallet. I’ve found that learning the basics of Analysis Services is relatively simple. The hardest tasks to master are modeling data multidimensionally (you’ll need to forget many of the databasenormalization lessons you’ve learned over the years) and using MDX to query the data (MDX is a rich query language, but it’s much harder to learn and master than SQL). You’ll need to start somewhere if you’re intent on becoming an Analysis Services pro. I suggest you start by attempting to master MDX. 
As the market for Analysis Services experts grows, the demand for your skills is sure to follow. —Brian Moran Techniques for Creating Custom Aggregations Custom rollup techniques can solve a variety of problems, but a couple of alternative techniques also let you create custom aggregations. For example, if you need to define an algorithm for aggregating one or more measures across all dimensions but the basic Analysis Services aggregation types won’t do, you can use either a calculated member in the measure’s dimension or a calculated cell formula that you limit to one measure. Both of these techniques are powerful because you use MDX formulas, which are flexible and extensive, to define them. Calculated cells are possibly the most powerful custom aggregation tool because they control the way existing (noncalculated) dimension members are evaluated and you can limit their effects to almost any subset of a cube. —Russ Whitney Using Loaded Measures to Customize Aggregations A common technique for customizing the way you aggregate a measure is to define a calculated measure based on a loaded measure, then hide the loaded measure. For example, you might aggregate a Sales measure as a sum, but in two dimensions, you want to aggregate the measure as an average. In the measure definition, you can specify that a measure be named TempSales and be loaded directly from the Sales column in the fact table. You can mark this measure as hidden so that it’s invisible to OLAP client applications; then, you can use TempSales in a calculation without TempSales being available to your analysis users. You can then use Analysis Manager to create a new calculated measure named Sales that will return the unmodified TempSales value except when you want the value to be an average of TempSales. This technique of creating calculated measures and hiding loaded measures is common in SQL Server 7.0 OLAP Services implementations because OLAP Services doesn’t support calculated cells or custom rollup techniques. However, calculated measures in both SQL Server 2000 and 7.0 have several drawbacks. For example, you can’t use calculated measures when writing back to cube cells. One reason Analysis Services and OLAP Services won’t let you write back to a calculated measure is Brought to you by Microsoft and Windows IT Pro eBooks Section II: BI Tips and Techniques 77 that a calculated measure doesn’t map directly back to a column in the fact table, so Analysis Services doesn’t know which loaded measure to modify. Consequently, calculated cells aren’t particularly useful in budgeting or modeling applications. Another drawback of calculated members is that you can’t use them with the MDX AGGREGATE function. The AGGREGATE function is a common MDX function that you use to aggregate a set of members or tuples. The measure you identify in the set determines the aggregation method that the AGGREGATE function uses. If you use a calculated measure in the set, Analysis Services (and OLAP Services) can’t determine the aggregation method, so the AGGREGATE function fails. If you use a technique such as calculated cells to modify a measure’s aggregation, the AGGREGATE function works because it is based on the measure’s defined aggregation method. —Russ Whitney Caution: Large Dimensions Ahead Be very careful when dealing with dimensions. Look before you leap into Analysis Services’ very large dimension feature, which places large dimensions into a separate memory space. This feature is buggy, so avoid it. 
Also be careful with Relational OLAP (ROLAP) dimensions, which the server reads into memory as needed at runtime. Because you can place a ROLAP dimension only into a ROLAP cube, performance will suffer mightily. In theory, ROLAP mode supports larger dimensions, but it’s non-functional in my experience. —Tom Chester Decoding MDX Secrets I joked recently that I wished I knew some super-secret MDX command to help solve the problem of creating virtual dimensions on the fly. Well, believe it or not, an MDX function that was undocumented in the initial release of Analysis Services provides a great solution to this problem. The MDX function is CreatePropertySet()—you use it to create a group of calculated members, one for each member-property value. The query that Listing 2 shows, which creates a group of calculated members that are children of the All Store Type member in the FoodMart Sales cube, is a simple example of how to use this function. The query creates one calculated member for each unique Member Card property value for the members of the Customers Name level. The query creates a new set, CardTypes, with the new calculated members and displays it on the rows of the result. Figure 2 shows the query’s result set. —Russ Whitney Listing 2: Query That Creates a Group of All Store Type Children WITH SET CardTypes AS ‘CreatePropertySet([Store Type].[All Store Type], [Customers].[Name].Members, [Customers].CurrentMember.Properties (“Member Card”))’ SELECT {[Unit Sales]} ON COLUMNS, CardTypes ON ROWS FROM Sales Brought to you by Microsoft and Windows IT Pro eBooks 78 A Jump Start to SQL Server BI Figure 2 The results generated by the query in Listing 2 Improve Cube Processing by Creating a Time Dimension Table Some people create a view from the fact table by using the syntax SELECT [Fact_Table].[Date] FROM [Fact_Table] GROUP BY [Fact_Table].[Date] and use the view as a source for the Time dimension. This method has a couple of drawbacks. First, it’s inefficient: The fact table is usually much bigger than the dimension table, and accessing a view of the fact table is the same as accessing the underlying base table. Another disadvantage of using the fact table as a source for the Time dimension is that the dimension won’t contain a date that had no activity. Thus, this method can create gaps in the dimension sequence by skipping weekends, holidays, and so on. If you want these gaps, remember to exclude irrelevant dates from your Time dimension table. A better way to create a Time dimension is to create a special Time dimension table in your data warehouse to hold all relevant dates. Simply create the table in Microsoft Excel, then use Data Transformation Services (DTS) to import the table into the data warehouse. This approach to creating a Time dimension significantly improves dimension and cube processing because you don’t need to query the fact table to get the Time dimension members. And if the table’s date field is of any time data type (e.g., smalldatetime), Analysis Services’ and OLAP Services’ Dimension Wizard, which you use to create dimensions, detects that the dimension could be a Time dimension and prompts you to confirm its choice, as Figure 3 shows. After you confirm that the dimension is a Time dimension, the Dimension Wizard helps you create the Time dimension’s levels (e.g., Year, Quarter, Month, Day), as Figure 4 shows. You can also define the first day and month of the year; the default is January 1. 
—Yoram Levin Brought to you by Microsoft and Windows IT Pro eBooks Section II: BI Tips and Techniques 79 Figure 3 Confirming a dimension type Figure 4 Creating time dimension levels Brought to you by Microsoft and Windows IT Pro eBooks 80 A Jump Start to SQL Server BI Transforming Data with DTS Data Transformation Services (DTS) is widely used as a SQL Server data-transfer tool, but in addition to simple data transfer, DTS offers the ability to perform data transformations to the data you’re transferring. The ability to perform data transformations makes DTS more versatile than most other data-transfer tools. DTS’s transformations let it perform a variety of tasks that would otherwise require custom programming. For example, by using DTS transformations, you can perform simple conversions such as converting a set of numeric codes into alphabetic codes. Or you can perform more sophisticated jobs such as turning one row into multiple rows or validating and extracting data from other database files as the transformation executes. DTS transformations are row-by-row transactions, and as such, they add overhead to the transfer process. The amount of added overhead depends mainly on how much work the transformation script must perform. Simple data conversion adds negligible overhead, while more involved transformations that require accessing other database tables add significantly more overhead. To add a custom transformation to a DTS package, click the Transform button on the Select Source Tables and Views dialog box; you’ll see the Column Mappings, Transformations, and Constraints dialog box open. Then, click the Transformations tab to display the Edit Script dialog box, which contains a VBScript template that by default includes code that copies the source columns to the destination columns. You can freely modify this template to create your own custom transformations. The code in Listing 3 shows how DTS converts the values in the column named CHGCOD from a numeric code in the source database to an alpha code in the target database. You can see that the code tests the CHGCOD column to see whether it’s equal to a 1, 2, or 3. If the code finds a 1, it writes an A to the destination table. If the code finds a 2 or 3, it writes a B or C (respectively) to the destination column. The code writes a D to the target column if it finds any other value. —Mike Otey Brought to you by Microsoft and Windows IT Pro eBooks Section II: BI Tips and Techniques 81 Listing 3: Code That Shows DTS Conversion of Numeric Code to Alpha Code Function Main() Dim nChgCod DTSDestination(“CUSNUM”) = DTSSource(“CUSNUM”) DTSDestination(“LSTNAM”) = DTSSource(“LSTNAM”) DTSDestination(“INIT”) = DTSSource(“INIT”) DTSDestination(“STREET”) = DTSSource(“STREET”) DTSDestination(“CITY”) = DTSSource(“CITY”) DTSDestination(“STATE”) = DTSSource(“STATE”) DTSDestination(“ZIPCOD”) = DTSSource(“ZIPCOD”) DTSDestination(“CDTLMT”) = DTSSource(“CDTLMT”) nChgCod = DTSSource(“CHGCOD”) If nChgCod = “1” Then DTSDestination(“CHGCOD”) = “A” ElseIf nChgCod = “2” Then DTSDestination(“CHGCOD”) = “B” ElseIf nChgCod = “3” Then DTSDestination(“CHGCOD”) = “C” Else DTSDestination(“CHGCOD”) = “D” End If DTSDestination(“BALDUE”) = DTSSource(“BALDUE”) DTSDestination(“CDTDUE”) = DTSSource(“CDTDUE”) Main = DTSTransformStat_OK End Function Supporting Disconnected Users A common shortcoming of analytic applications is that they can’t support mobile or disconnected users. 
Because analytic applications are complex, developers move more application functionality to Web browser–based UIs, which use dynamic HTML (DHTML) and JScript to minimize the amount of application code that workstations download. Unfortunately, disconnected workstations (e.g., laptops) can’t run this limited code without a network connection. Because I’m one of those mobile users, I appreciate applications that I can use whether or not I’m connected. The number of users like me is growing; more workers in the enterprise are using laptops instead of desktop computers. Managers, especially, rely on mobility, and they’re heavy consumers of analytic applications. To support disconnected users, developers need to enable users to take part or all of an application with them. I don’t have a solution that will make a fancy DHTML Web application run well on a disconnected laptop. But I can tell you about a new feature in SQL Server 2000 Analysis Services that makes supporting disconnected users easier: local-cube Data Definition Language. DDL provides a simple way to create local-cube files in Analysis Services through MDX. These local-cube files let you put part or all of the data from a server-based cube onto a laptop. You can then use the local-cube file to perform the same analysis that you could if you were connected to the OLAP server on a Brought to you by Microsoft and Windows IT Pro eBooks 82 A Jump Start to SQL Server BI network. To create a local cube without this new MDX syntax, you must construct a verbose SQL-like statement and pass it to ADO through a connection string. Local-cube DDL is superior to the old connection-string method for three reasons. First, the shortcuts in the DDL syntax make using it simpler than repeating all the details of the server-based cube to create a local cube with the same dimension structures. Second, most OLAP applications don’t give users the ability to customize the connection string to the required degree, so developers created custom applications to provide the CREATECUBE functionality. Third, a variation of the new DDL can create session- scoped temporary cubes —Russ Whitney Dependency Risk Analysis Many businesses use a type of analysis called dependency risk analysis. This type of analysis determines whether one group of items in your business (e.g., products) is overly dependent on just one item of another group (e.g., customers). Retailers describe an overly dependent item as at risk. For example, you might want to find out which products depend most on a single customer. To answer this question, you need to find what percentage of total store sales for each product comes from a single customer. To test yourself, find the top 10 highest risk products, and show the percentage and amount of the product’s sales that are at risk. Listing 4 shows a query that defines two new measures. One measure calculates the total of Store Sales for the selected product (e.g., you might want to find the total sales to that product’s top customer). The other measure calculates the percentage of the product’s total sales that’s at risk. The MDX query in Listing 4 uses the PercentAtRisk measure to find the 10 products with the highest percentage of Store Sales at risk. The query then displays both the amount at risk and percentage at risk for each of the top 10 products. —Russ Whitney Listing 4: Query That Defines Two New Measures WITH MEMBER [Measures].[AmountAtRisk] AS ‘ SUM( TOPCOUNT([Customers]. 
[Name].MEMBERS, 1, [Store Sales]), [Store Sales] )’ MEMBER [Measures].[PercentAtRisk] AS ‘ [AmountAtRisk] / ([Store Sales], [Customers].[All Customers] )’, FORMAT_STRING = ‘#.00%’ SELECT { [AmountAtRisk], [PercentAtRisk] } ON COLUMNS, TOPCOUNT( [Product].[Product Name].MEMBERS, 10, [PercentAtRisk] ) ON ROWS FROM Sales Brought to you by Microsoft and Windows IT Pro eBooks Section II: BI Tips and Techniques 83 Choosing the Right Client for the Task The lessons our development team learned from building a Web-based analysis system can provide a valuable guide for deploying OLAP solutions in an enterprise. Microsoft Excel provides a capable, familiar client that you can deploy in a LAN but requires realtime connectivity to the OLAP server. Office Web Components (OWC) works well for deploying an Analysis Services client in an intranet because you can easily control the client platform and open ports securely in an intranet. The Analysis Services Thin Web Client Browser provides a good Internet solution when firewalls are in place and you want minimal impact on the user OS. For any development project, you need to understand the business requirements and needs of the people who will use the products you develop. By outlining requirements and weighing all the options, you can discover the right solution to satisfy your client’s requirements. —Mark Scott and John Lynn Using Access as a Data Source To analyze a relational data source, you need to first publish it as a data source in the Windows 2000 or Windows NT environment by establishing a data source name (DSN). To set up Microsoft Access as a data source, start by accessing the Data Sources (ODBC) settings in Windows NT through the Start menu under Settings, Control Panel. In Windows 2000, choose Start, Settings, Administrative Tools. Double-click to open the Data Sources (ODBC), then select the System DSN tab. Click Add; in the Create New Data Source window, select Microsoft Access Driver (*.mdb). Click Finish to display the ODBC Microsoft Access Setup dialog box. Under Data Source Name, enter the name you choose for your Access data source. In the Setup Wizard’s Database section, click Select. In the Select Database dialog box, browse to find the database, select it, then click OK. To finish the source-data selection sequence, click OK in the ODBC Microsoft Access Setup and the ODBC Data Source Administrator dialog boxes. —Frances Keeping Calculating Utilization One of the most common measurements of group or individual performance in a consulting agency is utilization. Decision makers in consulting groups calculate utilization by dividing the total number of hours billed by the total number of business hours available (e.g., 8 hours for each business day). Having a high percentage of utilization is good because it means you’re effectively deploying available resources to generate revenue. You can use the cube structure that Figure 5 shows to create the MDX for a measure that calculates utilization for a selected employee and time period. The query in Listing 5 calculates utilization as a percentage of hours worked. The meat of the formula is in the definition of the calculated measure, AvailableHours. AvailableHours multiplies the number of work days in the selected time period by 8 hours. You get the total number of work days by eliminating weekend days and holidays from the total number of calendar days. The Utilization measure then divides the total work hours by the available work hours to get a percentage. 
The result is a percentage that can be more than 100 percent if the average number of work hours exceeds 8 hours per day. —Russ Whitney Brought to you by Microsoft and Windows IT Pro eBooks 84 A Jump Start to SQL Server BI Figure 5 Cube structure to create MDX Listing 5: Query That Calculates Utilization as a Percentage of Hours Worked WITH MEMBER [Measures].[AvailableHours] AS ‘COUNT( FILTER( DESCENDANTS( [Time].[Project].CURRENTMEMBER, [Time].[Project].[Day] ), ([Time].[Project].CURRENTMEMBER.PROPERTIES(“Weekend”) = “0”) AND ([Time].[Project].CURRENTMEMBER.PROPERTIES(“Holiday”) = “0”))) * 8’ MEMBER [Measures].[Utilization] AS ‘ [Hours] / [AvailableHours]’, FORMAT_STRING = ‘#.0%’ SELECT {[Hours], [Utilization], [AvailableHours] } ON COLUMNS, [Time].[Project].[All Time].[2002].CHILDREN ON ROWS FROM Tracker WHERE ([Employee].[All Employee].[Admin].[rwhitney]) Brought to you by Microsoft and Windows IT Pro eBooks Section II: BI Tips and Techniques 85 Use Member Properties Judiciously When you start up the OLAP server, the server loads every dimension—including member keys, names, and member properties—into server memory. Because Analysis Services is limited to 3GB of RAM, this is one of the primary bottlenecks for enterprise-scale deployments. For this reason, limit member properties to the bare essentials, particularly when the level has lots of members. —Tom Chester Get Level Names Right from the Get-Go When you first build a dimension, level names default to the same names as the column names in the dimension table (except that Analysis Manager replaces special characters with spaces). This means that you wind up with level names like Cust Code, or worse. Then, after the cube is processed, you can’t change the level names without reprocessing the dimension, which in turn requires that you reprocess the cube. Because it’s painful to rename levels after the cube is processed, many cubes go into production with frighteningly cryptic level names. To compound matters, MDX formulas are often written with dependencies on the unfriendly level names, adding another hurdle to the level-rename task. Cubes are supposed to be easily usable right out of the box, so avoid this pitfall by getting the level names right from the beginning. As soon as you build a dimension, change the default level names to user-friendly names before placing the dimension into the cube. —Tom Chester Aggregating a Selected Group of Members Sometimes you need to aggregate a group of dimension members for one query. For example, suppose you want to return Unit Sales for the quarters of 1997 for each product family. The solution is easy. But what if you want to run the same query for only the customers in California and Oregon, leaving out Washington? This is a common problem with a simple solution. All you have to do is create a calculated member that aggregates California and Oregon, and select that calculated member in the WHERE clause, as Listing 6 shows. Listing 6: Code That Creates a Calculated Member and Selects It in a WHERE Clause WITH member [Customers].[All Customers].[USA].[CA-OR] AS ‘ Aggregate({[Customers].[All Customers].[USA].[CA], [Customers].[All Customers].[USA].[OR] })’ SELECT [Time].[1997].Children ON Columns, [Product].[Product Family].Members ON Rows FROM [Sales] WHERE ( [Customers].[All Customers].[USA].[CA-OR], [Unit Sales] ) The Aggregate function aggregates the set of members passed to it and uses the Aggregation method defined for the member’s dimension. 
In this case, the Customers dimension is aggregated with a Sum function that we defined in the OLAP Manager when we built the cube, so the new dimension member [CA-OR] is the sum of [CA] and [OR]. This tip is useful, but be careful. Performance can suffer if you use aggregation heavily in the WHERE clause. If you have a common alternative aggregation, you might be better off creating a second hierarchy for your dimension. —Brian Moran and Russ Whitney Brought to you by Microsoft and Windows IT Pro eBooks 86 A Jump Start to SQL Server BI Determining the Percentage of a Product’s Contribution A common business problem is determining percentage of contribution to a group. For example, you might want to know what percentage of the total revenue of a product line a particular product contributed, or what percentage of growth of sales in a country each region in that country contributed. Here’s one way to solve this problem: For each revenue or dimension combination you want to analyze, create a new calculated member. For instance, if you want to analyze Store Sales as a percent of a group in the Product dimension, create a member, as Listing 7 shows. Figure 6 shows the result set from this query. —Brian Moran and Russ Whitney Listing 7: Code That Creates a Calculated Member CREATE MEMBER [Sales].[Measures].[Store Sales Perc] AS ‘ (Product.CurrentMember, [Store Sales]) / (Product.CurrentMember.Parent, [Store Sales])’ -- The preceding code lets you write the following simple MDX query: SELECT { [Store Sales Perc] } ON COLUMNS, [Drink].children ON ROWS FROM Sales Figure 6 The results generated by the query in Listing 7 Avoid Crippled Client Software Can you imagine using a front-end tool for a relational database management system (RDBMS) that doesn’t let you specify an SQL statement? Of course not. Yet somehow that’s what developers are faced with in the OLAP space. Remarkably, many shrink-wrap query and reporting tools that work with Analysis Services are crippled in a fundamental sense—they don’t let developers supply an MDX SELECT statement. The problem is this: None of the commercial clients, even the most robust, come close to exposing the full power of MDX. Maybe simple cube browsing is all your users require. Nonetheless, to avoid painting yourself into a corner, choose a front-end tool that lets the developer specify custom MDX SELECT statements. There’s a catch to this advice, however. The client tools that don’t expose MDX tend not to be tightly bound to Analysis Services—they provide connectivity to other data sources. However, I don’t think it’s asking too much for these query- and reporting-tool vendors to expose an MDX SELECT query string as a pass-through. —Tom Chester Brought to you by Microsoft and Windows IT Pro eBooks Section II: BI Tips and Techniques 87 Setting OLAP Cube Aggregation Options After you create an OLAP cube and choose the storage technique that’s optimal for your situation, the OLAP server designs the aggregations and processes the cube. If you choose the Relational OLAP (ROLAP) storage technique, the OLAP server will create the summary tables in the source database after it processes the cube. Otherwise, aggregations are stored in OLAP server native format. You can choose the degree of aggregation by considering the level of query optimization you want versus the amount of disk space required. Figure 7 shows the Storage Design Wizard. 
For example, I chose 80 percent performance, which produced 124 aggregations and required 22.5MB of storage space for Multidimensional OLAP (MOLAP) storage. The aggregations roll up, so if you choose low performance in favor of conserving disk space, the OLAP server query engine will satisfy queries by summing existing aggregations. —Bob Pfeiff, Tom Chester Figure 7 Storage Design Wizard Use Views as the Data Source Always use views as the data source for dimension tables and fact tables. In addition to providing a valuable abstraction layer between table and cube, views let you leverage your staff’s expertise with relational database management systems (RDBMSs). When you use a view as a fact table, you can manage incremental updates by altering the WHERE clause within the view instead of assigning the WHERE clause to an OLAP partition. When you use a view to source a dimension, you can define logic inside the view that otherwise would have to be defined in Analysis Services (e.g., formulated member names, formulated member properties). —Tom Chester Brought to you by Microsoft and Windows IT Pro eBooks 88 A Jump Start to SQL Server BI Enter Count Estimates When you first build a dimension, Analysis Services stores the member count for each level as a property of the level. This count is never updated unless you explicitly update it (manually or by using the Tools, Count Dimension Members command). In addition, it’s typical for cubes to initially be built against a subset of the data warehouse. In this case, the cube will likely go into production with the count properties understated by an order of magnitude. Here’s the gotcha: The Storage Design Wizard uses these counts in its algorithm when you’re designing aggregations. When the counts are wrong, the Storage Design Wizard is less effective at creating an optimal set of aggregations. The solution is simple—when you build the dimension, manually enter estimated counts for each level. Using Dynamic Properties to Stabilize DTS To cut down on coding and thereby minimize errors, Microsoft added the Dynamic Properties task to Data Transformation Services (DTS) in SQL Server 2000. With the assistance of this task, you don’t have to create bulky ActiveX Script tasks to dynamically set a DTS property, such as a username that you use to establish a connection. This task lets you change the value of any nonidentifying property that’s accessible through the DTS object model (e.g., non-name/ID properties of a step, connection, task, package, or global variable). What once took 3 weeks to stabilize, you can now write and stabilize in less than a day. Using the Dynamic Properties task gives you faster performance than writing the same process with an ActiveX Script task because DTS doesn’t resolve the ActiveX Script task until runtime. —Brian Knight Leave Snowflakes Alone Analysis Services lets you source dimensions from either a normalized snowflake schema or a flattened star schema. Microsoft recommends flattening snowflake dimensions into stars for performance reasons, a practice that most Analysis Services developers follow. However, unless the relational data mart is consumed by something other than Analysis Services, this practice has few benefits and considerable drawbacks. For these reasons, resist the urge to flatten. A snowflake schema provides the benefits of a normalized design. With a star schema, managing attributes for the repeating non-leaf members is awkward at best. A snowflake gives you unique keys at each level. 
This lets you import data into a cube at any level of granularity, a critical ability in financial-planning applications, for example. Because dimension tables aren’t queried at runtime (except for in the notoriously slow relational OLAP—ROLAP—mode), snowflake dimensions have no impact on query performance. The only downside to a snowflake dimension is that it (the dimension, not the cube) is slower to process than a star because of the joins that are necessary. However, the time it takes to process dimensions is a minor factor compared to the time necessary for cube processing. Unless the dimension is huge and the time window in which processing must occur is tight, snowflakes are the way to go. —Tom Chester Brought to you by Microsoft and Windows IT Pro eBooks Section II: BI Tips and Techniques 89 Create Grouping Levels Manually No dimension member can have more than 64,000 children, including the All member. This limit isn’t as onerous as it sounds; usability is apt to nail you before the hard limit does. A member with even 10,000 children usually presents a usability problem—that’s a lot of rows to dump on a user drilling down into the dimension. Whether you’re fighting the limit or simply working to design your dimension so that it provides bite-size drilldowns, the solution is to build deep, meaningful hierarchies. But when there’s no raw material from which to build a meaningful hierarchy, you must resort to a grouping level, aka a Rolodex level, such as the first letter of the last name for a customer dimension. Analysis Services has a feature (create member groups in the Dimension Wizard) that can create a grouping level for you automatically. Don’t use it! You won’t have control over the grouping boundaries. Instead, construct the level manually. This entails adding a new level to the dimension, then modifying the Member Name Column and Member Key Column properties. For instance, you might define the member key column and member name column for the grouping level as follows: LEFT(“CustomerDimTable”.” CustomerName”, 1) This expression bases the level on the first letter of the customer name, providing Rolodex-style navigation. Bear in mind, however, that this is a SQL pass-through; the expression is passed to the RDBMS, so the RDBMS dictates the syntax. That is, T-SQL has a LEFT() function, but another RDBMS might not. —Tom Chester Understand the Role of MDX Did you ever try to swim without getting wet? For all but the simplest of databases, that’s what it’s like when you try to design an OLAP solution without using MDX. Because shrink-wrap client software often negates the need to write MDX SELECT statements, many developers think they can successfully avoid MDX. This is folly. Sure, not every project requires MDX SELECT statements; commercial software is adequate for many situations. But MDX calculations should play an important role in most Analysis Services solutions, even those that aren’t calculation-intensive on the surface. Perhaps the most common example is a virtual cube that’s based on two or more source cubes. Calculated members are usually required to “glue” the virtual cube together into a seamless whole. Although the MDX isn’t necessarily complex, developers unaware of the role of MDX wind up making costly mistakes. Either they avoid virtual cubes entirely, or they shift logic that’s easily implemented in MDX to the extraction-transformation-load (ETL) process, where it’s more complicated and rigidly set. 
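As a rough illustration of that glue, a single calculated member can relate measures that originate in different source cubes. The sketch below assumes a virtual cube built from the FoodMart Sales and Warehouse cubes; the cube and measure names are assumptions, and the formula is only an example:

CREATE MEMBER [Warehouse and Sales].[Measures].[Net Margin] AS
  '([Measures].[Store Sales] - [Measures].[Warehouse Cost]) / [Measures].[Store Sales]',
  FORMAT_STRING = '#.00%'

With a member like this in place, users can query margin across both sets of measures without having to know which source cube each measure came from.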
—Tom Chester

Using NON EMPTY to Exclude Empty Cells
Many multidimensional cubes have empty cells, which occur because a user didn't load data into the cube for these members. For example, if you inspect the Sales cube in the FoodMart sample, its creators didn't load any data for 1998. You can use the NON EMPTY modifier to write an MDX query that references 1998 but keeps its empty cells out of the result, as Listing 8 shows.
—Brian Moran and Russ Whitney

Listing 8: MDX Code That Uses NON EMPTY to Suppress Empty 1998 Cells
SELECT NON EMPTY {[Time].[1997], [Time].[1998]} ON COLUMNS,
[Promotion Media].[Media Type].Members ON ROWS
FROM Sales

Formatting Financial Reports
When creating a financial report such as an income statement, you need to display the subtotals (which are parent members of a dimension) at the bottom of the statement—after the details (which are children). You can use the MDX Hierarchize() function with the POST option to force parent members to appear after their children. The following example shows the Hierarchize() function on the FoodMart 2000 Sales cube:

WITH SET MySet AS '{CA, CA.Children, [OR], [OR].Children}'
SELECT Hierarchize(MySet, POST) ON Columns
FROM sales
WHERE [Sales Count]

How can you change this query to sort the set MySet in ascending order while making sure the parents appear after their children? Thanks to Shahar Prish of Microsoft for providing the clever answer that Listing 9 shows. First, he sorted the items in descending order while preserving peer groupings (i.e., keeping children of a common parent together). Then, he used a Generate() function to reverse the order of the set. The result maintains the peer groupings, keeps the items in ascending order, and places the parents after the children. Notice that Shahar uses the AS keyword to name the sorted set MySetIterator. He also uses the Count and Current properties on the named set.
—Russ Whitney

Listing 9: Making Parents Appear After Their Children
WITH SET MySet AS '{CA, CA.Children, [OR], [OR].Children}'
SELECT Generate(ORDER(MySet, ([Sales Count], [1998]), DESC) AS MySetIterator,
  {MySetIterator.Item(MySetIterator.Count - Rank(MySetIterator.Current.Item(0), MySetIterator))}) ON 0
FROM sales
WHERE [Sales Count]

Analyzing Store Revenue
Retail businesses sometimes evaluate their stores' performance by analyzing revenue per square foot or cost per square foot. Use the FoodMart 2000 Sales cube to determine what the total store square footage is for each Store State (Store State is a level in the Store dimension). Note that what makes this problem unique is that square footage is stored as a member property in the Store dimension. Listing 10 shows a query that solves this problem. This query is interesting because it demonstrates how to aggregate (sum) a numeric member property. The query creates a new measure that returns the value of the Store Sqft member property. If the selected store member is above the Store Name level in the dimension, the query sums all store square footage values below the selected member in the hierarchy to determine the square footage value. Because MDX treats member properties as strings, the query uses the VBScript function VAL() to convert the member property from a string to a number before summing all the store square footage values.
—Russ Whitney Listing 10: Query That Returns the Value of the Store Sqft Member Property WITH MEMBER [Measures].[StoreSqft] AS ‘IIF(([Store].CurrentMember.Level.Name=”Store Name”), VAL([Store].CurrentMember.Properties(“Store Sqft”)), SUM(DESCENDANTS([Store].currentmember,[Store].[Store Name]), VAL([Store].CurrentMember.Properties(“Store Sqft”))))’ SELECT {[Measures].[StoreSqft]} ON COLUMNS, FILTER(DESCENDANTS([Store].[All Stores],[Store].[Store State]), [Measures].[StoreSqft]>0) ON ROWS FROM Sales Use Counts to Analyze Textual Information You can analyze any database—with or without numeric information—by using counts. Count measures can be easy or complex. In the FoodMart database, setting up the Sales Count measure is simple. You just create a new measure based on the fact table’s primary key and set the measure’s type to count. But let’s look at a more complex example. Say a table called fact contains line items (entries) for invoices. An invoice can contain one or more entries. So you probably want a count measure that counts invoices, not entries. To count invoices, you want to count only the groups of entries that make up an invoice. One way to solve this problem is to create a distinct count measure based on the fact table’s invoice number column. This measure will give you the invoice count values you want, but distinct count measures have two serious limitations. First, each cube can contain only one distinct count measure. Second, Analysis Services doesn’t aggregate distinct counts through dimension levels as it does other measures. The distinct count values are predetermined and stored at each cell with other measures, so you can’t create new aggregations during an MDX query’s execution. In an MDX query or a calculated member definition, using a function such as Aggregate( [USA], [Mexico] ) won’t work with a distinct count measure selected; determining the result of the function would require rescanning the fact table because the cube doesn’t contain enough information to determine the function’s result. Analysis Services can’t rescan the source table, but even if it could, the process would be prohibitively slow. The effect of this restriction is that distinct count measures don’t generally work well with other calculated members or sets. A second solution is to create an extra column in the source table to store an invoice count. Fill one entry for each invoice with a value of 1; fill all other instances of the invoice count field with values of 0. You can then create an Invoice Count measure that’s a sum of this invoice count column. Brought to you by Microsoft and Windows IT Pro eBooks 92 A Jump Start to SQL Server BI This solution works as long as you select in the cube a cell that encompasses a group of entries that make up complete invoices. If your cell includes only some of the entries in an invoice, the invoice count column might not include the entry that contains the 1 value and thus would produce a sum of 0 instead of 1 for that invoice. A third solution is to use a combination of the two approaches. Create an invoice distinct count measure, an invoice sum count measure, and an invoice count calculated measure that helps you determine which of the other two measures to use based on the cell that’s selected. The invoice distinct count measure will return the correct answer when only some of the entries in an invoice are selected, and the invoice sum count will work in all other situations. 
The invoice sum count also gives you the benefit of working when custom combinations of members are selected. This invoice count example shows that, in real-world situations, count measures can get complicated because the count might depend on a distinct combination of a group of fact table columns. —Russ Whitney Consolidation Analysis A common type of retail sales analysis is consolidation analysis. One example of consolidation analysis is customer consolidation analysis: If fewer customers are buying more products, your customers are consolidating. If more customers are buying fewer products, your customers aren’t consolidating. In the FoodMart 2000 Sales cube, you can use the Store Sales measure to determine the top 10 customers. Then, you can write an MDX query to determine whether the top 10 FoodMart customers are consolidating throughout the four quarters of 1997. But first, you need to create an MDX query that includes the four quarters of 1997 on the columns of the query’s result. Then, create two rows. The first row should display the total number of store sales that the top 10 customers purchased. The second row should display the percentage of total store sales that the top 10 customers purchased. Listing 11 shows the code that produces this result. Listing 11: Code That Displays Purchase Information for FoodMart’s Top 10 Customers WITH SET [Top 10] AS ‘TOPCOUNT( [Customers].[Name].Members, 10, ([Customers].CURRENTMEMBER, [Store Sales] ) )’ MEMBER [Measures].[Top 10 Amount] AS ‘Sum([Top 10], [Store Sales])’ MEMBER [Measures].[Top 10 Percent] AS ‘ [Top 10 Amount] / ([Customers].[All Customers], [Store Sales])’, FORMAT_STRING = ‘#.00%’ SELECT [Time].[1997].CHILDREN ON COLUMNS, { [Top 10 Amount], [Top 10 Percent] } ON ROWS FROM Sales I made this query a little easier to read by first creating a set with the top 10 customers based on Store Sales, then using this set in the other two calculated measure definitions. The first calculated measure sums the store sales for the top 10 customers to determine the store sales that the top customers are responsible for. Next, the Top 10 Percent measure determines what percentage of the total store sales comes from the top 10 customers. The query then displays both the Top 10 Amount and the Top 10 Percent for each quarter of 1997. Brought to you by Microsoft and Windows IT Pro eBooks Section II: BI Tips and Techniques 93 The query’s result shows that the top 10 customers are consolidating slightly. During first quarter 1997, the top 10 customers were responsible for 1.41 percent of all store sales; during fourth quarter 1997, that group accounted for 1.77 percent of store sales. —Russ Whitney Working with Analysis Services Programmatically Analysis Services has three programmatic interfaces that you can use from your analysis application. The first two interfaces are client-side interfaces: ADO MD and OLE DB for OLAP. Both of these programming interfaces offer functionality for data exploration, metadata exploration, write-back capabilities, and read-only analysis functions. Only the write-back capabilities affect the contents of the cube that other users in the system share. If you want to make other types of changes to Analysis Services data programmatically, you have to use the administrative programming interface Decision Support Objects (DSO). DSO lets you create and alter cubes, dimensions, and calculated members and use other functions that you can perform interactively through the Analysis Manager application. 
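For the client-side route, a few lines of VBScript are enough to open a connection and run an MDX query through ADO MD. This is only a sketch that assumes the script runs under Windows Script Host; the server name and catalog are assumptions:

' Connect to Analysis Services and run an MDX query through ADO MD.
Dim cn, cs
Set cn = CreateObject("ADODB.Connection")
cn.Open "Provider=MSOLAP;Data Source=localhost;Initial Catalog=FoodMart 2000"
Set cs = CreateObject("ADOMD.Cellset")
cs.ActiveConnection = cn
cs.Source = "SELECT {[Measures].[Unit Sales]} ON COLUMNS, " & _
            "[Time].[1997].CHILDREN ON ROWS FROM Sales"
cs.Open
WScript.Echo "Q1 1997 Unit Sales: " & cs.Item(0, 0).FormattedValue  ' first cell of the result set
cs.Close
cn.Close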
—Russ Whitney Filtering on Member Properties in SQL Server 7.0 Even if your OLAP client application doesn’t support Member Properties, you can still filter based on their values by using the virtual dimensions feature of SQL Server 7.0 OLAP Services. Virtual dimensions expose Member Properties as another dimension in which the members of the dimension are the individual values of the Member Property. After you’ve defined a Member Property in OLAP Manager, you can use that property as the basis for a virtual dimension. For example, the Store Size in the SQFT dimension in the FoodMart sample database is a virtual dimension based on the Store Sqft Member Property in the Store Name level of the Store dimension. By using OLAP Manager, you can tell the difference between a real dimension and a virtual dimension by looking at the icon in the cube editor. Figure 8 shows the three virtual dimensions based on Member Properties of the Store Name member. Virtual dimension icons have a small calculator as part of the image. Virtual dimensions include all the unique values of the underlying Member Property as dimension members, and these members aggregate to an ALL member. Thus, virtual dimensions have only two hierarchical levels. In the client application, the virtual dimensions show up as just another dimension and don’t significantly increase the size of the cube. Unfortunately, in the current release of OLAP Services, virtual dimensions are slow compared to real dimensions. Still, virtual dimensions are worth using because they let you filter OLAP queries on Member Properties even when the client application might not directly support that capability. —Brian Moran and Russ Whitney Brought to you by Microsoft and Windows IT Pro eBooks 94 A Jump Start to SQL Server BI Figure 8 Three virtual dimensions based on Member Properties of the Store Name member Improving Query Performance When members of our DBA team were preparing our data for graphing, we executed some preliminary queries to pull data from the System Monitor, generated CounterData and CounterDetails tables, and received some interesting results. First, we found that pulling data from the default table structures was slow. Then, we added a calculated field and index to CounterData and found that queries performed significantly faster when CounterDateTime was an indexed datetime field rather than a non-indexed char(24) field. (We appreciate the assistance the SQL Server Magazine technical editors gave us in figuring this out.) But when we modified the structure of the CounterData table with the appropriate indexes and calculated fields, System Monitor wouldn’t log the data at all, although our queries performed somewhat better. It turns out that System Monitor tries to recreate the tables when it finds structural changes in them. We also tried creating an INSTEAD OF trigger to route the data entry into another table. However, when we did so, SQL Server bulk-loaded the data and ignored triggers. We thought about modifying the tables, but you can’t expect assistance from Microsoft if you change the system tables, so we recommend that you don’t alter them. In the Microsoft Platform Software Development Kit (SDK) under the Performance Monitor heading (at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/perfmon/base /performance_data.asp), Microsoft describes the fields of the CounterData table as Table 1 shows. 
Table 1: Microsoft Descriptions of CounterData Table Fields

GUID: GUID for this data set. Refer to the description of the DisplayToID table to correlate this value to a log file.
CounterID: The identifier of the counter being collected. This is the key to the CounterDetails table.
RecordIndex: The sample index for a specific counter identifier and collection GUID. Each successive sample in this log file is assigned a RecordIndex value, which increases by one throughout the time the data is logged.
CounterDateTime: The time the collection of the values of this counter was started, in UTC time.
CounterValue: The formatted value of the counter.
FirstValueA: The raw performance data value used to calculate the formatted data value. Some formatted counters require the calculation of four values in the following manner: (SecondValueA - FirstValueA)/(SecondValueB - FirstValueB). The remaining columns in this table provide the remaining values in this calculation.
FirstValueB: Refer to the description of FirstValueA.
SecondValueA: Refer to the description of FirstValueA.
SecondValueB: Refer to the description of FirstValueA.

However, the description of CounterDateTime is incorrect. If you investigate the System Monitor tables CounterData and CounterDetails, you'll find that the counter names are stored in CounterDetails and the counter values are stored in CounterData, which holds one row for each counter value as it's logged. For example, if you logged 12 counters for 2 minutes, CounterDetails would contain 12 records for the names of the counters, whereas CounterData would contain 24 entries (one for each counter for each minute the data was logged). One way to make pulling data from this format more efficient and effective is to transform the data into a pivot-table format in which one column exists for the date and time and additional columns exist for each counter whose data you want to view. Interestingly, this is the same format that a System Monitor CSV file uses.
—Mark Solomon

Using SQL ALIAS for the AS/400
The AS/400 supports a file concept known as multiple-member files, in which one file (or table) can possess several different members. Each member is a part of the same file or table and shares the same schema, but the members are uniquely named and have unique data. ODBC and OLE DB have no built-in mechanism for accessing multiple members. By default, ODBC always accesses the first member in a multimember file. To enable ODBC-based applications such as Data Transformation Services (DTS) to access multiple-member files, you need to use the AS/400's SQL ALIAS statement. The ALIAS statement lets you create an alias for each member you need to access. Then, your ODBC application can access the alias, which in turn connects to the appropriate member. These SQL aliases are persistent, so you need to create them only once. The following statement shows how to create an alias:

CREATE ALIAS MYLIB.FILE1MBR2 FOR MYLIB.MYFILE(MBR2)

This statement creates a new alias named FILE1MBR2 for the multimember file MYFILE. The ODBC or OLE DB application then connects to that specific member, using the alias name FILE1MBR2 to access the second member in the file MYFILE.
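From the SQL Server side, one way to reach the alias is through a linked server. The following T-SQL sketch assumes a linked server named AS400 has already been defined over the AS/400 OLE DB or ODBC provider; the linked server name is an assumption:

-- Query the second member of MYFILE through the SQL alias created above.
SELECT *
FROM OPENQUERY(AS400, 'SELECT * FROM MYLIB.FILE1MBR2')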
—Michael Otey Brought to you by Microsoft and Windows IT Pro eBooks 96 A Jump Start to SQL Server BI Setting Up English Query English Query works best with normalized databases; however, your circumstances might mandate a structure that isn’t fully normalized. In this case, you can use views to solve problems that nonnormalized databases cause. The English Query domain editor doesn’t automatically import views. To add a view as an entity, select Table from the Insert menu and enter the name of the view. The English Query Help file provides examples of how to use views with non-normalized data. Another tip is to define a primary key for each table in your English Query application. English Query requires primary keys to perform joins between tables to satisfy user requests. If you haven’t defined keys in your database, you need to define them in the domain editor. English Query can’t build your application correctly without primary keys. When you develop English Query applications, remember that case is significant. For example, English Query knows that you’re asking about a proper noun because you capitalize the words in the query. Finally, if you’re running Internet Information Server (IIS) 4.0 with the Windows Scripting Host (WSH), the fastest way to build and deploy a test application is to run the setupasp.vbs macro from C:\programfiles\microsoftenglishquery\samples\asp2. This macro automatically installs and configures your data, so you can start testing immediately. —Ken Miller When Do You Use Web Services? Let’s say that your company uses a supply-chain application that stores your customers’ orders in a SQL Server database and keeps track of each order’s status. Currently, when customers want to know which of their orders are pending, they contact your customer-service representative, who queries the database for that information. Customers then update their ordering systems. But suppose a customer wants to streamline the process by using an application to request order status directly from your system. To enable this type of access to your system, you and the customer need to agree on the interface the customer will use to make the request and the format in which you will return the requested data. This scenario is an ideal application for Web services because you can use SOAP to build a single standards-based interface that works for many different customers with varying needs, regardless of the software applications and computing platform their enterprises use. Additionally, SOAP lets you build a loosely coupled interface that incorporates XML as the data format. (A loosely coupled application lets you reconfigure, redeploy, or relocate the implementation without affecting dependent applications.) By using XML, you gain extensibility that lets you expand the scope of the data you can provide to your customers in the future. Simply put, supplying a Web service lets you leverage the full value of XML’s standard format, extensibility, and platform independence. —Rich Rollman The Security Connection Here’s a summary of steps you can take to optimize SQL Server security and connectivity. • Use Windows-only authentication with SQL Server. • Use trusted connections instead of strings that pass SQL Server usernames and passwords. • Put the connection objects in DLLs and put them in Microsoft Transaction Server (MTS). Brought to you by Microsoft and Windows IT Pro eBooks Section II: BI Tips and Techniques 97 • Set your code to use OLE DB instead of ODBC if you’re using ADO. 
With ADO, ODBC calls OLE DB, so by using OLE DB directly, you improve performance by eliminating a processing layer. • Use TCP/IP between your IIS and SQL Server machines, not the default Named Pipes, if IIS and SQL Server are on separate servers. As Microsoft article “PRB: 80004005 ConnectionOpen (CreateFile()) Error Accessing SQL” at http://support.microsoft.com/support/kb/articles/q175/6/71.asp states, “When Named Pipes are used to access the SQL Server, IIS tries to impersonate the authenticated user, but it does not have the ability to prove its identity.” • Put your connections and stored procedure calls into Visual Basic (VB) code DLLs, install them in MTS (which will automatically pool connections for you), and create server objects in VBScript to use the connections. • Ask for the correct Microsoft department if you need help using ADO-based code to talk to SQL Server. Microsoft Technical Support not only has IIS and SQL Server experts; it also has ADO-toSQL Server experts. —John D. Lambert Brought to you by Microsoft and Windows IT Pro eBooks 98 A Jump Start to SQL Server BI Section III New BI Features in SQL Server 2005 Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 1 Building Better BI in SQL Server 2005 99 Chapter 1: Building Better BI in SQL Server 2005 Since its inception, Microsoft’s SQL Server Business Intelligence (BI) team has been guided by the overriding goal of making business data usable and accessible to as many people as possible. As the team’s general manager, Bill Baker works with the people who design and develop BI tools such as Integration Services (formerly data transformation services—DTS), Analysis Services, and Reporting Services. Baker recently talked with SQL Server Magazine about SQL Server 2005’s new BI tools and how they work together to streamline delivery of business-critical information. How are SQL Server 2005’s BI enhancements meeting Microsoft’s goals for serving the BI community? And how long has your team has been working on these enhancements? Our goal since we started the SQL Server BI team has been to give as many people as possible in every organization greater insight into their business and the market. We call this “BI for the Masses,” and with every version of SQL Server and Microsoft Office, we take further steps to make BI available to every person at every level of a company. For example, in the integration space, SQL Server 2005 Integration Services delivers far greater throughput and more data warehousing and BI functionality out of the box. In addition, customers need to analyze far more data from many more sources and with more immediacy than ever before. In the analysis space, our investments in the Unified Dimensional Model (UDM) and proactive caching move SQL Server 2005 Analysis Services beyond the niche-OLAP market and into the mainstream. Our new Report Builder in SQL Server 2005 Reporting Services opens up report authoring way beyond the Visual Studio audience we support well today. Our vision is about getting the right information, in the right format, to the people who need it—when they need it. Every BI investment we make supports that goal. 
Initial planning for SQL Server 2005 started several years ago, but we are now starting to see the fruits of our labor and are definitely in “ship mode.” Through our beta releases and Community Technology Previews (CTPs)—advance previews into the upcoming beta—we are receiving incredible customer feedback on our features and implementations. What kind of feedback have you been getting from beta testers, and which features are they most enthusiastic about? Our customers tell us they really appreciate how comprehensive our BI solution is. Our solution not only enables seamless integration of components, but it’s cost-effective, which is essential. We are getting great feedback on the BI Development Studio—formerly called the BI Workbench—which provides one development environment for Integration Services, Reporting Services, data mining, and Analysis Services. Beta testers have also praised the integration of the BI engines into SQL Server Management Studio—formerly called the SQL Server Workbench—which combines the functionality of Enterprise Manager, Query Analyzer, and Analysis Manager into one integrated tool. Beta testers also appreciated the overall ability SQL Server 2005 gives them to deploy and manage BI applications. Brought to you by Microsoft and Windows IT Pro eBooks 100 A Jump Start to SQL Server BI According to news reports, Microsoft and some large customers have deployed SQL Server 2005 Beta 2 in production environments. What is your recommendation for deploying Beta 2 and running it in production? What caveats do you have for businesses eager to move to the new version now? We’re amazed at how many customers ask us to support their Beta 2 implementation in production. Honestly, we don’t recommend it since there is no Service Level Agreement (SLA) for Beta 2, but that has not stopped several customers. So far, they are having good experiences, but our recommendation is to get experience with the beta bits, start developing your applications, and plan to roll out your applications with the final version of SQL Server 2005. How compatible are SQL Server 2000’s BI tools (OLAP, DTS, data mining) and SQL Server 2005’s new BI tools? Because some of SQL Server 2005’s BI tools—such as Integration Services—are completely rewritten, will they still work with SQL Server 2000 data and packages? This is an area where we need to be very, very clear with our customers because the choice to upgrade or migrate varies depending on the situation. Our commitment is to be transparent about what will upgrade automatically and what will require migration, and we have migration aids for any objects that don’t come over automatically. For example, we will continue to support SQL Server 2000 DTS packages running beside SQL Server 2005 Integration Services. However, if you want to use some of the new SQL Server 2005 Integration Services features or performance, you will need to migrate your packages. We do not automatically migrate DTS packages because they usually contain code, very often in script, and the new SQL Server 2005 Integration Services has newer and better ways to do what that code used to do. In some cases, the benefits of the new technology will be worth rewriting the packages. SQL Server 2000 Analysis Services supports only clustering and decision-tree data-mining algorithms. Does SQL Server 2005 add support for other algorithms? Yes. The next version of SQL Server Analysis Services will include five new algorithms in the extensible data-mining solution. 
We have a great partnership with Microsoft Research that lets us cooperate on new data-mining algorithms, so we’ve identified the most popular requests and added algorithms for association sets, neural nets, text mining, and other needs. We also made enhancements to data mining, including a set of rich, graphical model editors and viewers in the BI Development Studio. We added support for training and querying data-mining models inside the extraction, transformation, and loading (ETL) pipeline. Developers will benefit from easy integration of data mining into their applications, and analysts will receive finer-grain control over their models. We’re excited about these enhancements because they address making data mining and data quality operational. Microsoft relies on an integrated technology stack—from OS to database to user interface. How does that integration help Microsoft’s BI offerings better serve your customers’ needs? Our belief in the Windows platform is long-standing and probably well understood by now. It’s important to note that while we have an integrated offering from top to bottom, it is also an open environment. This openness is critical for BI, where much of the opportunity for our customers is in gaining additional insight and value from the operational systems they already have. All of our BI Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 1 Building Better BI in SQL Server 2005 101 platform components can read data from a huge variety of databases and applications, and they provide Web services for embedding and integrating with other applications—even on other platforms. We get strength from the integration and consistency of the elements we provide, but lose nothing in terms of openness. Our customers benefit from the flexibility of our interoperability. By using our integrated solution, customers also witness a reduction in training time, management staff, and total cost of ownership. It’s a win-win situation. SQL Server 2005 will be the first release in which database tools converge with Visual Studio development tools. Can you tell us what it took to align these two releases and what benefits customers will realize from the change? Databases and applications used to be two separate worlds, but more and more, people are recognizing the similarities between application development and database development. For instance, what interesting business application doesn’t store and access data in a database? With Visual Studio 2005 (codenamed Whidbey) and SQL Server 2005, we’ve taken the next step in melding the database- and application-development experiences. We based our BI Development Studio on Visual Studio, and all the Visual Studio features that support team and enterprise development, including source-code control and deployment, also work for the data warehouse and BI developer. We built a single environment where people can develop all of the components of a data warehousing or BI application, including relational design, ETL, cubes, reports, data mining, and even code if desired. There is no other end-to-end, professional-grade environment for BI. The introduction of the UDM is said to blur the line between relational and multidimensional database architectures. This approach is new for the Microsoft BI platform. What are the most interesting features the UDM offers? And based on your experience, what features do you think will surface as the most valuable for customers and ISVs? 
Ultimately, OLAP is cool because it brings together navigation and query. Pivoting and drilling down are really just queries. But the OLAP world has never been attribute-rich; OLAP engines have never had good ways to express attributes, and adding something as simple as a phone number to a dimension would have caused size and performance issues in earlier SQL Server releases. With the UDM, we bridge the hierarchical drill-down world and the attribute-reporting world to present a dimensional view of data without losing the rich attributes present in the data. The UDM is also the place where we express business logic, since MDX calculated members and cells are expressions of business logic. The UDM adds time intelligence, account intelligence, and key performance indicators (KPIs). You might think KPIs are only calculations, but they are much more. A SQL Server 2005 KPI includes the calculation, an expression for the goal, an expression for the trend, and a means of visualizing the results. KPIs are first-class elements of the UDM. What tools will Microsoft add to the Visual Studio 2005 IDE to help developers create and manage SQL Server (and other database platforms’) users, groups, and permissions to better insulate private data from those who shouldn’t have access? The Visual Studio and SQL Server development teams work together on integration and new methods of managing data. Our team supplies components to Visual Studio, and they supply components to SQL Server. In SQL Server 2005, we’ve added Data Insulation features to the core SQL Server engine. The end result is that developers using Visual Studio can easily create the database elements they Brought to you by Microsoft and Windows IT Pro eBooks 102 A Jump Start to SQL Server BI need for their application. For enterprise-management activities, we anticipate that people will use SQL Server Management Studio. In one of your past conference keynote addresses, you mentioned that Microsoft is adding a new set of controls to Visual Studio 2005 to permit reporting without Reporting Services. Could you describe what those controls will do, when we’ll see the controls appear in Visual Studio 2005, and where you expect them to be documented? The reporting controls will ship with Visual Studio 2005 and SQL Server 2005, and they will enable programmers to embed reporting in their applications. We support both WinForms and WebForms. Programmers will either provide Report Definition Language (RDL) and a data set to the reporting control or point to an existing Reporting Services server. We think every application of any sophistication can use at least a little reporting against data contained in the application. These controls just make it easier. What benefit does 64-bit bring to SQL Server BI, and do you think 64-bit can really help the Microsoft BI platform scale to the levels that UNIX-based BI platforms scale to today? In a word: memory. The 64-bit architecture lets customers break out of the 3GB memory limit that they have with 32-bit SQL Server, which allows for far larger dimensions in OLAP. It also enables the new ETL engine in SQL Server Integration Services to hold more data and process rows that much faster. And yes, we absolutely think we will reach into the upper ranges of scale with 64-bit. Who are some BI vendors you’re working closely with to develop 64-bit BI computing? What’s important to recognize is not which vendors support 64-bit, but that SQL Server 2005 supports both 32-bit and 64-bit on Intel and AMD platforms. 
Our customers and partners can start with 32-bit and easily move to 64-bit later or take existing 32-bit applications to 64-bit with near-total transparency. This support means our customers and partners don’t have to worry about the differences because they are quite small and well documented. Did you leave out any BI features that you planned to add to SQL Server 2005 because of deadlines or other issues? We are confident that SQL Server 2005 will offer a comprehensive BI solution to address our customers’ business problems. We’ve worked closely with our customers for several years to determine their pain points and create BI tools that provide relief. We started delivering those tools with SQL Server 7.0 and OLAP and continued with SQL Server 2000, DTS, and Reporting Services. With SQL Server 2005, customers will have the complete package to integrate, analyze, and report data. Even after all that, we still have a million ideas! We’re already dreaming of what we can do beyond Yukon, so you can bet we’ll be charged up for the next round—right after we ship SQL Server 2005. It’s too early to discuss specifics, but as always, we’ll work with our customers to determine new features and technologies. Your team puts a lot of long hours into your work on SQL Server BI. What drives you and your BI developers to invest so much personally in the product? Even when I started with the BI team 8 years ago, we said the same thing we say now: Companies improve when more of their employees use BI. “BI for the Masses” is very motivating. Unlike some Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 1 Building Better BI in SQL Server 2005 103 of our competitors, our team is not working to provide a response to competitive offerings. The team works hard for the purpose of improving our product to best meet our customers’ needs. It might sound corny, but it truly is as much a journey as it is a destination. The personal investment across the board is impressive and humbling, and I’m awed by the effort our team contributes every single day. I hope it shows in our product. Brought to you by Microsoft and Windows IT Pro eBooks 104 A Jump Start to SQL Server BI Chapter 2: UDM: The Best of Both Worlds By Paul Sanders The next release of Analysis Services, coming in SQL Server Yukon, will combine the best aspects of traditional OLAP-based analysis and relational reporting into one dimensional model—the Unified Dimensional Model (UDM)—that covers both sets of needs. Compared to direct access of a relational database, OLAP technology provides many benefits to analytic users. OLAP’s dimensional data model makes it easy to understand, navigate, and explore the data. And OLAP’s precalculation of aggregate data enables fast response time to ad hoc queries, even over large data volumes. An analytic engine, supporting the Multidimensional Expression (MDX) query language, lets you perform analytic calculations. And OLAP’s data model includes rich metadata that lets you employ user-friendly, business-oriented names, for example. However, reporting directly from the underlying relational database still has its advantages. OLAP, traditionally oriented around star or snowflake schemas, doesn’t handle the arbitrary, complex relationships that can exist between tables. Reporting on the underlying database lets you handle flexible schema. 
OLAP cubes also expose data in predetermined hierarchies, making it unfeasible to provide true ad hoc query capability over tables that have hundreds of columns. Directly accessing the relational store means that results are realtime, immediately reflecting changes as they’re made, and you can drill down to the full level of detail. In addition, by not introducing a separate OLAP store, you have less management and lower total cost of ownership (TCO). Table 1 compares the advantages of relational versus OLAP-based reporting. Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 2 UDM: The Best of Both Worlds 105 Table 1 Many relational-based reporting tools try to gain some of OLAP’s advantages by providing a user-oriented data model on top of the relational database and routing reporting access through that model. So, the many enterprises that need to provide both OLAP and relational reporting commonly end up with multiple reporting tools, each with different proprietary models, APIs, and end-user tools. This duplication of models results in a complex, disjointed architecture. Analysis Services’ new UDM, however, combines the best of OLAP and relational approaches to enhance reporting functionality and flexibility. The UDM Architecture You define a UDM over a set of data sources, providing an integrated view of data that end users access. Client tools—including OLAP, reporting, and custom business intelligence (BI) applications— access the data through the UDM’s industry-standard APIs, as the diagram in Figure 1 shows. A UDM has four key elements: heterogeneous data access, a rich end-user model, advanced analytics, and proactive caching. In tandem, these elements transform sometimes difficult-to-understand data into a coherent, integrated model. Although the UDM enables a range of new data-access scenarios, it builds on SQL Server 2000 Analysis Services, allowing easy migration from Analysis Services 2000 and backward compatibility for clients. Let’s look at the UDM’s key components in more detail. Brought to you by Microsoft and Windows IT Pro eBooks 106 A Jump Start to SQL Server BI Figure 1 The UDM provides a bridge between end users and their data Heterogeneous data access. You can build a UDM over a diverse range of data sources, not just star or snowflake data warehouses. By default, you can expose every column in a table as a separate attribute of a dimension, enabling exposure of potentially hundreds of dimension-table columns that users can drill down on. In addition, a cube can contain measures drawn from multiple fact tables, letting one cube encompass an entire relational database. The model also lets different kinds of relationships exist between measures and their dimensions, enabling complex relational schemas. This structure supports degenerate dimensions, letting users drill down to the lowest level of transaction data. You can also build a UDM over multiple heterogeneous data sources, using information integrated from different back-end data sources to answer a single end-user query. These capabilities, combined with unlimited dimension size, let the UDM act as a data-access layer over heterogeneous sources, providing full access to the underlying data. Rich end-user model. 
The UDM lets you define an end-user model over this base data-access layer, adding the semantics commonly lacking in the underlying sources and providing a comprehensible view of the data that lets users quickly understand, analyze, and act on business information. The core of a UDM is a set of cubes containing measures (e.g., sales amount, inventory level, order count) that users can analyze by the details of one or more dimensions (e.g., customer, product). The UDM builds on Analysis Services 2000’s end-user model, providing significant extensions. For example, the UDM lets you define Key Performance Indicators (KPIs), important metrics for measuring your business’s health. Figure 2 shows how a client tool might display three sample KPIs, organized into display folders. Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 2 UDM: The Best of Both Worlds 107 Figure 2 Three sample KPIs organized into display folders Advanced analytics. You can augment the end-user model by using a comprehensive, script-based calculation model to incorporate complex business logic into UDM cubes. The UDM’s model for defining calculations provides something akin to a multidimensional spreadsheet. For example, the UDM can calculate the value of a cell—say, AverageSales for the category Bike in the year 2003—based on the values in other cells. In addition, the UDM might calculate a cell’s value based not only on the current value of another cell but also on the previous value of that cell. Thus, the UDM supports simultaneous equations. For example, the UDM might derive profit from revenue minus expense but derive bonuses (included in expenses) from profit. In addition to providing the powerful MDX language for authoring such calculations, the UDM integrates with Microsoft .NET, letting you write stored procedures and functions in a .NET language, such as C# .NET or Visual Basic .NET, then invoke those objects from MDX for use in calculations. Proactive caching. The UDM provides caching services that you can configure to reflect business and technical requirements—including realtime, or near realtime, access to data while maintaining high performance. The goal of proactive caching is to provide the performance of traditional OLAP stores while retaining the immediacy and ease of management of direct access to underlying data sources. Various UDM policy settings control the caching behavior, balancing the business needs for performance with an acceptable degree of latency. Examples of possible caching policies might be • “Answer all queries by using the latest, realtime data.” • “A 20-minute latency in the data is acceptable. Where possible, use a cache that’s automatically maintained based on change notifications received from underlying data sources. If at any point the cache is more than 20 minutes out-of-date, answer all further queries directly from the underlying source until the cache is refreshed.” • “Always use a cache. Periodically refresh the cache, avoiding peak-load times on the underlying sources.” Brought to you by Microsoft and Windows IT Pro eBooks 108 A Jump Start to SQL Server BI The UDM also provides a flexible, role-based security model, letting you secure data down to a fine level of granularity. And Yukon will include a full set of enterprise-class tools for developing and managing UDMs. 
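To make the calculation model described above a little more concrete, here is a minimal client-side sketch in C#. It assumes a hypothetical [Sales] cube with [Revenue] and [Expense] measures and uses the ADOMD.NET client library; the calculated member is defined in the query itself rather than in the UDM, so treat it as an illustration of the kind of business logic MDX expresses, not as the way a UDM calculation script is authored.

using System;
using Microsoft.AnalysisServices.AdomdClient;

class UdmCalculationSketch
{
    static void Main()
    {
        // Connection string, cube name, and measure names are hypothetical.
        using (AdomdConnection conn = new AdomdConnection("Data Source=localhost;Catalog=SalesUDM"))
        {
            conn.Open();
            AdomdCommand cmd = conn.CreateCommand();

            // A query-scoped calculated member expressing "profit = revenue - expense",
            // the same kind of business logic a UDM calculation script would hold.
            cmd.CommandText =
                "WITH MEMBER [Measures].[Profit] AS [Measures].[Revenue] - [Measures].[Expense] " +
                "SELECT { [Measures].[Revenue], [Measures].[Expense], [Measures].[Profit] } ON COLUMNS, " +
                "[Product].[Category].Members ON ROWS " +
                "FROM [Sales]";

            CellSet result = cmd.ExecuteCellSet();
            foreach (Cell cell in result.Cells)
            {
                Console.WriteLine(cell.FormattedValue);
            }
        }
    }
}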
The development tools, including an MDX query editor and an MDX debugger, are integrated with other SQL Server tools for building reports and Data Transformation Services (DTS) packages as well as with Visual Studio .NET. One Model for Reporting and Analysis The UDM combines the best of traditional OLAP and relational reporting, providing a single model that you can use as the basis for all your reporting and analysis needs. This flexible model allows data access across multiple heterogeneous data sources, including OLTP databases and data warehouses. And through the UDM, users can access all data, including the lowest level of transaction detail. With the UDM’s proactive caching, you can define policies to balance performance versus the need for realtime, or near realtime, information—without having to explicitly manage a separate Multidimensional OLAP (MOLAP) store. In addition, you can define a rich end-user model, including complex analytic calculations, to support interactive and managed reporting. Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 3 Data Mining Reloaded 109 Chapter 3: Data Mining Reloaded By Alexei Bocharov, Jesper Lind The two main functions of data mining are classification and prediction (or forecasting). Data mining helps you make sense of those countless gigabytes of raw data stored in databases by finding important patterns and rules present in the data or derived from it. Analysts then use this knowledge to make predictions and recommendations about new or future data. The main business applications of data mining are learning who your customers are and what they need, understanding where the sales are coming from and what factors affect them, fashioning marketing strategies, and predicting future business indicators. With the release of SQL Server 2000, Microsoft rebranded OLAP Services as Analysis Services to reflect the addition of newly developed data-mining capabilities. The data-mining toolset in SQL Server 2000 included only two classical analysis algorithms (Clustering and Decision Trees), a special-purpose data-mining management and query-expression language named DMX, and limited client-side controls, viewers, and development tools. SQL Server 2005 Analysis Services comes with a greatly expanded set of data-mining methods and an array of completely new client-side analysis and development tools designed to cover most common business intelligence (BI) needs. The Business Intelligence Framework in SQL Server 2005 offers a new data-mining experience for analysts and developers alike. Let’s quickly review the data-mining process. Then we’ll explore the seven data-mining algorithms available in the SQL Server 2005 Analysis Services framework and briefly look at the “plug-in” technology that can help you add new and custom algorithms to that framework. Although we couldn’t specifically address the data-mining UI design here, the snapshots included in several examples will give you a good first look at the power and usability of the new client-side tools. Mining the Data The design and deployment of a data-mining application consists of seven logical steps. First, you prepare the data sources: Identify the databases and connection protocols you want to use. Next, you describe the data-source views—that is, list tables that contain data for analysis. Third, define the mining structure by describing which columns you want to use in the models. The fourth step is to build mining models. 
SQL Server 2005 gives you seven data-mining algorithms to choose from—you can even apply several methods in parallel to each mining structure, as Figure 1 shows. The fifth step is called processing—that’s where you get the mining models to “extract knowledge” from the data arriving from the data sources. Sixth, you evaluate the results. Using client-side viewers and accuracy charts, you can present the patterns and predictions to analysts and decision makers, then make necessary adjustments. Finally, incorporate data mining into your overall data-management routine— having identified the methods that work best, you’ll have to reprocess the models periodically in order to track new data patterns. For instance, if your data source is email and your models predict spam, you’ll need to retrain the models often to keep up with evolving spammer tactics. Brought to you by Microsoft and Windows IT Pro eBooks 110 A Jump Start to SQL Server BI Figure 1 A choice of data-mining algorithms Here’s a quick example of a useful mining model. Let’s say you’re interested in identifying major groups of potential customers based on census data that includes occupational, demographic, and income profiles of the population. A good method for identifying large, characteristic census groups is to use the Clustering algorithm. This algorithm segments the population into clusters so that people in one cluster are similar and people in different clusters are dissimilar in one or more ways. To examine those clusters, you can use a tool called Microsoft Cluster Viewer (a standard Analysis Services 2005 component). Figure 2 shows one of the four views, giving you a side-by-side comparison of all the clusters. For instance, Clusters 6 and 7 correspond to persons not on active military duty. But Cluster 7 represents people who work longer hours for more income; the top row also suggests that people in Cluster 7 are mostly married. Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 3 Data Mining Reloaded 111 Figure 2 One of four views using Microsoft Cluster Viewer Prediction and Mutual Prediction Suppose you’ve selected just one column (e.g., Income) in a huge data table, designated that column as Prediction target, and now you’re trying to make some predictions. But you probably won’t get far by looking at just one column. You can compute the statistical mean and the variance range, but that’s about it. Instead, select specific values for one or more other columns (e.g., Age, Years of Experience, Education, Workload in census data tables) and focus only on those data rows that correspond to the selected values. You’ll likely find within this subset of rows that the values of the target column fall into a relatively narrow range—now you can predict the values in the target column with some degree of certainty. In data-mining terms, we say that those other columns predict the target column. Figure 3 shows a snapshot of the Dependency Network (DepNet) data-mining control. This DepNet is a diagram where arrows show which of the census columns predict which others. Some of the edges (nodes) have arrows pointing both ways; this is called mutual prediction. Mutual prediction Brought to you by Microsoft and Windows IT Pro eBooks 112 A Jump Start to SQL Server BI between A and B means that setting values of A reduces the uncertainty in column B, but also the other way around—picking a value of B would reduce the uncertainty of A. 
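As a hedged sketch of how defining a model and asking it for a prediction look in practice, the C# fragment below issues DMX (the data-mining language mentioned earlier) through the ADOMD.NET client. The model name, column names, and input values are hypothetical, and the training/processing step is omitted; in a real application you would process the model from its data-source view before querying it.

using System;
using Microsoft.AnalysisServices.AdomdClient;

class MiningModelSketch
{
    static void Main()
    {
        // Server, database, model, and column names are hypothetical.
        using (AdomdConnection conn = new AdomdConnection("Data Source=localhost;Catalog=CensusMining"))
        {
            conn.Open();

            // Define a mining model in DMX (training is omitted for brevity;
            // you would train it with INSERT INTO or by processing the mining structure).
            AdomdCommand create = conn.CreateCommand();
            create.CommandText =
                "CREATE MINING MODEL CensusIncome ( " +
                "  PersonId LONG KEY, " +
                "  Age LONG CONTINUOUS, " +
                "  HoursPerWeek LONG CONTINUOUS, " +
                "  Education TEXT DISCRETE, " +
                "  Income LONG CONTINUOUS PREDICT ) " +
                "USING Microsoft_Decision_Trees";
            create.ExecuteNonQuery();

            // Prediction: ask the (trained) model for Income, given values of the predictor columns.
            AdomdCommand predict = conn.CreateCommand();
            predict.CommandText =
                "SELECT Predict(Income) FROM CensusIncome " +
                "NATURAL PREDICTION JOIN " +
                "(SELECT 38 AS Age, 45 AS HoursPerWeek, 'Bachelors' AS Education) AS t";
            Console.WriteLine(predict.ExecuteScalar());
        }
    }
}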
Figure 3 Snapshop of the Dependency Network data-mining control All Microsoft data-mining techniques can track prediction, but different algorithms make predictions in different ways. As we examine the other data-mining methods, we point out the prediction specifics of each method. Decision Trees Prediction is the main idea behind the Microsoft Decision Trees (DT) algorithm. The knowledge that a DT model contains can be represented graphically in tree form, but it could also appear in the form of “node rules.” For example, in a census decision tree for Income, a rule such as (Gender = Male and 1 < YearsWorked < 2) could describe a tree node containing the income statistics for males in their second year on the job. This node corresponds to a well-defined subpopulation of workers, and you should be able to make fairly specific predictions with regards to their income. Indeed, one of the census models gave the following formula under the condition of (Gender = Male and 1 < YearsWorked < 2): INCOME = 24804.38+425.99*( YRSSRV -1.2) +392.8*(HOURS-40.2) + 4165.82*(WORKLWK-1.022) ± 24897 Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 3 Data Mining Reloaded 113 According to this formula, INCOME is defined mostly by YRSSRV and weekly overtime. (Note that this is just an example and not based on representative census data.) To obtain this equation in a visually simple way, you could use the Decision Tree viewer to view the Income tree and zoom in on a node corresponding to the gender and yearsworked values of interest, as the typical snapshot in Figure 4 shows. Figure 4 A typical snapshot The rule and the formula we’ve discovered identify gender, years of service, years worked, weekly hours, and workload as predictors for income. Because YRSSRV, HOURS, and WORKLWK appear in the above formula for INCOME, they’re also called regressors. A decision tree that hosts such predictive formulas is called a regression tree. Time Series The Time Series algorithm introduces the concept of past, present, and future into the prediction business. This algorithm not only selects the best predictors for a prediction target but also identifies the most likely time periods during which you can expect to notice the effect of each predicting factor. For example, having built a model involving monthly primary economic indices, you might learn that the expected Yen-to-USD currency conversion rate today depends most strongly on the mortgage rate of 2 months ago and is related to the industrial production index of 7 months ago and per capita income of 6 to 7 months ago. Figure 5 shows a data-mining control called Node Legend that gives a graphical view of these dependencies. The long left-side blue bar next to Mort30 Yr (-2) indicates a negative correlation between Yen to USD and the mortgage rate 2 months ago—meaning that with time, as one value goes up, the other value goes down. Brought to you by Microsoft and Windows IT Pro eBooks 114 A Jump Start to SQL Server BI Figure 5 A data-mining control called Node Legend The purple curve (for Yen to USD) and the yellow curve (for the mortgage rate) in Figure 6 offer a nice graphical representation of this opposing movement of rates. Smaller blue bars in Figure 5 indicate that the exchange rate is to some extent self-sustaining; indeed, they highlight the fact that the rate today correlates well with the Yen-to-USD rate a month ago (coefficient 0.656) and somewhat with the rate 2 months ago (coefficient -0.117). 
So, when refinancing to a lower rate, you might consider cashing out and investing in Yen-backed securities—but first, you need to look at the prediction variances (and of course keep mum about the entire scheme). Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 3 Data Mining Reloaded 115 Figure 6 Graphical representation of rates Clustering and Sequence Clustering A new feature of Microsoft Clustering algorithms is their ability to find a good cluster count for your model based on the properties of the training data. The number of clusters should be manageably small, but a cluster model should have a reasonably high predictive power. You can request either of the clustering algorithms to pick a suitable cluster count based on a balance between these two objectives. Microsoft Sequence Clustering is a new algorithm that you can think of as order-sensitive clustering. Often, the order of items in a data record doesn’t matter (think of a shopping basket), but sometimes it’s crucial (think of flights on an itinerary or letters in a DNA code). When data contains ordered sequences of items, the overall frequencies of these items don’t matter as much as what each sequence starts and ends with, as well as all the transitions in between. Our favorite example that shows the benefits of Sequence Clustering is the analysis of Web click-stream data. Figure 7 shows an example of a browsing graph of a certain group of visitors to a Web site. An arrow into a Web page node is labeled with the probability of a viewer transitioning to that Web page from the arrow’s starting point. In the example cluster, news and home are the viewer’s most likely starting pages (note the incoming arrow with a probability of 0.40 into the news Brought to you by Microsoft and Windows IT Pro eBooks 116 A Jump Start to SQL Server BI node and the probability 0.37 arrow into the home node). There’s a 62 percent probability that a news browser will still be browsing news at the next click (note the 0.62 probability arrow from the news node into itself), but the browsers starting at home are likely to jump to either local, sport, or weather. A transition graph such as the one in Figure 7 is the main component of each sequence cluster, plus a sequence cluster can contain everything an ordinary cluster would. Figure 7 Example browsing graph of a group of visitors to a Web site Naive Bayes Models and Neural Networks These algorithms build two kinds of predictive models. The Microsoft Naïve Bayes (NB) algorithm is the quickest, although somewhat limited, method of sorting out relationships between data columns. It’s based on the simplifying hypothesis that, when you evaluate column A as a predictor for target columns B1, B2, and so on, you can disregard dependencies between those target columns. Thus, in order to build an NB model, you only need to learn dependencies in each (predictor, target) pair. To do so, the Naïve Bayes algorithm computes a set of conditional probabilities, such as this one, drawn from census data: Probability( Marital = “Single” | Military = “On Active Duty” ) = 0.921 Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 3 Data Mining Reloaded 117 This formula shows that the probability of a person being single while on active duty is quite different from the overall, population-wide probability of being single (which is approximately 0.4), so you can conclude that military status is a good predictor of marital status. 
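The conditional probabilities that Naïve Bayes relies on are simple to compute. Here is a toy C# sketch of the idea using made-up records (not census data): it estimates a conditional probability from counts and compares it with the overall probability to judge whether one column predicts another.

using System;
using System.Linq;

class NaiveBayesSketch
{
    // Toy records: (Military status, Marital status). Values are illustrative only.
    static readonly (string Military, string Marital)[] people =
    {
        ("On Active Duty", "Single"), ("On Active Duty", "Single"),
        ("On Active Duty", "Married"), ("Civilian", "Married"),
        ("Civilian", "Single"), ("Civilian", "Married"),
    };

    // Estimate P(Marital = targetValue | Military = predictorValue) from the counts.
    static double ConditionalProbability(string predictorValue, string targetValue)
    {
        var matching = people.Where(p => p.Military == predictorValue).ToList();
        if (matching.Count == 0) return 0.0;
        int both = matching.Count(p => p.Marital == targetValue);
        return (double)both / matching.Count;
    }

    static void Main()
    {
        double conditional = ConditionalProbability("On Active Duty", "Single");
        double overall = (double)people.Count(p => p.Marital == "Single") / people.Length;
        // A large gap between the two suggests Military is a useful predictor of Marital.
        Console.WriteLine($"P(Single | On Active Duty) = {conditional:F3}, P(Single) = {overall:F3}");
    }
}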
The Neural Networks (NN) methodology is probably the oldest kind of prediction modeling and possibly the hardest to describe in a few words. Imagine that the values in the data columns you want to predict are outputs of a “black box” and the values in the potential predictor data columns are inputs to the same black box. Inside the box are several layers of virtual “neurons” that are connected to each other as well as to input and output wires. The NN algorithm is designed to figure out what’s inside the box, given the inputs and the corresponding outputs that are already recorded in your data tables. Once you’ve learned the internal structure from the data, you can predict the output values (i.e., values in target columns) when you have the input values. Association Rules The Association Rules algorithm is geared toward analyzing transactional data, also known as marketbasket data. Its main use is for high-performance prediction in cross-sell data-mining applications. This algorithm operates in terms of itemsets. It takes in raw transaction records, such as the one that Figure 8 shows, and builds a sophisticated data structure for keeping track of counts of items (e.g., products) in the dataset. Figure 8 Raw transaction records Transaction ID ——————— 1 1 2 2 2 Item —— Bread Milk Bread Milk Juice The algorithm creates groups of items (the itemsets) and gathers statistical counts for them. For Figure 8’s tiny sample record, the statistics would look like Figure 9. Figure 9 Statistics for Figure 8’s records Itemset ——————— <Bread, Milk> <Bread, Juice> <Milk, Juice> <Bread, Milk, Juice> Count —— 2 1 1 1 One of the most important parameters of a model is a threshold for excluding unpopular items and itemsets. This parameter is called the minimum support. In the preceding example, if you set the minimum support to 2, the only itemsets retained will be <Bread>, <Milk>, and <Bread, Milk>. Brought to you by Microsoft and Windows IT Pro eBooks 118 A Jump Start to SQL Server BI The result of the algorithm is the collection of itemsets and rules derived from the data. Each rule comes with a score called a lift score and a certain support value larger than or equal to the minimum support. The lift score measures how well the rule predicts the target item. Once the algorithm finds the interesting rules, you can easily use them to get product recommendations for your cross-sell Web sites or direct-mail materials. Third-Party Algorithms (Plug-Ins) The seven Microsoft algorithms pack a lot of power, but they might not give you the kind of knowledge or prediction patterns you need. If this is the case, you can develop a custom algorithm and host it on the Analysis Server. To fit into the data-mining framework, your algorithm needs to implement five main COM interfaces: 1. The algorithm-factory interface is responsible for the creation and disposal of the algorithm instances. 2. The metadata interface ensures access to the algorithm’s parameters. 3. The algorithm interface is responsible for learning the mining models and making predictions based on these models. 4. The persistence interface supports the saving and loading of the mining models. 5. The navigation interface ensures access to the contents of these models. Some of these interfaces are elaborate and take getting used to, but implementation templates are available in the Tutorials and Samples part of the SQL Server 2005 documentation. 
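To show how those five responsibilities divide up, here is a purely illustrative C# outline. The actual plug-in contracts are COM interfaces whose names and signatures come from the implementation templates mentioned above; everything below is hypothetical and only sketches the shape of each role.

using System.Collections.Generic;
using System.IO;

// 1. Algorithm factory: creates and disposes algorithm instances.
interface IMiningAlgorithmFactory
{
    IMiningAlgorithm CreateAlgorithm();
}

// 2. Metadata: exposes the algorithm's parameters (e.g., a cluster count or minimum support).
interface IMiningAlgorithmMetadata
{
    string[] GetParameterNames();
    void SetParameter(string name, object value);
}

// 3. Algorithm: learns mining models and makes predictions based on them.
interface IMiningAlgorithm
{
    void Train(IEnumerable<object[]> cases);            // the processing step
    object Predict(object[] inputCase, string target);  // the prediction step
}

// 4. Persistence: saves and loads trained model content.
interface IMiningModelPersistence
{
    void Save(Stream destination);
    void Load(Stream source);
}

// 5. Navigation: lets viewers browse model content (nodes, rules, clusters).
interface IMiningModelNavigation
{
    IEnumerable<string> GetNodeDescriptions();
}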
After you implement and register your algorithm as a COM object, hooking it up to the Analysis Server is as easy as adding a few lines to the server configuration. When the algorithm is ready and hooked up, its functionality immediately becomes available through the tools in the Business Intelligence Development Studio and SQL Server Management Studio. Analysis Server treats the new algorithm as its own and takes care of all object access and query support. Dig In Analysis Services 2005 represents a complete redesign of Microsoft’s BI platform. Embracing .NET, XML for Analysis, and ADOMD.NET, it offers an array of powerful new algorithms, full-featured designers, and viewers. Even bigger news is how open and transparent the platform has become. With Analysis Services 2005’s new client APIs, plug-in algorithm capabilities, server object model, managed user-defined functions (UDFs), and complete Microsoft Visual Studio integration, there’s virtually no limit to what a motivated BI developer can do. Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 4 What’s New in DTS 119 Chapter 4: What’s New in DTS By Kirk Haselden In early 2000, the Microsoft Data Transformation Services (DTS) development team I work on started revising DTS with the goals of building on previous success and of improving the product to support user requests and to provide a richer extraction, transformation, and loading (ETL) platform. We evaluated every aspect of DTS and decided to totally rewrite it. DTS in the upcoming SQL Server 2005 release, formerly code-named Yukon, sports many brand-new features as well as enhanced ones. Because so much of DTS is new in SQL Server 2005, I want to show you some of the most important changes and the new look of the DTS Designer. When I wrote this chapter, I was working with Beta 1 of SQL Server 2005 DTS, so some features might change in upcoming betas or in the final release. But if you’re already familiar with SQL Server 2000 and 7.0 DTS releases, you’ll be able to appreciate the coming improvements. SQL Server 2005 DTS Design Goals Because comprehending everything about DTS at a glance is difficult, let’s just take a quick look at the most important goals and how the goals drove the design and feature decisions the DTS team made in SQL Server 2005. Although these descriptions are brief, they should help you grasp the magnitude of the changes. Provide true ETL capabilities. Although the data pump in pre-SQL Server 2005 DTS is useful and flexible, most users recognize that it has its limitations and needs to be revamped. For example, the data pump supports only one source and one destination per pump. True enterprise ETL requires fast, flexible, extensible, and dependable data movement. SQL Server 2005 DTS provides this capability through the Data Flow Task—or, as our team calls it, the pipeline. The pipeline supports multiple sources, multiple transforms, and multiple destinations in one fast, flexible data flow. As of Beta 1, SQL Server 2005 DTS includes 26 transforms. The Conditional Split and Derived Column transforms use an expression evaluator to support operations that provide virtually limitless combinations of functionality for processing data. 
Other transforms such as the Slowly Changing Dimension, Fuzzy Match, Aggregate, File Extractor, File Inserter, Partition Processing, Data Mining Query, Dimension Processing, Lookup, Sort, Unpivot, and Data Conversion transforms provide powerful data-manipulation capabilities that don’t require scripting. This change is a real benefit because users can develop transformation solutions faster and manage them easier than hand-coded solutions. Distinguish between data flow, control flow, and event handling. SQL Server 2005 DTS emphasizes the differences between various kinds of data processing. In current DTS releases, users are sometimes confused when they try to distinguish between data flow and control flow because both appear on the DTS Designer surface. In SQL Server 2005 DTS, the concept of data flow includes all the activities users perform to extract, transform, and load data. Control flow comprises all the processes that set up a given environment to support ETL, including executing the data flow. SQL Server 2005 DTS also has event handlers that allow nonsequential control flow execution based on Brought to you by Microsoft and Windows IT Pro eBooks 120 A Jump Start to SQL Server BI events that tasks and other objects generate inside a package. SQL Server 2005 DTS clearly distinguishes between data flow, control flow, and event handling in the UI by showing them in separate Designer surfaces. Minimize disk usage. To make DTS into a screaming fast ETL tool, we needed to eliminate unnecessary disk writes, disk reads, and memory movement. Because ETL solutions can be quite complex, they typically involve some sort of disk caching and lots of memory movement and allocations. In some cases, you can’t avoid disk usage—for example, during data extraction, data loading, or aggregation or sorting of data sets that are larger than available memory. But in many cases, moving memory and caching aren’t necessary. The pipeline helps eliminate the avoidable cases by optimizing memory usage and being smart about moving memory only when absolutely necessary. Improve scalability. To be accepted as an enterprise ETL platform, SQL Server 2005 DTS needed the ability to scale. Users in smaller shops might need to run DTS on less-powerful, affordable commodity hardware, and users in enterprise environments want it to scale up to SMP production machines. SQL Server 2005 DTS solves this scalability problem by using multiple threads in one process. This approach is more efficient and uses less memory than using multiple processes. SQL Server uses this scaling approach successfully, so we decided to use the same method for DTS. Recognize the development-programming connection. Experienced DTS users know that developing packages is much like writing code, but DTS in SQL Server 2000 doesn’t support that connection very well. However, SQL Server 2005 DTS provides a professional development environment that includes projects, deployment, configuration, debugging, source control, and sample code. Package writers will have the tools they need to effectively write, troubleshoot, maintain, deploy, configure, and update packages in a fully supported development environment. Improve package organization. As packages grow in size and complexity, they can sometimes become cluttered and unintelligible. To address users’ concerns about managing larger packages, our team added more structure for packages and provided ways to better manage the objects in each package. 
For example, the DTS runtime, which houses the DTS control flow, now has containers that isolate parts of a package into smaller, easy-to-organize parts. Containers can hold other containers and tasks, so users can create a hierarchy of parts within the package. SQL Server 2005 DTS variables are now scoped, which means that variables in a container are visible only to the container where the variable is defined and to the container’s children. Containers also help users define transaction scope. In SQL Server 2005 DTS, users can define transaction scope by configuring the transaction in a container. Because a package can have multiple containers, one package can support the creation of multiple independent transactions. Users can also enable and disable execution of a container and all its children, which is especially useful when you attempt to isolate parts of the package for debugging or for developing new packages. On SQL Server 2005’s DTS Designer surface, users can collapse containers to simplify the visible package and view a package as a collection of constituent compound parts. Variables support namespaces, which simplify identification and eliminate ambiguity in variable names. All these features let users simplify complex packages. Eliminate promiscuous package access. In SQL Server 2005 DTS, the package pointer is no longer passed in to tasks, so tasks have no way to peruse the package and its contents. This design change discourages promiscuous access and profoundly affects the way users create DTS packages in SQL Server 2005 because it enforces declarative package creation, a process similar to coding. The Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 4 What’s New in DTS 121 change also simplifies package maintenance, troubleshooting, debugging, upgrading, and editing because the package logic is exposed in the Designer, not hidden inside a task. In SQL Server 2000 DTS, tasks sometimes use the package pointer to promiscuously access the internals of the package in which they’re running. This practice is a common use for the ActiveX Script Task. Scripting is desirable, but using the ActiveX Script Task this way creates packages that are difficult to understand and troubleshoot. It also makes updating packages difficult. For example, automatically upgrading a package that uses an ActiveX Script Task to loop inside the package is difficult because an upgrade utility would have to parse the script and modify it to work against the new object model. Continued support for tasks accessing the package object model would make upgrading packages to future DTS releases difficult as well. Also, this kind of promiscuous package access isn’t advisable because in SQL Server 2005 DTS, tasks would interfere with all the services the DTS runtime provides, causing unpredictable results. Removing promiscuous access has affected the set of runtime features. Many of the new DTS runtime features provide alternative ways of performing the functions that, in earlier DTS releases, the ActiveX Script and Dynamic Properties tasks provided. SQL Server 2005 DTS includes new loop containers, configurations, property mappings, and expressions that directly target the functional void that this change creates. These new features are better supported and more consistent and manageable than solutions you code yourself. So what’s happened to the ActiveX Script Task? 
Although it has a new, more powerful UI with integrated debugging, integrated Help, autocomplete, and Intellisense, the ActiveX Script Task, like all other tasks, is limited to providing only task behavior and can no longer modify the package during execution. Isolate tasks in control flow. The focus of control-flow functionality in DTS has shifted from tasks to the runtime. Although tasks still define package behavior, in SQL Server 2005 DTS, the runtime directs all aspects of control-flow execution order, looping, package termination, configuration, event handling, connections, and transactions. Tasks in SQL Server 2005 DTS are relatively isolated objects that have no direct access to the package, other tasks, or external resources. Tasks work only in their problem domain and only with the parameters the DTS runtime passes to them. A special container called a Taskhost imposes most of these limits. Taskhosts are generally transparent to the package writer and perform default behavior on behalf of tasks. Some of the Taskhost’s benefits are subtle, but one important benefit is that it simplifies writing a task that supports the new features such as breakpoints and logging. Connection managers are another feature that extends the runtime’s control over the environment in which tasks run. Connection managers are similar to connections in DTS in SQL Server 2000 but more extensive and more important. In SQL Server 2005 DTS, tasks and other objects use connection managers for accessing all external resources, including data from databases, flat files, Web pages, and FTP servers. Using connection managers lets the DTS runtime validate, detect, and report when a connection is using an invalid source or destination. The use of connection managers also lets users more easily discover what resources a package is accessing. Because resource access is confined to connection managers and not spread throughout the package in perhaps unknown or hard-to-find properties on tasks, use of connection managers simplifies package configuration, maintenance, and deployment. Improve extensibility. The Microsoft DTS development team wrote SQL Server 2005 DTS with the understanding that it was to be a true platform. By this, I mean that users can embed DTS in Brought to you by Microsoft and Windows IT Pro eBooks 122 A Jump Start to SQL Server BI their applications, write custom components and plug them into DTS, write management UIs for DTS, or use it for its original purpose—as a utility for moving data. Extensibility is a big part of what makes the new DTS a platform instead of a simple utility. Customers can still write custom tasks and custom transforms in SQL Server 2005 DTS, but the product contains new options that let customers write tasks and transforms by using managed code written in C#, Visual Basic .NET, and other .NET languages. And SQL Server 2005 DTS still supports writing custom components with Visual Basic (VB) 6.0, C++, and other native development languages. The new release also includes more types of extensible components. Previous DTS releases provide connectivity only through OLE DB connections. SQL Server 2005 DTS includes HTTP, FTP, OLE DB, Windows Management Interface (WMI), Flat File, File, and other connections, and users can write their own connections if the ones they want aren’t available. If users want to support new protocols or even new data-access technologies, they can create new connection types to support them without modifying other DTS components. 
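To make the "DTS as an embeddable platform" point above concrete, here is a hedged sketch of loading and running a package from managed code. It assumes the managed runtime API (the Microsoft.SqlServer.Dts.Runtime namespace with its Application and Package classes) roughly as it appears in the SQL Server 2005 builds; the package path is hypothetical, and names or signatures may differ between betas and the final release.

using System;
using Microsoft.SqlServer.Dts.Runtime;

class EmbedDtsSketch
{
    static void Main()
    {
        // Pass an IDTSEvents implementation instead of null to capture
        // warnings and errors raised while the package loads.
        Application app = new Application();
        Package package = app.LoadPackage(@"C:\Packages\LoadWarehouse.dtsx", null);

        DTSExecResult result = package.Execute();
        Console.WriteLine("Package finished with result: " + result);
    }
}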
The extensible connection feature benefits Microsoft and customers; it makes adding new connections simpler for Microsoft, and customers aren’t limited to what Microsoft provides. In SQL Server 2005 DTS, the runtime intrinsically supports looping through two new looping constructs in the form of containers. The Forloop container evaluates a user-defined expression at the beginning of each iteration, and the Foreachloop container iterates once for each item in a userprovided collection by using a new type of object called a Foreachenumerator. Because these loop constructs are containers, users can place tasks and other containers inside them and execute their contents multiple times. SQL Server 2005 DTS ships with several Foreachenumerators including SQL Server Management Objects (SMO), generic collection, XML nodelist, ADO, file, and multi-element enumerators. Foreachenumerators are also extensible, so if users want to support custom collections, they can write their own enumerators. SQL Server 2005 DTS supports another new type of object called a log provider. Log providers handle all the details of creating log entries for a given destination and format. SQL Server 2005 DTS lets users easily write their own log providers if the ones that ship in the box don’t meet their needs. The new product will ship with several log provider types, including Text, XML, event log, SQL Server Profiler, and SQL Server. The pipeline is also extensible. Users can write custom data adapters and transformations that plug into the pipeline. Users can also write pipeline data-source adapters to support a particular source’s format, parse the data, and put it into the pipeline. Likewise, pipeline data-destination adapters support removing data from the pipeline and loading it to the destination. Pipeline transforms are components that modify data as it flows through the pipeline. SQL Server 2005 DTS provides several options for writing pipeline data adapters and transforms, including using native code, managed code, or the Managed Script Transform. Redesigning the Designer SQL Server 2005’s DTS Designer is more capable and powerful than those in earlier DTS releases. The new DTS Designer is hosted in the Visual Studio shell to take advantage of all the features Visual Studio provides such as integrated debugging, Intellisense, source control, deployment utilities, property grids, solution management, and editing support. These features simplify building, managing, and updating packages. As I mentioned, data flow, control flow, and event handling are separated into dedicated panes in the DTS Designer. This separation makes it easier for users to see what the Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 4 What’s New in DTS 123 package is doing and to isolate parts of the package. SQL Server 2005 DTS supports debugging with features such as breakpoints, watches, errors, warnings, informational messages, and progress notifications. Packages now return more targeted and informative error messages that are visible in various locations in the Designer. Improvements to many UI features make the entire workspace better. Experienced Visual Studio users will quickly feel at home in the new DTS Designer because it’s so similar to other Visual Studio applications. But regardless of whether users are familiar with the environment, it’s intuitive enough that they’ll be comfortable working with it in no time. 
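Before turning to the Designer itself, here is a hedged skeleton of the managed-code extensibility described above: a custom task written in C#. It is a sketch based on the managed task base class as documented for SQL Server 2005 (deriving from Task and overriding Validate and Execute); the exact signatures may differ across betas, the class name is hypothetical, and the task does nothing beyond raising an informational event.

using Microsoft.SqlServer.Dts.Runtime;

// Hypothetical custom task; name and behavior are illustrative only.
public class HelloTask : Task
{
    public override DTSExecResult Validate(Connections connections,
        VariableDispenser variableDispenser, IDTSComponentEvents componentEvents,
        IDTSLogging log)
    {
        // Confirm the task can run; a real task would check its own properties here.
        return DTSExecResult.Success;
    }

    public override DTSExecResult Execute(Connections connections,
        VariableDispenser variableDispenser, IDTSComponentEvents componentEvents,
        IDTSLogging log, object transaction)
    {
        bool fireAgain = true;
        componentEvents.FireInformation(0, "HelloTask", "Hello from a custom task.",
            string.Empty, 0, ref fireAgain);
        return DTSExecResult.Success;
    }
}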
Let’s look at a few important new features of the SQL Server 2005 DTS Designer. Designer control flow. Figure 1 shows the Control Flow view in the SQL Server 2005 DTS Designer’s Business Intelligence Workbench. In the left pane of the window is a toolbox containing all the available tasks. Double-clicking a control-flow item or dragging it onto the Designer surface adds a new instance of the selected task or container to the control flow in the package. Figure 1 The Control Flow view On the Control Flow tab, you can see a model of the sequence container. In the model, the Send Mail Task has an arrow beneath it. To create a precedence constraint, a user needs only to drag the arrow to another task. The Connections tab, which lists a package’s data source connections, is in the pane below the design surface. The information in the Connections tab makes connections easier to find and clarifies the control flow. In SQL Server 2000 DTS, connections and tasks are combined Brought to you by Microsoft and Windows IT Pro eBooks 124 A Jump Start to SQL Server BI on one Designer page and are easy to confuse. Our team eliminated this confusion by visually separating connections from tasks. The list in the Variables pane at the bottom of the window includes each variable’s scope and data type. The right two panes in Figure 1 are the Solution Explorer and the Properties grid. SQL Server 2005’s DTS Designer supports Visual Studio projects, which keep track of files and settings related to the environment and the project files. The Solution Explorer provides a central location for managing projects. In this DTS Designer pane, you can manage Analysis Services and Reporting Services projects so that you can work with your cubes, reports, and packages in one solution. The Properties grid is a powerful tool for modifying packages. With it, you can view and edit the properties of any object visible within the DTS Designer, including tasks, precedence constraints, variables, breakpoints, and connections. The sample package on the Control Flow tab in Figure 1 shows how you can embed containers inside each other. The package has a Foreachloop container that holds an XML Task and a Sequence Container that holds a set of tasks. In SQL Server 2005 DTS, when you delete a container, you also delete all the tasks and containers it holds because variables and transactions are created on containers. So, a transaction on the Sequence Container would be scoped only to that container’s tasks and containers and wouldn’t be visible outside the Sequence Container to tasks or containers such as the Foreachloop or XML Task. This change makes SQL Server 2005 DTS more flexible than DTS in SQL Server 2000, in which users can create transactions only at the package level. Designer data flow. Figure 2 shows the DTS Designer’s Data Flow tab, which you can access by clicking the tab or double-clicking a Data Flow Task. This view is similar to the Control Flow view, with a few differences. When the Data Flow view is active, the toolbox in the left pane shows Data Flow Items, including data-source adapters, transforms, and data-destination adapters. To use these tools, users double-click them or drag them to the Designer surface. Figure 2 also shows the Output pane. DTS requires validation, which means that a component must confirm that it can successfully run when the package calls its Execute() function. 
If a component can’t run, it must explain why—DTS components communicate warnings, errors, or other information by raising events during package validation and execution. The SQL Server 2005 DTS Designer captures such events in the output window. Brought to you by Microsoft and Windows IT Pro eBooks Section III: New BI Features — Chapter 4 What’s New in DTS 125 Figure 2 DTS Designer’s Data Flow tab The Properties grid in Figure 2 shows a couple of interesting links. The Show Editor link, like the link of the same name on the Control Flow view’s Properties grid, opens the editor for the currently selected transform. The Show Advanced Editor link shows a generic editor that lets users edit transforms that have no custom UI. Because transforms in a Data Flow Task don’t execute in sequence, the DataView instead provides Data Viewers, UI elements that let users view data while it’s passing between transforms. Data Viewers are a powerful debugging feature that helps package writers understand what’s happening inside the pipeline. Migration Pain After reading about all the improvements, changes, and new features in DTS, you might wonder how the new product will work with legacy DTS packages. You might even anticipate problems with upgrading pre-SQL Server 2005 packages—and you’d be right. Early in the redesign of DTS, when we realized that we had to change the object model drastically, we also realized that the upgrade path from SQL Server 2000 DTS to SQL Server 2005 DTS would be difficult. After a lot of sometimesheated discussion, we decided that our customers would benefit most if the product was free from the limitation of strict backward compatibility so that the next generation of DTS would be based on Brought to you by Microsoft and Windows IT Pro eBooks 126 A Jump Start to SQL Server BI a more flexible design. Customers we spoke to told us that this choice was acceptable as long as we didn’t break their existing DTS packages. By now you’ve probably guessed that some of your packages won’t upgrade completely. However, we’ve provided some upgrade options that you can use to help ensure a smooth migration to SQL Server 2005 DTS. The first option is to run your existing packages as you always have. The SQL Server 2000 DTS bits will ship with SQL Server 2005, so you’ll still be able to execute your SQL Server 2000 DTS packages. The second option is to run SQL Server 2000 DTS packages inside SQL Server 2005 packages. You can do this by using the new ExecuteDTS2000Package Task, which wraps the SQL Server 2000 package in a SQL Server 2000 environment inside the SQL Server 2005 package. The ExecuteDTS2000Package Task will successfully execute your legacy packages and is useful in partial-migration scenarios while you’re transitioning between SQL Server 2000 and SQL Server 2005 DTS. If you want to upgrade your packages, you have a third option. SQL Server 2005 DTS will ship with a “best effort” upgrade wizard called the Migration Wizard that will move most of the packages that you generated by using the SQL Server 2000 DTS Import/Export wizard. If you have an ActiveX Script Task or a Dynamic Properties Task in your package, it probably does something that SQL Server 2005 DTS no longer allows, such as modifying other tasks or modifying the package. The migration wizard won’t be able to migrate those parts of the package. However, you can migrate packages a little at a time because SQL Server 2005 DTS will support SQL Server 2000 DTS side-byside execution. 
Fresh Faces, SDK, and Other Support Except for some new names, the Import/Export Wizard and command-line utilities remain largely unchanged. DTSRun.exe is called DTExec.exe in SQL Server 2005 DTS. DTSRunUI.exe is called DTExecUI.exe, and it features a face-lift. We also added a new command-line utility called DTUtil.exe, which you can use for common administrative tasks such as moving, deleting, and copying packages, as well as checking whether a package exists. In addition, we included a new configuration wizard called the Package Configurations Organizer for creating package configurations, project-development capabilities that bundle a package with its configuration, and a self-installing executable for deploying packages to other machines. SQL Server 2005 DTS might ship with a software development kit (SDK); as of Beta 1, the plan for the SDK is still undefined. Some features you might expect are a task wizard, a transform wizard, and other component-creation wizards. More information about the SDK should be available as SQL Server 2005 gets closer to shipping. That's the whirlwind tour of SQL Server 2005 DTS. As you can see, most of the concepts remain the same, but the product is brand-new. Indeed, by the time you read this, DTS might even have a new name that reflects that fact.
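For readers who prefer code to command lines, the managed runtime exposes roughly the same administrative operations DTUtil.exe performs. The sketch below is an assumption-laden illustration: the server name and package path are hypothetical, and the Application methods (ExistsOnSqlServer, SaveToSqlServer, RemoveFromSqlServer) reflect the API as it appears around the SQL Server 2005 release and may differ in earlier betas.

using System;
using Microsoft.SqlServer.Dts.Runtime;

class PackageAdminSketch
{
    static void Main()
    {
        Application app = new Application();

        // Check whether a package exists in SQL Server storage (a DTUtil-style existence check).
        bool exists = app.ExistsOnSqlServer("LoadWarehouse", "localhost", null, null);
        Console.WriteLine("Package exists: " + exists);

        // Copy a package from a file into SQL Server storage (a DTUtil-style copy/move).
        Package package = app.LoadPackage(@"C:\Packages\LoadWarehouse.dtsx", null);
        app.SaveToSqlServer(package, null, "localhost", null, null);

        // Delete a package from SQL Server storage.
        app.RemoveFromSqlServer("LoadWarehouse", "localhost", null, null);
    }
}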