An Analysis Services Case Study: Using Tabular Models in a Large-scale Commercial Solution

An SSAS Tabular Case Study with inContact, Inc.

Writer: Alberto Ferrari (SQL Server MVP and BI Consultant at SQLBI.COM)
Contributors: John Sirmon, Heidi Steen
Technical Reviewers: Owen Graupman (inContact, Inc.), Lars Tice (inContact, Inc.), Marco Russo (SQL Server MVP and BI Consultant at SQLBI.COM)
Published: May 2014
Applies to: SQL Server 2012 and 2014 Analysis Services, Tabular

Summary: inContact, Inc., the leading provider of cloud contact center solutions, replaced a costly reporting and data analysis component with an Analysis Services Tabular and Microsoft reporting implementation. An SSAS tabular model is an integral part of the architecture of inContact's pay-as-you-go cloud service that provides standard and analytical reporting as part of their service offering. This case study explains how we met inContact's stringent business requirements by designing a tabular solution that runs well on commodity NUMA hardware, in private data centers worldwide.

Copyright

This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it. Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred. This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

© 2014 Microsoft. All rights reserved.

Contents

Introduction
inContact Solution Design
  Data Model Requirements
    Distinct Counts
    Time Zones
    Near Real-Time
  Capacity Planning Requirements
  Client Tool Requirements
  Considering Options for the Data Warehouse
    SQL Server and Columnstore indexes
    SSAS Multidimensional
    SSAS Tabular
  Considering the Reporting Solution
  Building the Model
    Memory Sizing and Compression ratio
    Query Speed
    Scalability
  Achieving Near Real-Time Processing
    Incomplete Contacts
    Partitioning
    Processing Strategy
Evaluating Hardware Options
  The complexity and ubiquity of NUMA
  Preparing the Test Cases
    Setting Node Affinity
    Preparing the Test Queries
  Measuring results
    Azure VM Feasibility Testing
  Testing scalability of the system
    Preparing the Test Environment
    Computing Maximum Number of Users Allowed
End Result: Cost-effective architecture that keeps up with projected growth
  Hardware Design Decisions
  Software Design Decisions
Best practices and lessons learned
Conclusion
“As a consultant for SQLBI, I spend a lot of time with customers who develop BI solutions. Sometimes, I have the good fortune of finding an engagement that has all the hallmarks of a great project: a smart team, a challenging problem, and a willingness to use the newest technology to achieve the business objective. When it happens, the story is worth telling. With inContact, in 2013, it happened and this is their story.” - Alberto Ferrari, BI Consultant, SQLBI

Introduction

“Press 5 for more options…” might have sounded odd when first introduced, but today most of us don’t think twice about using a phone menu to access technical support, check a bank balance, or pay a bill, nor do we fail to recognize the importance of having this capability in the businesses we work in and with every day. inContact, Inc., provides a pay-as-you-go cloud service aimed at managing all aspects of a contact center. inContact’s multichannel Automatic Call Distribution (ACD) is at the core of a 100% cloud, end-to-end solution that spans everything from the collection of customer interaction data to the analysis and reporting of critical contact center metrics. Their customer base operates in over 125 countries worldwide and thus requires:

- Data centers on multiple continents and time zones.
- A multi-tenant architecture that can stretch with a growing customer base.
- Strict service level agreement (SLA) requirements that assure customers of a fast response when viewing reports about their own call center operations.

Since 2011, inContact’s volume of transactions has nearly doubled year over year, placing a commensurate demand on the infrastructure (both software and hardware). BI reporting workloads are an integral component of all inContact solutions. In fact, their services include several analytical tools and reports that provide customer-specific insights into contact center performance. By 2012, as workloads grew, declining performance and a high total cost of ownership in the existing reporting solution prompted inContact to evaluate alternatives, even if it meant a change in the underlying technology and a major redesign of the data warehouse. Ultimately, inContact chose SQL Server Analysis Services (SSAS). This whitepaper provides a detailed description of their experience in moving their BI reporting solution to an SSAS Tabular implementation. In this paper, we will explain how we navigated through multiple challenges before arriving at a successful outcome. For anyone who is evaluating Tabular models, we hope this paper can shed some light on the investigation and proof-of-concept testing required before going into production with this specific technology.

inContact Solution Design

To set the context, let us quickly review the solution that inContact offers. For all customers utilizing the inContact pay-as-you-go call center software, data is collected on backend SQL servers in OLTP databases, where it is kept for a limited time (a few months). From the OLTP, the data moves steadily into a near-real-time data warehouse on SQL Server, where it is expected to remain for at least 10 years. A set of SSIS tasks, scheduled every five minutes (less, on some critical tables), copies the data from the OLTP to the data warehouse. Once data reaches the data warehouse, additional steps are required before users can effectively generate their reports. In the following figure, you can see a high-level representation of the data flow:
Figure 1. Data moves from OLTP to DWH, and then to data marts, before users can create reports.

The data warehouse needs to be near-real-time because no reporting is allowed on the OLTP database. All reports run on top of either the DWH or the data marts. Data marts contain pre-aggregated and pre-shaped information, while the DWH contains all the detailed information about the business. The data marts are used as the source for an in-memory database accessed by users who produce reports. The frequency of updates to the DWH and the data marts can be adjusted to smooth out performance problems or refresh larger tables. Larger tables tend to be updated more frequently, to reduce the number of rows handled per run.

The architecture is multi-tenant and geographically dispersed to support a global customer base. Thus, the databases are kept on separate servers, where each server supports between 200 and 500 customers. Each tenant contains both the OLTP and the DWH databases and might run a different version of the inContact proprietary software. All tenants are currently on servers in private data centers managed by inContact. inContact wants to build an architecture that could move to Microsoft Azure in the future, allowing more scalability and further reducing the total cost of ownership of the servers. The latter is not a strict requirement, but something the company is investigating as a long-term option.

The databases for the OLTP, the DWH, and the data marts run on the Microsoft SQL Server database engine. A third-party in-memory database application runs on the data marts. Finally, users build reports using a custom client tool that provides a graphical UI to translate report definitions into queries in the underlying engine’s native language. Our primary goal in redesigning this architecture was to replace the analytical reporting component, building in scalability and cost savings every step of the way.

Data Model Requirements

The data model is an analytical description of contact center activities, created to support standard reporting, ad hoc exploration, and detailed investigation. Of utmost importance is the contact itself (either phone call or electronic) between the contact center and the people calling in. A contact can be initiated either by the contact center (Outbound) to give information, or by the customer (Inbound), typically to seek assistance. A contact is media agnostic and can change its media over time. For example, a first request for assistance by e-mail can progress to a phone call. During its execution, it might be handled by any number of operators, and assume different status attributes over time. For example, a phone call starts as “received”, is then “handled”, might spend some time “on hold”, and finally reaches a “resolved” status. Admittedly, this description is an oversimplification of what is really recorded by the software, but it is enough to help us understand the concept of “contact status”, which is central to the inContact data model. In the following figure, you can see a simplified version of the data model:

Figure 2. Simplified version of the data model: an Agent Log fact table related to the Agents, Calendar, Media, and ContactStates dimensions.

The software generates millions of status changes, where each row contains the status, start time, and end time. Most of the analytical metrics take the form of “how many contacts have held this status, in this time range, and for how long”. An example of such a metric is, “how many contacts were answered between 9:00 AM and 9:30 AM”.
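To make the shape of these metrics concrete, here is a minimal DAX sketch. It is not the actual inContact implementation: the table and column names (Fact, ContactKey, ContactStates[ContactState], Calendar[Range15Minutes]) loosely follow the simplified model in Figure 2 and the examples used later in this paper, and the literal values ("Answered", "09:00", "09:15") are purely illustrative.

-- Base measure: how many distinct contacts are visible in the current filter
-- context. It must be a distinct count because each contact appears once per
-- status change (see the Distinct Counts discussion below).
Contacts := DISTINCTCOUNT ( Fact[ContactKey] )

-- "How many contacts were answered between 9:00 AM and 9:30 AM"
Contacts Answered 0900 To 0930 :=
CALCULATE (
    [Contacts],
    ContactStates[ContactState] = "Answered",
    Calendar[Range15Minutes] = "09:00"
        || Calendar[Range15Minutes] = "09:15"
)

In a real report, the status and the time range come from the user's selection rather than from hard-coded filters; the point is simply that the base aggregation is a distinct count, and essentially every derived metric builds on it.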
The data model presents some interesting challenges in the form of distinct counts, time zone handling, and near real-time reporting requirements.

Distinct Counts

As noted, a single customer contact is presented to the system repeatedly as it moves through various states. If you want to compute the number of contacts in a period, you must implement the calculation as a distinct count, because a simple count would sum all the transition statuses of the contact, leading to wrong results. Distinct counts are the most important requirement in this data model; in fact, they are everywhere. The base measure (number of contacts) is a distinct count, and all the metrics derived from it are, again, based on distinct counts. For example, a metric showing the number of waiting contacts divided by the total number of contacts is computed as the ratio between two distinct counts. More complex formulas might require the computation of tens of distinct counts over the data set before arriving at the final value. A complex system like inContact requires nearly 300 different measures, all of which are based on distinct counts. As we will see when we get to the data warehouse section, SSAS Tabular was the only analytical data engine that could deliver solid performance on distinct counts, allowing us to adopt a simple modeling design that promotes flexibility within the system. Thus, the distinct count issue was one of the major drivers towards a Tabular solution.

Time Zones

For inContact, date and time considerations are of paramount importance. Although dates and times are stored in the model in UTC (coordinated universal time), most customers need reports and calculations in different time zones, typically their own. Furthermore, a single report might contain contacts across multiple time zones, if the contact center is handling customers from different parts of the world. The problem of time zones is complicated by daylight saving time (DST). In fact, DST changes in different periods for different time zones. Moreover, in a single time zone, different countries apply different DST policies. A single contact typically spans a single DST period (there are some exceptions for very long contacts, but they are not relevant). The problem arises when you want a report over a long period, like one month or one year. In such a case, the report will contain contacts in different DST periods. For example, a contact on the 1st of March at 9:00 AM might correspond to 1:00 AM in UTC, while a contact on the 2nd of March at 9:00 AM corresponds to 2:00 AM in UTC, due to the change in DST. Time zone requirements need special handling in several aspects of the architecture:

- When filtering data, you need to build a filter that takes into account time zones and, most important, daylight saving time, as outlined before.
- When building the final report, you need to convert UTC times into the requested time zone. This step can be performed by a custom UI but, if we want to use plain Excel PivotTables to query the model, the conversion should be implemented in the data model.
- Because we want to build time intelligence functionality, all time intelligence calculations need to take into account time zones and, again, DST.
In fact, if you compare a time range across two periods, each with a different DST offset, all the calculations need to be adjusted to account for DST.

The naïve solution to time zone handling is the creation of multiple columns for the timestamp of the status change of the contact. Basically, you store not only the UTC time, but also two columns for each time zone (both the date and the time, because both can change with the time zone). Unfortunately, this solution is not viable in a large implementation like inContact. The reason is that there are around 40 different time zones in the world (yes, there are only 24 hours in a day, but there are time zones offset by half an hour or a quarter of an hour, and this is further complicated by DST, which might be different for different countries in the same time zone). Thus, following the canonical solution of duplicating columns would mean adding at least 40 pairs of date and time columns to the data model. When dealing with tens of billions of rows, adding 80 columns is not the way to go. Thus, if we want to be able to report in any time zone, then all of the calculations need to be performed at query time, depending on the time zone requested by the user. This, of course, makes the overall system much more complicated, but it was a challenge that we knew could be solved in the model.

Near Real-Time

The data warehouse is updated from the OLTP in near real-time. Most tables are updated every fifteen minutes; some of them (the more important ones) are updated every five minutes. One of the requirements of this project was to be able to update the analytical system every 15 minutes. Having a near real-time system is important because it completely frees the OLTP from having to handle analytical queries. Moreover, analysts using the data can be confident that the analytical system is up-to-date. Perhaps the most salient point of all: it is very unlikely that anybody is going to be interested in doing a deep analysis of the last 15 minutes anyway. Should it become necessary to do so, analysis on the last 15 minutes could always be computed on the OLTP database in an efficient way, due to the small amount of data to analyze.

Capacity Planning Requirements

The inContact growth rate is impressive. It has consistently doubled the number of rows per month over the last several years. When we performed the initial evaluation, the OLTP produced nearly 200 million rows per month, but with expected growth, that would lead to 400 million after one year, 800 million the year after, and so on.

Figure 3. Rows per month, showing impressive growth in the data produced by the software.

inContact needed a solution that could accommodate projected growth over a five-year trajectory. Thus, it was necessary to design a data model capable of handling tens of billions of rows and still produce results very quickly.

Client Tool Requirements

Among the many factors under consideration, the client tool was important because the escalating cost of the client components was already driving the redesign of the existing solution. The component to be replaced was a web application that let users build queries and view the results in a simple browser. The cost of maintaining and developing that architecture could be significantly reduced by switching to Reporting Services and the Microsoft platform.
One of the reasons why the Microsoft stack was chosen as a candidate for the new architecture is the ability to use Reporting Services to produce static reports, and Excel for data exploration and ad hoc reporting. Users could create their own reports and integrate metrics coming from other custom databases. Using Excel would free inContact IT from the need to store custom metrics in the model. This would better serve their largest customers by lowering the price of the service: a clear win-win scenario.

Considering Options for the Data Warehouse

The data warehouse is the reporting database that streams data to predefined and ad hoc reports. We started the project by evaluating the available technologies that could potentially satisfy the requirements of the reporting database while keeping costs under control. We looked at three options:

- A relational SQL Server database with columnstore indexes
- SSAS with a Multidimensional solution
- SSAS with a Tabular solution

As with any project, when evaluating different technologies, there is never enough time to fully investigate every question and concern. Thus, for example, we did not spend too many cycles on the multiple time zone issue, as this would take away from higher priority problems. Instead, we focused on the tradeoffs of each technology, building small databases to run tests, check assumptions, and confirm performance. Knowing which problems to focus on, and which ones you can defer, is where the experience of the team working on the project makes the most difference. In the next several sections, we will drill down into the considerations that led to the final choice of Tabular.

SQL Server and Columnstore indexes

The first option we evaluated was that of designing the data warehouse using SQL Server, building a columnstore index on the fact table and relying on the tremendous speed of columnstore queries to build all of the reports. There are several advantages to using plain SQL Server, along with columnstore indexes, to build the solution:

- Very quick processing time. Processing the model means rebuilding the last partition of the fact table: both the table content and the columnstore index. The operation executes very quickly by re-creating the last partition in a temporary table, creating the index, and then switching the temporary table in as a partition of the fact table.
- Using SQL as the query language. SQL is very well known and very powerful. Modifying the web-client tool to use SQL instead of the previous language was a good option.
- Single storage. By using SQL tables as the main analytical store, we remove the processing time of SSAS and make it easier to handle near real-time. In fact, using SQL Server, we could go for a fully real-time model.

On the other hand, some important disadvantages had to be taken into account:

- Lack of metadata. SQL Server does not contain a metadata layer, as SSAS does. Client tools like Excel will not be able to connect to SQL Server and build a PivotTable on top of it. This means taking a dependency on the custom UI provided by inContact as part of their analytical tool.
- Security. SQL Server does not have an easy mechanism to build row-level security, which is needed by the solution. Again, the custom UI could be used to work around this, but recall that replacing the UI was an important design goal of this project.
- Unpredictable query speed. In SQL Server 2012, specific query plans can really take advantage of columnstores (which are an absolute need for the volume of data of inContact).
However, as soon as the query became complex (for example, mixing SUM with DISTINCT COUNT in the same query), SQL Server stopped using the columnstore and reverted to row mode, leading to poor performance. It looked nearly impossible to produce a general-purpose query engine that gave users enough freedom in expressing their queries while retaining efficiency.
- Inadequate cache. A good cache can really make a difference in any analytical engine. In fact, by efficiently using the cache, you can greatly improve single query execution time and, by extension, the scalability of the system. SQL Server is not read-only and thus does not have the efficient cache system that is needed for the inContact system.

Thus, after building a simple POC, we decided that columnstore was not an option, and we turned our attention to SSAS.

SSAS Multidimensional

Multidimensional was the second option we tried. This solution would include a custom-designed data mart, created to improve SSAS performance, and an OLAP cube built on top of that data mart. We knew that, due to the volume of data, we would need to build a data mart specifically designed for the OLAP cube requirements. Multidimensional modeling is a mature, proven technology that we know well, so we knew how to evaluate its performance by studying the project characteristics and cube requirements, eliminating the need to build a POC or even a simple prototype. SSAS Multidimensional has many advantages:

- Complete and rich metadata layer. Using SSAS, we would gain a very rich metadata layer with measures and hierarchies, plus the ability to use Excel to query the model.
- Security. SSAS Multidimensional has everything needed to build a very secure environment satisfying the most complex scenarios.
- Good cache. The SSAS cache would answer the most frequently used queries immediately.
- Well-known technology. SSAS being a mature technology, we did not expect any surprises. We knew what the system had to offer and how to optimize it.

On the other hand, there were these drawbacks to consider:

- Processing speed. Apart from using ROLAP partitions, which was not an option in this case, obtaining near real-time with SSAS is a pain, mainly because of the dependencies between measure groups and dimensions, which require complex processing strategies and, in the case of reprocessing of a dimension, the invalidation of existing aggregations.
- Distinct counts. This is probably the major drawback of SSAS Multidimensional in this scenario. Because most of the metrics make use of distinct counts, optimizing them would have been nearly impossible. Finally, the presence of distinct count measure groups would lead to longer processing times, affecting the near real-time requirement.

SSAS Tabular

The new Tabular engine of SSAS seemed promising, because it shares most of the advantages of SSAS Multidimensional (in terms of metadata layer, security, and cache usage). Nevertheless, there are other considerations to take into account:

- In-memory database. The database needs to fit in memory. This is a formidable requirement, especially when the final dataset is going to be huge. A careful analysis of current and expected size is paramount in a project like this.
- Distinct counts. Tabular shines at computing distinct counts. There is still a need for optimization, but the performance is great out of the box, and no additional processing time is required.
- Processing speed. Having no aggregations, Tabular can reprocess both fact tables (partitions) and dimensions very quickly.
Moreover, having the ability to partition dimensions contributes to a simple processing strategy that yields satisfactory processing time. This is a real life-saver for the near real-time requirement.
- Security. The security layer of Tabular is slightly less powerful than that of Multidimensional, and this requires thoughtful consideration. Security must be designed and reviewed, taking Tabular features into account.
- Scalability. Tabular is an engine that requires the full power of the server during a query, running all the cores up to 100% for a single query. How would Tabular perform in a multi-user environment, when hundreds of users are concurrently querying the server? This was a major concern and something we needed to thoroughly understand.
- New technology. Being new, Tabular was an exciting path to follow, but we knew that we were entering a world of unknowns. It was like seeing a big, flashing “Danger Ahead” sign.

In the final analysis, we decided to build a full POC of the Tabular solution, as it looked like the most promising architecture. We knew there were still open issues we would run into during the design phase, but nothing that looked insurmountable. Testing would prove whether Tabular could handle the scale we required, while still delivering the performance we expected.

Considering the Reporting Solution

By this point in the project, the reporting system became an easy choice. We had requirements for static reports and ad hoc data exploration, and the features available through existing software licenses filled those requirements nicely:

- SQL Server Reporting Services (SSRS) provides all the necessities for creating predefined, static reports. We used DAX queries in the reports, as DAX provides excellent performance and gives us the ability to fine-tune the queries, with full control over the query plan.
- Excel met our needs for ad hoc reporting. We gave customers direct access to their own reporting data with Excel. In this way, they had the ability to create their own reports.

Having chosen SSAS Tabular as the technology, we stopped worrying about the custom UI used to build the queries. In fact, it quickly became apparent that users loved the ability to use Excel, long familiar to the analysts already using it to compute statistics. Thus, everybody was very happy about the ability to simply open a PivotTable and start browsing data. Finally, the Excel option opened up new opportunities with the usage of Power Pivot. An advanced Excel user can now easily extract data from the database and build custom data models, integrating data from different sources in complete freedom.

Building the Model

Building the model, in this scenario, was an easy task, because we only needed to follow canonical best practices to create a fact table and a set of slowly changing dimensions. Apart from security-related tables and some minor functional details, the final data mart is a perfect star schema.

“It is worth noting, before going further with the details of the implementation, that an inContact database contains data for hundreds of customers, which it classifies as business units. A business unit can combine multiple customers (think of large companies having several branches, where each branch is a customer, joined together in a single business unit). All the analysis needs to be at the business unit level. In other words, each query will always select, out of the whole database, data for a single business unit.
This small detail turned out to be the major factor that drove the decision to partition databases at the business unit level.” - Alberto Ferrari, BI Consultant, SQLBI

In building the proof of concept of the Tabular model, we had to check some critical points:

- Memory sizing. This is normally the toughest problem whenever we develop a Tabular solution. The challenge lies in forecasting the future memory requirements of the model over time.
- Compression ratio. Tightly coupled with memory sizing, we needed to find the best compression ratio for the model, in part by carefully reducing the number of distinct values for columns and finding the optimal sort order.
- Query speed. We knew that Tabular is fast, but we needed to check the speed of complex queries, so as to be confident of the performance of the full model across a range of scenarios.
- Scalability. Probably the biggest issue for inContact, given the expected growth rate, was designing a system that could keep up with company growth. Moreover, during the period in which we were prototyping, we felt the chill of exploring uncharted territory; there was no shared experience to reassure us that we would succeed in using Tabular in a large multi-user environment like the one we were building.

Let us consider each of these points in detail.

Memory Sizing and Compression ratio

In order to predict the memory requirement for the final data model, we had to draw a curve that shows the ratio between the number of rows in the model and the memory required to store the database. In Tabular, this is never an easy task. In fact, the memory requirement can vary greatly depending on the compression ratio. Furthermore, finding the best compression ratio is typically a matter of trial and error.

“The problem with Tabular being in-memory is not the fact that the database will not fit in memory. As of today, you can easily buy terabytes of RAM at a good price and many databases would fit in such space. The problem is that RAM is a finite resource. Once you have 2TB of RAM, it is not easy to increase it to 2.5TB, in case you made some mistake in the initial evaluation. This is why it is important to forecast memory consumption: you can buy as much RAM as you want, but once you have it, you cannot easily increase it.” - Alberto Ferrari, BI Consultant, SQLBI

It is useful to recap how Tabular stores data in a model. Tabular is a columnar database, meaning that data is not stored on a row-by-row basis. Instead, each column is stored as a separate data structure, and then linked back together by the engine on demand.

Figure 4. Column-oriented databases store data column-by-column, not row-by-row.

As you can see in Figure 4, the Customers table is split into several columns and each column is stored as a separate entity. When you query the balance due by customer, Tabular reads the two columns (Name and Bal Due), joins them, and runs the query. Each column is compressed. The compression algorithm of Tabular is proprietary, but an example using the RLE (Run Length Encoding) algorithm might help in understanding the topic. As noted, each column is a separate data structure, so think of a single column for this example. If you take a column from a fact table containing the quarter of the year of the sale date, it will contain many repetitions of the same value. If you had one million sales in a year, it is likely that the value “Q1” repeats roughly 250,000 times in the first rows, followed by 250,000 instances of “Q2”, and so on for the remaining quarters.
In such a case, which is very common in data warehouses, the engine can avoid storing multiple copies of the same value, instead replacing it with the value and a count. The count indicates how many times the value is repeated. In the next figure, you can see this in a more graphical way:

Figure 5. RLE reduces column storage by replacing repeating values with a count.

Moreover, columns are dictionary encoded. Dictionary encoding means that Tabular builds a dictionary of the values of a column and then replaces column values with the appropriate index in the dictionary. Regardless of the original data type, each column is represented as an integer value and Tabular converts to and from the original value by using the dictionary. Furthermore, Tabular uses the minimum number of necessary bits to represent each value. In fact, the number of bits used depends on the size of the dictionary and, ultimately, on the number of distinct values of a column. After encoding, the dictionary-encoded column is compressed. The next figure shows the full flow of column compression in a Tabular database, using the xVelocity data structure:

Figure 6. Dictionary encoding transforms values into dictionary indexes, before compression is applied.

To be honest, this is a simplification of what happens under the covers, yet it is enough to get a solid understanding of how to optimize a data model in Tabular. In short:

- Tabular stores data column-by-column.
- Each column has two structures: a dictionary and data segments. We speak about “segments” because a large table is split into segments of 8 million rows each (by default; this is configurable).
- Dictionary size depends on the values of the column, but it is often a small data structure and it is not worth our attention.
- The number and size of the segments depend on the number of distinct values of the column (the fewer the values, the fewer bits used for each entry) and on the number of rows in the table.

You can see that there is a big difference between the logical and physical representation of data, big enough that estimating memory consumption using a simple formula is nearly impossible. It is much better to estimate the memory requirements by running trials. To get initial estimates, we populated the data model with 10 million rows and, by using Kasper de Jonge’s Excel workbook (http://www.powerpivotblog.nl/what-is-using-all-that-memory-on-my-analysisserver-instance/), we evaluated the size of both dictionaries and segments. For a large model, the dictionary size does not change much over time and is negligible compared against the size of the segments (the physical data structure used for storing tabular data). After that, we repeated the measurements for 100, 200, and 500 million, and 1 billion rows. By connecting the dots, it was easy to draw a growth curve used to predict the maximum number of rows we could squeeze onto a server over a five-year horizon.

One of the most interesting insights that comes from analyzing the memory consumption of a Tabular data model is that you oftentimes see that most of the memory is consumed by a handful of columns, typically those with the highest number of distinct values. However, counting distinct values is only one part of the story. A column whose values are relatively static from row to row can use much less RAM than a column with fewer distinct values that change very often. In fact, you need to remember that although the dictionary is global to the table (i.e.
there is a single dictionary for the whole table), within each segment, Tabular uses the minimum number of bits needed to represent the corresponding values of the column in that segment. Thus, if a column has 1,000 values globally, but only two values in each segment, it will compress to a single bit per entry in each segment. Moreover, if it does not change very often (think of a date, for example), then compression will work at its best, reducing the memory footprint of the column. On the other hand, a column with 1,000 values all represented within each segment (a time column is a good example) will need 10 bits for each value and, having many different values within each segment, it is likely that RLE compression will not reduce the data storage requirements in any significant way.

Sometimes taking a closer look at the degree of precision actually needed for meaningful analysis opens up new possibilities for lowering memory consumption. A good example of this optimization was the duration of the contact states. In the OLTP and DWH databases, the duration of each event is stored as an integer number, representing the number of milliseconds for which that event was active. From the analytical point of view, it is useless to store durations at such a high granularity (i.e. milliseconds). Reducing the precision of the duration to hundredths or tenths of a second greatly reduced the memory footprint of the model, by simply reducing the number of distinct values of that column, without sacrificing the precision of the results.

At the end of the optimization process, we figured out that the full database would fit into a few hundred gigabytes. Thus, we were confident that fitting Tabular into memory was not a big issue. At that point, we were not too worried about query speed, but we had to learn some lessons there, too.

Query Speed

Measuring query speed was not an easy task. The biggest issue with Tabular is that query speed depends on many factors, most of which are not easy to grasp at first sight. The fundamental rule is very easy, indeed: the smaller the amount of RAM to scan, the faster the query will be. Unfortunately, this simple rule has many different and subtle variations that often prompt further study. Without going into too much detail, these were the most challenging aspects:

- Ensure that distinct counts are fast enough on large datasets. As a general rule, we consider a query that lasts for more than two seconds to be slow.
- Verify that the most complex metrics can be computed on fairly large queries. In fact, we had nearly 300 different measures, some of which create very complex filters (e.g., the number of contacts dropped in the time frame having a cumulative wait time longer than 20 seconds, compared to any previous time frame).
- Check whether anything can be pre-computed in the ETL phase, to speed up the queries.

Distinct counts in Tabular are fast, under the right conditions. In fact, we were able to achieve very good results with columns of low cardinality (a few hundred values), but performance began to suffer with higher cardinality columns. The problem was that the main distinct count measure needed to be computed on a column that has many distinct values, in the range of hundreds of millions.

NOTE: Distinct counts on columns with many values improved performance thanks to an enhancement introduced in cumulative update 9 for SQL Server 2012 SP1.
Find more information at the following link: http://support.microsoft.com/kb/2927844

The column on which we had to perform the distinct count was the contact id (i.e. a unique id assigned to each contact). It has many distinct values because it needs to identify a contact. However, it turned out that this column, present only in the fact table for counting, can be re-arranged by splitting it into a business unit id and a new contact id, which is unique inside the business unit. You might recall that all analysis is conducted at the business unit level, so by renumbering the contact id we did not reduce the expressiveness of the system: we simply reduced the number of distinct values of the column. Another design decision that proved beneficial was to partition the solution by building a different database for each business unit. As you will learn later in the paper, this turned out to be the final shape of the solution (one database per business unit), and the idea of doing so emerged when analyzing query performance, while seeking out the best performance for distinct counts.

During query speed analysis, yet another issue surfaced. Some of the more complex metrics required intricate conditions as filters in CALCULATE functions. Those complex conditions required Boolean expressions involving several columns, which turned out to be slow. The solution here was to consolidate the expressions as columns in the model. Instead of writing measures this way:

Measure := CALCULATE ( <Expression>, <Complex Condition> )

we computed the condition as a column at ETL time, allowing us to use a simpler measure:

Condition = <Complex Condition>  -- evaluated once, during ETL, and stored as a column
Measure := CALCULATE ( <Expression>, Fact[Condition] = TRUE )

You can easily appreciate this technique by looking at the following example:

Metrics[System Pending] :=
CALCULATE (
    [Contacts],
    FILTER (
        Fact,
        Fact[LogType] = "A"
            && (
                (
                    RELATED ( AgentStates[AgentState] ) = "LoggedIn"
                        || RELATED ( AgentStates[AgentState] ) = "LoggedOut"
                )
                || (
                    RELATED ( AgentOutStates[OutState] ) <> "HeldPartyAbandon"
                        && RELATED ( AgentOutStates[OutState] ) <> "Refused"
                        && RELATED ( AgentOutStates[IsSystemOutState] ) = TRUE
                )
            )
    )
)

This measure uses a condition that mixes four columns coming from different tables, and it is very expensive to evaluate at query time. Because the condition is a static one, it can be computed during ETL, thus making all the queries much simpler. As an alternative to ETL, we evaluated the usage of calculated columns, but this would have increased the processing time and the database size, offsetting any advantage of using this approach. Because the condition is the result of a Boolean expression, it contains only two values: true or false. Thus, it compresses very well and its memory footprint is negligible. The boost in performance is impressive, because instead of performing the evaluation of the condition at query time (which might involve the scanning and joining of multiple high-cardinality columns), we replaced it with the scan of a very small column.

Another focus of investigation was finding ways to avoid joins at query time. When, for example, we had a dimension with only two columns, a code and a description, we removed the dimension and denormalized the description into the fact table. This small operation reduced the number of joins at query time and consequently increased query speed.
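As a small, hypothetical illustration of this denormalization, using the two-column Media dimension from Figure 2 (the measure names and the "Phone" value are invented; the paper does not name the columns that were actually denormalized):

-- Before: the filter goes through the two-column Media dimension, which costs a
-- join between the dimension and the large fact table at query time.
Phone Contacts := CALCULATE ( [Contacts], Media[Media] = "Phone" )

-- After: the description has been copied into the fact table during ETL, so the
-- same filter scans a small, well-compressed fact column and avoids the join.
Phone Contacts Denormalized := CALCULATE ( [Contacts], Fact[Media] = "Phone" )

Whether the first or the second form wins depends on the data, as the following note points out.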
“Be aware that denormalizing is not always the optimal choice. It really depends on the size of the fact table, the number of rows in the dimension, and the level of compression of the column in the fact table. As a general rule, denormalizing leads to better results, but it is always good to perform some tests. We observed scenarios where filtering the dimension led to much better performance than filtering the fact table.” - Marco Russo, BI Consultant, SQLBI

Scalability

Once we were confident that the query speed was good, we turned our attention to scalability. Although query speeds were fast, the CPUs were all running up to 100% during the query, and we had no clear idea of what would happen when many concurrent users executed different queries at the same time. Although scalability was something to worry about, we did not have sufficient information at this stage of the project to make an educated guess. We knew that, as a last resort, we could always obtain good scalability by splitting the large database into as many as one database per business unit. But to develop a deep understanding of how well the solution would scale, we needed to wait until we could put the database on a test server and measure the performance of concurrent users. Later, we will explore these findings in detail.

Achieving Near Real-Time Processing

Building a near real-time (NRT) analytical engine is one of the more complex and fascinating challenges in the BI space because you need to solve many issues, some of which are not so evident. Consider the following:

- Incomplete contacts. If we decided to go for a five-minute NRT, we knew that many contacts would be processed by the system before they were completed. In fact, a simple phone call might last for several minutes.
- Partitioning. Both the data warehouse and the SSAS model needed partitioning, and we wanted to find the optimal partitioning schema to minimize reprocessing of large partitions.
- Processing strategy. Using SSAS Tabular, we needed to define an optimal processing strategy. There are two issues with processing: cache invalidation and long-running queries, both of which require thoughtful consideration.

Let’s dig a little deeper into each of these issues.

Incomplete Contacts

In a perfect world, once data enters the data warehouse, it stays there forever, untouched. This is seldom true, for many reasons (e.g. bad data, or structural updates to the data warehouse itself). For NRT data warehouses, the need for updates is even more common. Data is flowing constantly into the data warehouse, increasing the chances of an update in the source data. For inContact, there were several reasons why an incomplete contact might flow into the data warehouse:

- A contact not yet closed when the ETL process starts is, by definition, incomplete.
- Network latency. Remember that the geographical dispersion of the servers means a huge latency for updates.
- Failed ETL processes. Something can always go wrong, for a multitude of reasons. In such cases, incomplete (or even wrong) contact information could enter the system.

For concerns such as these, the typical solution is to avoid worrying too much about the problem. Really. If you plan partitioning at the day level, for example, by fully processing a partition at the end of the day, you automatically solve the problem of incomplete information. In fact, if a contact is incomplete at 12:00, it will probably be complete at 12:15. Moreover, when reprocessing the full partition, the problem will fade away completely.
Of course, a contact can last for several weeks (think, for example, of contacts handled by email), and the chances of having incomplete contacts are very high if we reprocess every few minutes. Reprocessing weeks of data was not an option, due to the sheer volume of data. The solution was to keep all records, complete or not, in daily partitions. We reprocess the last 24 to 48 hours of data every fifteen minutes. If an incomplete record arrives outside that window, it will remain incomplete until the weekly reprocess. It takes a few days, but eventually the incomplete records sort themselves out.

“An alternative solution that we considered, but did not implement, was to move incomplete contacts into a partition by themselves. That partition would always be a small one that could be quickly reprocessed. Thus, the model would contain partitions with complete contacts, plus an additional one with incomplete contacts. When loading new data, we would only have to reprocess the last partition, plus the incomplete one. With this design, the best partition key is not the starting date of a contact, but its closing date. A contact is created, lives for a while in the incomplete partition, and finally settles into the final partition of its closing date. Once it is there, it is there forever, because a closed contact, by definition, never changes. This way you only need to reprocess two partitions: the last day (i.e. contacts closed today) and the incomplete contacts. This process is simpler and more predictable, because we know for sure how many partitions need reprocessing.” - Alberto Ferrari, BI Consultant, SQLBI

Partitioning

For various reasons, inContact chose to partition based on the start date of each individual contact state (rather than the contact as a whole). So if states 1-5 occurred on 3/31 and states 6-10 on 4/1, states 1-5 will be stored in the 20140331 partition, while states 6-10 will be stored in the 20140401 partition. States 1-5 will be available for query (possibly in an inferred state) before the contact has closed. This approach creates more data to process, but the increased atomicity means each part of the contact communication resolves sooner, as the contact transitions from state to state.

Given how we needed to use partitions, the architecture of Tabular proved well-suited to the task. In Tabular models, partitions are not used to speed up queries, but rather for data management only. By way of contrast, Multidimensional models are often partitioned by date so that a query using data for a specific period only needs to read the relevant information from disk (assuming that you choose a partition key that will be useful at query time). In Tabular, because everything is already in memory, a query always scans all of the partitions, giving you greater freedom in choosing the partition strategy.

Processing Strategy

Once partitioning was decided, the processing strategy looked very clear: for each data refresh interval, we need to reprocess the daily partition to pick up changes in contact state. However, SSAS has two issues with reprocessing that had to be taken into account: cache invalidation and long-running queries. Cache invalidation is easy to understand, but impossible to solve, at least in the context of NRT. Recall that the full cache of the database is invalidated when you reprocess even a single partition.
The reason for this is that cached data might not be valid after processing, hence the need to invalidate and clear it out after each processing operation. Once the cache is gone, performance will suffer until the engine rebuilds the cache. That said, cache invalidation tends not to be a big issue, as it only creates minor performance issues for a limited amount of time. Long-running queries, on the other hand, are harder both to understand and to solve. To understand why long-running queries are a problem, recall what happens when you process a table (or a partition) in SSAS:

1. The partition is processed and, while processing, the system uses the old partition (which is still in memory) to answer all of the queries.
2. When the processing completes, the old partition is switched out and the new partition is switched in. From this moment on, the queries will find the new data and return updated information.

The switch-in operation is very quick because it does not need to move data: it is only the metadata that switches. The problem is that for the switch-in to happen, no query should be running. Otherwise, we might replace data while a query is potentially scanning it. SSAS handles this by waiting for all the queries to stop before performing the switch-in. Obviously, SSAS cannot wait indefinitely for the system to become idle before performing the switch-in. Thus, it puts a lock request into the queue, waiting to acquire the lock when the system is idle. In order to avoid starvation, SSAS does not start new queries if such a lock is pending, and will grant the lock only after all the queries currently running have finished their execution. During that period, queries are accepted and put into a queue. They will resume as soon as the lock is released.

“It is important to note that the lock created by the switch-in after processing is an instance-wide lock. This means that the lock is put on the server (or instance, if many instances of SSAS are running on the same server). Thus, processing a database for a single business unit forces a lock on all the databases on the same instance. In the end, we went for a database per single business unit, so as to make the handling easier and the queries faster but, because of the instance-wide lock, splitting a large database into smaller ones does not help here. In reality, it makes the scenario slightly worse, because one lock will be created for each database processed, thus increasing the number of locks per time frame.” - Marco Russo, BI Consultant, SQLBI

The next figure shows, graphically, what happens when a processing operation ends.

Figure 7. Long-running queries can create bottlenecks in the normal query processing of SSAS.

At the beginning (leftmost part of the chart), you can see that three queries are being executed and, in the meantime, the server is processing a new partition. At (1), processing finishes and the system requests a lock. The long-running query is not yet finished, so the lock stays pending until the query ends. During the time between (1) and (2), new queries are accepted but not executed, due to the lock that is waiting. The only active query is the long-running query. When the long-running query ends, the lock is granted, the new partition is switched in, and normal query processing can start again. Unfortunately, there are several negative side-effects that emerge from this behavior:

- Processing appears to be slower than it actually is, because the system waits for the lock before switching in the updated metadata.
- While the system is waiting for the lock, new queries do not start. You can see the gray part of queries 4-6, which is wait time. Thus, queries appear to be slower than they really are.
- When the lock is released, there is an overload on the server, which needs to process all the queued queries at once.

How do we handle the problem of long-running queries and the locking? The answer is: “avoid long-running queries”, i.e. create the database so that it will never need to run a query for more than a few seconds. As simple as it sounds, this is the most efficient answer to long-running queries: avoid them.

How you eliminate long-running queries depends on how the client application constructs them. Excel generates its own queries, whereas reports in SSRS use queries that you’ve already tuned. As an additional option, you can configure, in SSAS, a timeout after which the lock will be granted, killing any long-running query that is still running. Setting the timeout is usually needed because, even if you can control queries running from predefined reports in SSRS, there is no way to predict what end users will do with Excel PivotTables. A timeout prevents long-running queries from degrading overall server performance and thus affecting other users.

“You’ll never completely avoid long-running queries because people are always going to ask “what if” questions. But, as soon as you recognize the business value in a long-running query, you can permanently answer the question at an earlier step in ETL, so that the results are no longer long-running. A long-running query is a kind of “technical debt”: an efficiency waiting to be realized.” - Owen Graupman, inContact

After all these considerations, the pressure to make queries run very fast was very high. We simply knew that the response time needed to be awesome. This means not only designing the best data model, but also finding the perfect hardware on which to run the system.

Evaluating Hardware Options

Choosing a good server for Tabular is not an easy task. From the outset, we knew that Tabular needs a very fast CPU, at speeds not normally found on servers, as well as very fast RAM. The criteria to evaluate included:

- CPU speed. The faster the CPU, the better the performance. Tabular does not need lots of cores; it just needs fast ones.
- CPU cache. Although memory is fast, CPU cache is still much faster. Tabular has numerous cache optimizations, so identifying a system that offered the largest cache in the cores was a priority.
- RAM speed. When data structures do not fit in the cache, they reside in RAM. For this reason, RAM speed also made the short list.

“After we built the initial prototype, loaded with one billion rows, we wanted to test the performance of a simple query on different CPU architectures: Intel and AMD. Given the range of CPUs produced by each vendor, it was simply impossible to test them all. Thus, we ran our test queries using the different architectures readily available to us via existing hardware. We borrowed servers from all over the company, even the laptops and workstations belonging to team members. After several trials, the clear winner was a video gaming machine that one guy on the team used at home. That computer outperformed any available server, running twice as fast as the server-class machines we had in house.
After all these considerations, the pressure to make queries run very fast was high. We simply knew that response times needed to be excellent. This means not only designing the best data model, but also finding the right hardware on which to run the system.
Evaluating Hardware Options
Choosing a good server for Tabular is not an easy task. From the outset, we knew that Tabular needs a very fast CPU, at speeds not normally found on servers, as well as very fast RAM. The criteria to evaluate included:
- CPU speed. The faster the CPU, the better the performance. Tabular does not need lots of cores; it just needs fast ones.
- CPU cache. Although memory is fast, CPU cache is still much faster. Tabular has numerous cache optimizations, so identifying a system that offered the largest cache in the cores was a priority.
- RAM speed. When data structures do not fit the cache, they reside in RAM. For this reason, RAM speed also made the short list.
“After we built the initial prototype, loaded with one billion rows, we wanted to test the performance of a simple query on different CPU architectures: Intel and AMD. Given the range of CPUs produced by each vendor, it was simply impossible to test them all. Thus, we ran our test queries using the different architectures readily available to us via existing hardware. We borrowed servers from all over the company, even the laptops and workstations belonging to team members. After several trials, the clear winner was a video gaming machine that one guy on the team used at home. That computer outperformed any available server, running twice as fast as the server-class machines we had in house. At that point, it was clear that the criteria for choosing the server would have to be expanded a bit, simply because it would have been impossible to convince the boss to build a cluster of gaming machines and trust it to serve our customers. But, honestly, if a business has the flexibility to buy gaming machines (assuming the machines can handle capacity), do this.” - Owen Graupman, inContact
The complexity and ubiquity of NUMA
Most of the servers available on the market with enough expansion to handle terabytes of RAM are based on the Non-Uniform Memory Access (NUMA) architecture. This is a big issue because Tabular is not NUMA aware. But why is NUMA an issue at all? To understand, we need to consider the NUMA architecture and how it affects the Tabular engine.
Imagine a single machine with 1 terabyte (TB) of RAM and 64 cores that spend their time scanning RAM to compute values. This is exactly what a Tabular instance does all day long. This architecture has a major bottleneck: the RAM bus. In fact, all the cores would probably spend more time contending for the RAM bus than actually computing values, simply because a single RAM bus has to service all cores simultaneously. Needless to say, with that many cores and that much RAM, it would be better to have more RAM buses.
NUMA solves this problem by effectively adding more RAM buses. A NUMA server is built by connecting together many NUMA nodes, where each node has an equal portion of RAM and a dedicated bus to access that RAM. By way of example, instead of having a single bank of 1 TB of RAM, a NUMA system might split it into four different nodes, where each node has 256 gigabytes (GB) of RAM and 16 cores. It is clear that in such an architecture, contention for the memory bus is reduced because each processor has access to its dedicated RAM bus.
However, NUMA would be incomplete if we only split the hardware into nodes. Four nodes do not make a NUMA server: you need to connect them through another bus and expose all four nodes to the operating system as if they were a single, giant computer. This is exactly what NUMA does: it connects all the nodes together through the NUMA bus and presents them as a single computer to the operating system, even if, internally, it is split into smaller nodes. The next figure shows a sample NUMA architecture with only two nodes:
Figure 8. Simple architecture of a NUMA machine with two nodes.
In this example, we have two nodes. Each node contains 4 CPUs and some local memory. When CPU 1 of node 0 needs to access memory from node 0 (green arrow), it simply goes on the RAM bus and requests access to the memory. However, if it needs to scan RAM located on node 1 (red arrow), it goes through the NUMA bus (indicated as the intersocket connection in Figure 8) and requests node 1 to provide access to node 1's local memory.
The great strength of the NUMA architecture is that you can expand it by simply adding more nodes to the system. Using NUMA, you can build computers with quantities of cores and RAM that are nearly impossible to reach with a standard architecture. Yet, there are two problems in the NUMA architecture:
- The memory speed is non-uniform (hence the name Non-Uniform Memory Access). A CPU accessing memory that is local to its node will be much faster, because the intersocket connection is far slower than a RAM bus.
- If the entire RAM used by a process is on a single node, this architecture does not solve the problem of memory contention. In fact, all the cores would contend for the RAM on a single (possibly remote) node, making the system even slower.
To actually get the best performance from a NUMA machine, software needs to be NUMA aware. Being NUMA aware means that the software tries to minimize use of the NUMA intersocket connection by preferring the local RAM bus. Software applications do this by setting processor affinity on specific nodes, ensuring that the data to be processed is local to the CPU, at least to some extent. For example, if the database of a business unit fits into a single node (as is normally the case), the software can be optimized for NUMA by running the queries on the cores of that node. If a database does not fit into a single node, the software could split the processing among many threads on different nodes, ensuring that each thread accesses only (or at least mainly) local RAM.
As you can see, programmers need to optimize the software to run efficiently on NUMA nodes. You cannot simply run any software on NUMA and expect good performance. The problems of memory access speed are compounded when the software relies heavily on RAM. Unfortunately, not only is Tabular not NUMA aware, it also uses RAM as primary storage. Thus, the marriage between Tabular and NUMA is not an easy one.
All this discussion brings us to one very important point when working with Tabular solutions on NUMA hardware: node affinity. Under Windows, you can force a process to run on a single NUMA node, using different techniques that we demonstrate later. Setting affinity ensures that a process executes on a single node, and that memory allocation happens on that same node. As you will see, setting node affinity gave us good performance on NUMA hardware, even though Tabular itself is not NUMA aware.
Preparing the Test Cases
After running several tests on various CPUs, we narrowed the field to an HP DL560 G8 with 1.5 TB of RAM, on which we were able to mount the kind of CPU that yielded the best performance. This is a NUMA box with 4 NUMA nodes, each mounting 16 cores. In total, this system provides 64 cores and 1.5 TB of RAM. The fun part, understanding how to use it to full advantage, was just starting.
Whenever you work with a NUMA system, it is important to understand how it is designed and what kind of performance you can expect from it. On NUMA machines, the most important aspect is the cost of traversing the intersocket connection to access RAM. To obtain this information, you can use the Coreinfo tool (http://technet.microsoft.com/en-US/sysinternals/cc835722.aspx). Coreinfo is a command-line utility that shows the mapping between logical processors and physical ones, the NUMA node and socket on which they reside, and the caches assigned to each logical processor. Coreinfo provides a lot of information, but the most important piece is the cross-NUMA access cost, that is, the expected performance degradation when using memory on a remote NUMA node instead of the local RAM bus. This information is visible as a matrix:
Figure 9. The cross-NUMA node access cost is shown as a matrix of costs between nodes.
The two main points to take away from this picture are:
- Memory is contiguous per node, meaning that one node has the first 25% of RAM, the second has the next 25%, and so on.
There are no holes, so you can expect nearby RAM addresses to reside on the same node.
- The performance hit of traversing the NUMA connection can be as high as 2.1, meaning that memory access through the NUMA intersocket connection can be twice as slow as same-node access.
Through Coreinfo, you get the approximate cost incurred by each node when accessing memory associated with other nodes.
When it came to deciding how to use this hardware, we tested several configurations:
- One physical machine with no node affinity. We used five SSAS Tabular databases, containing 750 customers in total, on a single SSAS instance. Because we did not specify any processor affinity, the SSAS instance was running on all 64 cores. The drawback of this configuration is that SSAS took a performance hit when accessing memory on another NUMA node.
- One physical machine affinitized per SSAS instance via Windows System Resource Manager (WSRM). We tested two configurations:
  o One Tabular instance hosting 5 databases containing 750 total customers (this option combines multiple customers into 5 different databases). We forced affinity to NUMA node 0 by essentially limiting the SSAS instance to 16 logical processors and 384 GB of memory.
  o Two Tabular instances affinitized respectively to NUMA nodes 0 and 1. The first instance hosted 2 databases and the second the other 3 databases. Each instance of SSAS was limited to 1 NUMA node (16 logical processors and 384 GB of memory).
- Two VMs affinitized to NUMA nodes 0 and 1 respectively via Hyper-V. Each virtual machine had one instance of SSAS, limited to 16 logical processors and 384 GB of memory. The benefit of this configuration over WSRM is that we do not need to worry about modifying the SSAS configuration file (msmdsrv.ini) to account for the difference between what the operating system sees and what the SSAS instance actually gets. In fact, many SSAS settings are dynamic, based on the number of CPUs and the memory on the machine. For example, if SSAS is limited to 384 GB and 16 cores, the default value of a setting such as LowMemoryLimit will be skewed: the default of 65% is calculated from the overall memory of the machine, which SSAS would evaluate as 1,536 GB, setting the low memory limit to roughly 998 GB, far more than SSAS actually has available.
“It is worth noting that, although the final decision was to split the database at the business unit level, we did not have the bandwidth to quickly generate one database per business unit for testing purposes. Thus, we created a representative sample of five test databases of different sizes, in order to understand the impact of database size on query performance.” - Lars Tice, inContact
Setting Node Affinity
We tested two methods of forcing NUMA affinity for the SSAS instances: command-line parameters and WSRM. As a first trial, we forced affinity of the SSAS instance to a single NUMA node using the command-line /affinity parameter, and then we set TotalMemoryLimit, LowMemoryLimit, and VertiPaqMemoryLimit to 20%, 10%, and 20% respectively. We modified the memory limits in the msmdsrv.ini file because, when the instance is bound to one NUMA node, it is limited to the 384 GB of that node, and we needed to make sure SSAS did not try to allocate more than it actually had. Using Start with the /affinity flag is more of a hack and requires the user starting up SSAS to stay logged into the system. For this reason, we do not recommend using this option, although it was fine for testing (a sketch of the approach follows).
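The following is a minimal sketch of that first, command-line approach, assuming a default Tabular instance installed under MSAS11.TABULAR; the paths, the affinity mask (hexadecimal FFFF covers the 16 logical processors of node 0), and the exact msmdsrv.exe switches are illustrative and should be verified against your own installation.

  :: Confirm which logical processors belong to which NUMA node
  :: (Coreinfo with no switches dumps everything, including the NUMA node map and cross-node access costs).
  coreinfo

  :: Start SSAS on NUMA node 0 only; /NODE and /AFFINITY require Windows Server 2008 R2 or later.
  start /NODE 0 /AFFINITY FFFF "SSAS Tabular" ^
    "C:\Program Files\Microsoft SQL Server\MSAS11.TABULAR\OLAP\bin\msmdsrv.exe" ^
    -c -s "C:\Program Files\Microsoft SQL Server\MSAS11.TABULAR\OLAP\Config"

The corresponding memory limits in msmdsrv.ini might then look like the fragment below; values under 100 are interpreted as percentages of the total server RAM, which is why they must be lowered when the instance can only use one 384 GB node.

  <Memory>
    <LowMemoryLimit>10</LowMemoryLimit>
    <TotalMemoryLimit>20</TotalMemoryLimit>
    <VertiPaqMemoryLimit>20</VertiPaqMemoryLimit>
  </Memory>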
We would have liked an option in the SSAS configuration, similar to <GroupAffinity>, that lets us specify a bitmask to determine which CPUs in a processor group are used for the thread pools. However, GroupAffinity is not enabled for the VertiPaq thread pools, so we could not use that option.
The second method we used to force NUMA affinity for SSAS instances was Windows System Resource Manager (WSRM). This appears to be the best per-instance (non-virtualized) option; however, the tool itself is deprecated, and you still need to modify the msmdsrv.ini file as described above. The tool is not that intuitive, so we give some instructions on how to use it. To use WSRM, you first need to add the feature from the Add Roles and Features Wizard. Once you have installed WSRM, you then need to create a Resource Allocation Policy and Process Matching Criteria to force SSAS onto one NUMA node. WSRM is listed as deprecated on MSDN; however, we didn’t find any compelling reason not to try it, and it appears to be the only way to force processor affinity on SSAS outside of virtualization. You can find detailed instructions here: http://technet.microsoft.com/en-us/library/cc753939.aspx
Before using WSRM, you need to identify which processors are assigned to which NUMA node. Use Coreinfo to find out.
Figure 10. Coreinfo shows the association between cores and nodes.
Once the feature is installed, you can open WSRM:
Figure 11. WSRM listed on the Start screen.
At this point, you can create a new Resource Allocation Policy for the processors assigned to the NUMA node that the SSAS instance runs under. In the screenshot below, notice that the first instance of SSAS runs on NUMA node 0: processors 0–15 (there are 16 processors per NUMA node; the configuration is 0-based, so make sure you don’t choose 1–16). We specified that the second instance run under processors 16–31 (NUMA node 1).
Figure 12. The WSRM configuration we used for testing.
There is another alternative for forcing NUMA node affinity on SSAS: Hyper-V VMs affinitized to NUMA nodes. Using Hyper-V appealed to us because we wouldn’t have to reconfigure the msmdsrv.ini settings to limit the SSAS process to one NUMA node. We added this scenario to our range of tests. In Hyper-V Manager, we created one VM for each instance of SSAS.
Figure 13. Using Hyper-V, we created one VM per SSAS instance.
Then, in the settings of the VM, in the NUMA section under Processors (an option that is visible only if you are running Hyper-V on top of a NUMA machine), we set the NUMA configuration to map each VM to a NUMA node.
Figure 14. The NUMA options in Hyper-V are enabled when you run on a NUMA machine.
We calculated how much memory to set as the maximum based on the total memory available per NUMA node, and we set Maximum amount of memory and Maximum NUMA nodes accordingly, as you can see in the following figure.
Figure 15. Settings for a single virtual machine in Hyper-V.
In addition to setting NUMA affinity per VM, you can also change a setting at the host level: unchecking the option “Allow virtual machines to span physical NUMA nodes” should, in theory, achieve the same behavior.
Figure 16. NUMA spanning can be configured at the host level too.
Preparing the Test Queries
Another important aspect that we needed to evaluate was building a set of test queries that realistically represented user behavior. Because the system was very new, we had no history of how users would query it.
“As a general rule, having a set of test queries at hand, with a baseline execution time, is very important for any project. When you build a solution, take the time to design a set of test queries and save the results in the project documentation. Moreover, when the project is running, capture the queries executed by users at regular intervals. It will save you a lot of time later, when you need to make design decisions about the architecture.” - Alberto Ferrari, BI Consultant, SQLBI
At first, we thought about writing some DAX queries returning small or large sets of results, but this would not have been a good representation of user activity. In fact, given that the two main client tools were Reporting Services and Excel, we knew that:
- In Reporting Services, we could control the query load by letting our DAX experts write and optimize fixed DAX queries that retrieve data for predefined reports.
- In Excel, users are free to run any kind of query, translated into MDX by Excel (it is worth noting that Excel still uses MDX to query Tabular, not DAX).
Thus, we used SQL Server Profiler to capture the queries executed by Excel during what we believed would be a typical data-browsing workload, and we saved the results into SQL tables.
Measuring results
Once we had some test queries, we ran them on different configurations. The following figure reveals our findings:
Figure 17. Comparison of execution times of some queries in different environments.
Some of the more salient points are as follows:
- The size of the database matters. A query executed against a database of 20 GB runs 20 times slower than the same query executed on a 1 GB database. Of course, this depends on the shape of the query and how well it is written but, because we don’t have full control over how the queries are generated, we saw enough evidence to prefer a smaller database size.
- The very same query, on the same database, runs much slower when executed on the full NUMA box (with no node affinity) than on an affinitized version.
- There is a performance differential between using Hyper-V and WSRM, but it is not huge.
It is worth a brief digression on why using NUMA with no affinity led to such a degradation in performance. On our test machine, we had 64 cores, and the database was small enough to fit in a single node. Thus, you can expect the full database to be in the RAM of a single node. Out of 64 cores, 48 of them need to use the NUMA bus to access the RAM where the database is stored; only 16 cores can access the database through the local bus. Moreover, all 64 cores are accessing the same area of RAM, creating a lot of contention. On the other hand, when we executed the SSAS instance on a single node, only the cores in that node scan the RAM, and they find the data in their own NUMA node. You can see the behavior of a query running on an affinitized instance of SSAS in the following figure:
Figure 18. A query running on a single NUMA node uses the cores of that node only.
You can see that, while the query is running, only the cores in node 1 are busy. To make it more evident, we switched the CPU graph to NUMA nodes (an option that is available only if you are running Windows on a NUMA machine). As a final note, you can see that on a non-affinitized system, the engine takes twice as long to run the queries. That is not by chance: 2x is the price of traversing the intersocket connection versus local RAM.
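Having the captured workload in a SQL table also made it easy to pick out the heaviest statements, such as the two longest-running queries reused in the Azure test below. The query that follows is a minimal sketch against the ProfilerTrace.dbo.Trace1 table used later in this paper; it assumes the trace was saved with a Duration column included, which is an assumption on our part.

  -- Hedged sketch: assumes the saved trace includes a Duration column.
  SELECT TOP (10)
      sessionid,
      Duration,
      CAST(TextData AS VARCHAR(MAX)) AS QueryText
  FROM ProfilerTrace.dbo.Trace1
  WHERE EventSubclass = 0          -- same filter used by the test-generation script below
    AND sessionid IS NOT NULL
  ORDER BY Duration DESC;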
Azure VM Feasibility Testing
As you might recall from the introduction, we were also interested in the feasibility of using servers on Microsoft Azure (Azure VMs) instead of on-premises servers in the private data centers. Due to time constraints, we did limited feasibility testing, trying out an A7 Azure virtual machine (8 cores, 56 GB of memory) to check the performance of Tabular on an Azure VM.
Figure 19. Creation of the A7 machine in Azure.
Once the virtual machine was provisioned and installed with SSAS 2012 Tabular, we ran Coreinfo to see whether we were using NUMA hardware and, if so, what the layout looked like. As you can see from the screenshot below, we have 2 NUMA nodes, so we are likely to see a performance hit similar to on-premises SSAS Tabular with no node affinity:
Figure 20. Coreinfo executed on an A7 machine shows its NUMA architecture.
To check this assumption, we ran the two longest-running queries from our query workloads. As you can see, the performance on the A7 Azure VM is consistent with the DL560G8 when neither has NUMA affinity set for SSAS.

Server                                    Query 2    Query 15
A7 Azure VM (no NUMA affinity)            0:01:00    0:00:29
DL560G8-01 (no NUMA affinity)             0:01:01    0:00:30
DL560G8-01 (NUMA affinity via Hyper-V)    0:00:23    0:00:11

Figure 21. Performance on the A7 machine is consistent with that of on-premises servers with no affinity set.
The takeaway here is that when running in large Windows Azure VMs, we will encounter the same NUMA issues we see on-premises. The difference is that we have no way of addressing the NUMA issue in Azure. Of course, the situation is quickly changing, and over time we can expect it to get better, with further improvements to Azure VMs.
Testing scalability of the system
The final step in testing the solution was to check scalability by simulating the activity of 200 to 500 users concurrently logged in and querying the server. The goal was to check whether the server was able to serve 500 users concurrently. To tell the truth, we decided that 500 users would be a good number for us, but the real question of interest was: “how many concurrent users can we support on a single server?” Knowing the answer to that question would have answered the next one, namely “how many servers do we need to accommodate current and projected demand?”
Preparing the Test Environment
We used VSTS (Visual Studio Team System) with 1 controller and 5 agent machines to simulate the load on different configurations of SSAS Tabular:
Figure 22. The controller, the VSTS agents, and the various machines used during the tests.
We grabbed the test queries from the database where we stored them and generated unit tests grouping together the sets of queries we wanted to execute. We generated each unit test automatically using this SQL script:

DECLARE @sessionid VARCHAR(MAX)
DECLARE @i INT
SET @i = 1

DECLARE session_cursor CURSOR FOR
    SELECT DISTINCT sessionid
    FROM ProfilerTrace.dbo.Trace1
    WHERE EventSubclass = 0
      AND LEN(CAST(TextData AS VARCHAR(MAX))) < 65000
      AND sessionid IS NOT NULL

OPEN session_cursor
FETCH NEXT FROM session_cursor INTO @sessionid

WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT '[TestMethod]'
    PRINT 'public void Session' + CAST(@i AS VARCHAR(3)) + '()'
    PRINT '{'
    PRINT '    string sessionid = "' + @sessionid + '";'
    PRINT '    ExecuteTest(sessionid);'
    PRINT '}'
    SET @i = @i + 1
    FETCH NEXT FROM session_cursor INTO @sessionid
END

CLOSE session_cursor
DEALLOCATE session_cursor

The above SQL generates one test method per captured session, like the one shown in Figure 23; a hedged sketch of what such a method and its helper routines might look like follows.
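Figures 23 through 25 show the actual harness as screenshots. For readers without access to the images, the following is a minimal sketch of how such a harness could be put together with ADOMD.NET and the Visual Studio test framework; the connection strings, the class name, and the exact Trace1 schema are assumptions, and only the ExecuteTest and GetStatements method names come from the paper.

// Hedged sketch of the load-test harness described in Figures 23-25.
// Server names, connection strings, and the Trace1 schema are illustrative assumptions.
using System.Collections.Generic;
using System.Data.SqlClient;
using Microsoft.AnalysisServices.AdomdClient;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class WorkloadTests
{
    // Example of a method emitted by the SQL generator script shown above.
    [TestMethod]
    public void Session1()
    {
        string sessionid = "3F2504E0-4F89-11D3-9A0C-0305E82C3301";   // illustrative value
        ExecuteTest(sessionid);
    }

    // Replays every statement captured for one profiler session against the Tabular instance.
    private void ExecuteTest(string sessionid)
    {
        using (var ssas = new AdomdConnection(@"Data Source=SSASSERVER\TABULAR;Catalog=BusinessUnitDb"))
        {
            ssas.Open();
            foreach (string statement in GetStatements(sessionid))
            {
                using (var cmd = new AdomdCommand(statement, ssas))
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read()) { /* drain the result set to simulate a real client */ }
                }
            }
        }
    }

    // Retrieves from the SQL Server database the set of captured queries for a specific sessionid.
    private IEnumerable<string> GetStatements(string sessionid)
    {
        var statements = new List<string>();
        using (var sql = new SqlConnection("Data Source=SQLSERVER;Initial Catalog=ProfilerTrace;Integrated Security=SSPI"))
        using (var cmd = new SqlCommand(
            "SELECT CAST(TextData AS VARCHAR(MAX)) FROM dbo.Trace1 " +
            "WHERE EventSubclass = 0 AND sessionid = @sessionid", sql))
        {
            cmd.Parameters.AddWithValue("@sessionid", sessionid);
            sql.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    statements.Add(reader.GetString(0));
                }
            }
        }
        return statements;
    }
}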
Figure 23. Sample test method generated by the SQL script.
The sessionid is contained in the queries saved from the SSAS profiler trace. ExecuteTest is a general-purpose method that does the following:
Figure 24. The ExecuteTest method runs a query against the server.
ExecuteTest uses the GetStatements method, which retrieves from the SQL Server database the set of queries for a specific sessionid.
Figure 25. The GetStatements method retrieves the set of queries from the database.
Computing Maximum Number of Users Allowed
Once we had the code ready, it was time to set up the environment and start loading the server in various configurations. We needed to capture SSAS perfmon counters with Visual Studio across several different SSAS instances, on both physical and virtual machines. Adding these counters from our SSAS machines was very important because they are saved and synchronized with each test run in the Load Test Repository. This allowed us to easily examine performance counters for test runs completed in the past.
The following chart shows the results of a load test simulating up to 500 users on a Hyper-V affinitized instance of SSAS.
Figure 26. Performance counters when running up to 500 users on a Hyper-V affinitized VM.
We got up to 370 users, with 247 user sessions and connections, before we started failing with connection errors, and the test was eventually aborted. Thus, a first negative result was that the Hyper-V affinitized machine was not able to support our 500-user target.
The same test on the HP ProLiant DL560 with 4 NUMA nodes, without node affinity, completed without any errors; we ultimately got to 507 connections and hit our goal of 500 maximum users.
Figure 27. Performance counters when running up to 500 users on a physical machine with 4 NUMA nodes.
Lastly, looking at the WSRM affinitized SSAS test runs, we see that we also achieved 500 maximum users with no errors. Unfortunately, due to an issue with the SSAS performance counters on this instance of SSAS, we were unable to capture the perfmon data.
Figure 28. Performance counters when running up to 500 users on SSAS with WSRM affinity.
Thus, it seemed that we needed the full physical machine to run 500 users. However, the number of users is just one side of the story. We were interested not only in the number of users, but also in the performance of the queries. To get a better understanding of the numbers, we exported the database with the results to a Power Pivot data model in Excel 2013 and started analyzing the data with PivotTables. If you look at the average query times of the DL560 test runs and the WSRM affinitized SSAS test runs, you can see that the average test time (average duration of queries) is much better for the WSRM affinitized SSAS instance.
Figure 29. Average response time on WSRM versus the 4 NUMA node machine.
Based on the test results, WSRM affinitized SSAS instances provide more throughput and better query performance than the 4 NUMA node machine without affinity. Hyper-V affinitized SSAS instances, however, do not perform well past 370 users. It appears that affinitizing SSAS using Windows System Resource Manager does indeed provide better performance and throughput; however, you need to keep in mind that WSRM has been deprecated. In addition, you must remember to modify the server configuration properties to account for the fact that the processor and memory usage of the SSAS instance is constrained to the resources available on the underlying node.
End Result: Cost-effective architecture that keeps up with projected growth
Now that you have followed us through the full evaluation process, let’s review the final architecture of the new reporting system that we chose to put into production.
Hardware Design Decisions
In the end, we opted for HP ProLiant BL460c workstation blades instead of the server models. For our computational workloads, the workstation blades outperformed the server-class blades by a wide margin, up to two times the performance in most cases. These systems are NUMA machines. To get the right level of performance on NUMA nodes, we used Hyper-V to create VMs that each run on a single NUMA node, and we installed one instance of SSAS on each node. Ultimately, we decided against WSRM because it is deprecated, and it seemed short-sighted to base a new architecture on deprecated software.
It is worth noting that the system we chose is not a particularly expensive piece of hardware. This further improves the cost-effectiveness of the overall solution, making it easier to fund future expansion.
“Surprisingly, the high-MHz Intel E5 and E3 series (and their corresponding i7 extreme counterparts, which are gaming processors) were hands-down the best processors we tested with. The E5 does outperform the i7, due to its memory bandwidth, but the E3 series was pretty close.” - Owen Graupman, inContact, Inc.
Software Design Decisions
On the software side, we settled on the following design.
- One Tabular data model, designed to support static SSRS reports and ad hoc data exploration in Excel, retrieving data from the transactional OLTP data of a particular business unit.
- One SSAS Tabular database per business unit.
  o Size ranges from a few megabytes to tens of gigabytes, depending on the business unit.
  o Software-controlled XMLA scripts are used to automate provisioning of thousands of business unit databases.
- 15-minute near real-time processing.
  o Reprocessing of the Current Day partition and the Open Contacts partition.
  o Processing of the last partition takes approximately 1 second for a single business unit, while a full process depends on the size of the business unit but falls within the range of 10-15 seconds.
  o A 15-second timeout interval helps avoid user-generated, long-running queries that could potentially deadlock the system.
- Multi-tenant architecture.
  o Multiple SSAS clusters, each composed of two VMs, with each VM running a single instance of SSAS Tabular.
  o Hyper-V is used to set affinity to a single NUMA node.
  o Round-robin load balancing distributes requests between the two nodes of the cluster.
  o Each cluster can handle around 600 simultaneous user queries, for a variable number of business units.
We built a set of tools to simplify database provisioning by using the programmability features of SSAS through XMLA scripts. This approach gives us a quick way to move a business unit database from one machine to another. Moreover, by using XMLA, no human intervention is needed when onboarding new customers: all provisioning is fully automated. To meet our SLA requirements, we limited the databases and users hosted on each VM based on the number of users and their expected activity. If needed, a scale-out strategy is always possible by using database synchronization with other SSAS instances and network load balancing (NLB) for those instances, but at the time we went into production, this step was not necessary.
The multi-tenant architecture is fully elastic to accommodate projected growth.
Large customers can be hosted on a dedicated cluster, while smaller ones can share a cluster. If a customer requires additional power, it is simple to move it to a more powerful cluster, which gives us confidence in our ability to grow in the future. Needless to say, the system is now in production, and inContact customers are really happy with the performance and the analytical power of the Tabular model. And that is the reason why we had the time to write this whitepaper. Really. Happy. Customers.
Best practices and lessons learned
So what did we learn along the way? Here are a few lessons, old and new, to reflect upon.
- Use Hyper-V VMs to set node affinity. Tabular might not be NUMA aware, but it runs great if you install a Tabular instance on a VM that is affinitized to a particular node. This approach has an added benefit: you don’t need to adjust memory configuration settings.
- Distinct count performs better in a Tabular implementation. This is a well-known fact about Tabular, but if you are new to the technology, it’s worth remembering. This is especially true if the alternative you are evaluating is SSAS Multidimensional.
- Consider fast workstations as the hardware platform. Although this approach requires an open mind, choosing a system that offers maximum throughput and CPU cache is almost always the best choice for an SSAS Tabular solution. It’s what inContact chose for their platform.
- Use partitions to design a more efficient processing strategy. Partitioning by close date instead of start date gave us a more predictable processing strategy.
- Use ETL and timeouts to minimize long-running queries. When fast processing is a project requirement, reducing long-running queries is essential. One approach is to make the data available via ETL processes.
- Handle time zone requirements at calculation time, especially if your databases are large. The brute-force approach of adding columns to store DST information is impractical for large-scale solutions. Handle the calculations in the model or data warehouse to avoid unnecessary bloat.
Conclusion
Using SSAS Tabular as the analytical engine allowed inContact to replace a costly component of their solution architecture, with no reduction in customer value in their “pay as you go” contact center cloud service. The Tabular model was straightforward to create, and it easily supports predefined SSRS reports and ad hoc data exploration through Excel.
Using commodity hardware in the right configuration helped the team achieve great performance that meets service level agreements. When set up correctly, SSAS Tabular delivers solid performance on the NUMA systems so widely available in the business server market. Designing a database partitioning and provisioning strategy ensured that databases were correctly sized for the anticipated workloads, and that tools and techniques were in place for rapid response should a database need to be relocated. Scalability and expansion are built into the overall architecture, allowing inContact to incrementally add capacity to meet the computing needs of their ever-growing customer base.
For more information:
http://www.sqlbi.com/: SQLBI Web site
http://www.incontact.com/: inContact Web site
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this paper help you? Please give us your feedback.
Tell us, on a scale of 1 (poor) to 5 (excellent), how you would rate this paper and why you have given it this rating. For example: Are you rating it high because of good examples, excellent screenshots, clear writing, or another reason? Are you rating it low because of poor examples, fuzzy screenshots, or unclear writing? This feedback will help us improve the quality of the white papers we release. Send feedback.