Optimizing Distributed Database Design for Analytics Platform System

Summary: This reference architecture is intended to provide proven practices for designing and implementing a distributed data warehouse with the Microsoft Analytics Platform System. The APS appliance is designed using a shared-nothing, massively parallel processing architecture, capable of loading and processing very large data volumes in an efficient manner. The contents of this paper are derived from a series of real-world deployments of the APS appliance for Microsoft customers.

Writers: Michael Hlobil, Ryan Mich

Technical Reviewers: Charles Feddersen, Mary Long, Brian Mitchell

Published: June 2015

Applies to: Analytics Platform System

Copyright

This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it. Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred. This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

© 2014 Microsoft. All rights reserved.

Contents

Introduction
Introduction to Analytics Platform System
Overview of Analytics Platform System (APS)
Shared Nothing Architecture of APS
APS Components
MPP Engine
Data Movement Service (DMS)
SQL Server Databases
Value Proposition of APS
Handled Scenarios
Greenfield DWH Design
Migration of an Existing DWH
Reference Architecture for a Distributed Database / Datamart Design
Overview of the Design Components
Database
APS Table Types
Clustered Indexes
Non-Clustered Indexes
Clustered Columnstore Index
Partitioning
Capabilities
Recommendations for Partitioning
Data Loading with Partition Switching
Statistics
Recommendations for Statistics
Conclusion

Introduction

In this whitepaper, we will describe a reference architecture for the design of the Analytics Platform System (APS). There are important decisions which have to be made in the early stages of solution design. This white paper provides guidance on these design decisions, potentially reducing the time to market of your data warehouse (DWH) solution.

The following two implementation scenarios are described in this paper:

• Greenfield Data Warehouse implementation
o Design and develop a logical data model and implement the resulting dimension models as tables in APS.
• Migration of an existing Data Warehouse to APS
o Redesign of an existing symmetric multi-processor (SMP) solution, such as SQL Server, to a distributed architecture for APS.

This paper describes a number of design components and considerations for a distributed data warehouse, and provides explanations on why certain design patterns are optimal for this scale-out platform.

Introduction to Analytics Platform System

Overview of Analytics Platform System (APS)

APS is a massively parallel processing (MPP) data warehousing and analytics platform appliance built for processing large volumes of relational and semi-structured data. It provides seamless integration to Hadoop and Azure Blob Storage via PolyBase (an integration layer included in APS), which enables non-relational storage to interoperate with relational storage through a unified language, SQL, and query optimizer.

APS ships in an appliance form factor comprised of factory-built and configured hardware with the APS software preinstalled. One benefit of this appliance model is the ability to quickly add incremental resources that enable near linear performance gains. This is accomplished by adding scale units to the appliance, which consist of servers and storage pre-tuned for optimal performance.

The APS appliance is designed to support two different regions. The mandatory region is the parallel data warehouse (PDW) region, which supports relational storage and is an MPP cluster of SQL Server instances tuned for data warehouse workloads. The optional region supports the Microsoft distribution of Hadoop, known as HDInsight.
This whitepaper focuses exclusively on the design considerations for the MPP RDBMS, the PDW region.

APS is generally used as the central data store for a modern data warehouse architecture. It parallelizes and distributes the processing across multiple SMP (Symmetric Multi-Processing) compute nodes, each running an instance of Microsoft SQL Server. SQL Server Parallel Data Warehouse is only available as part of Microsoft’s Analytics Platform System (APS) appliance.

The implementation of Massively Parallel Processing (MPP) in APS is the coordinated processing of a single task by multiple processors, each working on a different part of the task, with each processor using its own operating system (OS), memory, and disk. The nodes within the MPP appliance communicate with each other using a high-speed InfiniBand network.

Symmetric Multi-Processing (SMP/NUMA) is the primary architecture employed in servers. An SMP architecture is a tightly coupled multiprocessor system, where processors share a single copy of the operating system (OS) and resources that often include a common bus, memory, and an I/O system. The typical single server, with multi-core processors, locally attached storage, and running the Microsoft Windows OS is an example of an SMP server.

Shared Nothing Architecture of APS

APS follows the shared-nothing architecture; each processor has its own set of disks. Data in a table can be “distributed” across nodes, such that each node has a subset of the rows from the table in the database. Each node is then responsible for processing only the rows on its own disks. In addition, every node maintains its own lock table and buffer pool, eliminating the need for complicated locking and software or hardware consistency mechanisms. Because a shared-nothing architecture does not typically suffer from severe bus or resource contention, it can be made to scale massively.

Figure 1: Region PDW

APS Components

MPP Engine

The MPP Engine runs on the control node. It is the brain of the SQL Server Parallel Data Warehouse (PDW) and delivers the Massively Parallel Processing (MPP) capabilities. It generates the parallel query execution plan and coordinates the parallel query execution across the compute nodes. It also stores metadata and configuration data for the PDW region.

Data Movement Service (DMS)

The data movement service (DMS) moves data between compute nodes and between the compute nodes and the control node. It bridges the shared-nothing storage of the individual nodes with operations that need data brought together.

SQL Server Databases

Each compute node runs an instance of SQL Server to process queries and manage user data.

Value Proposition of APS

An APS appliance consists of one or more physical server racks and is designed to be moved into an existing customer datacenter as a complete unit. Buying an APS appliance has the following advantages in the overall data warehouse life cycle:

Reduced project cost / time to market
Project duration is significantly minimized due to higher automation of processes in the appliance. The appliance is preconfigured and ready for use.

Reduced operational cost
Tooling for monitoring, high availability, and failover is available out of the box. The scalability of the appliance is managed by adding resources.

Reduced database administration efforts
For DBAs, APS provides a reduced and documented set of administrative tasks for keeping the solution at peak performance.
Handled Scenarios

The following scenarios are based on the definition of OLAP or a data warehouse (DWH): A data warehouse is a copy of transaction data specifically structured for query and analysis. The two most commonly discussed models when designing a DWH are the Kimball model and the Inmon model. In Kimball’s dimensional design approach, the datamarts supporting reports and analysis are created first. The data sources and rules to populate this dimensional model are then identified. In contrast, Inmon’s approach defines a normalized data model first. The dimensional datamarts to support reports and analysis are created from the data warehouse. In this white paper, the focus will be on the Kimball dimensional model, as this layer exists in both design approaches. The dimensional model is where the most query performance can be gained.

Greenfield DWH Design

For a greenfield data warehouse project, a logical data model is developed for the different subject area data marts in accordance with the building blocks described by Kimball. The resulting dimension model is implemented as tables in APS.

Migration of an Existing DWH

The migration of an existing DWH can be done in the following ways.

1:1 Migration – The existing data model and schema is moved directly to the APS PDW region with little or no change to the model.

Redesign – The data model is redesigned and the application is re-architected, following the SQL Server PDW best practices. This can help reduce overall complexity and make the data warehouse flexible enough to answer any question at any time.

Evolution – The evolutionary approach is an enhancement to the above two, allowing you to take advantage of the new capabilities of the target platform.

In this white paper, we will describe approaches applicable for all migration scenarios.

Reference Architecture for a Distributed Database / Datamart Design

Overview of the Design Components

When optimizing a design for PDW, various components need to be considered. These include databases, tables, indexes, and partitions. We will describe design and runtime recommendations for each of these.

Database

When creating a PDW database, the database is created on each compute node and on the control node of the appliance. The database on the control node is called a shell database because it only holds the metadata of the database and no user data. Data is only stored on the compute nodes. The appliance automatically places the database files on the appropriate disks for optimum performance, and therefore the file specification and other file-related parameters are not needed. The only parameters specified when creating a database are:

REPLICATED_SIZE
The amount of storage in GB allocated for replicated tables. This is the amount of storage allocated on a per compute node basis. For example, if the REPLICATED_SIZE is 10 GB, 10 GB will be allocated on each compute node for replicated tables. If the appliance has eight compute nodes, a total of 80 GB will be allocated for replicated tables (8 * 10 = 80).

DISTRIBUTED_SIZE
The amount of data in GB allocated for distributed tables. This is the amount of data allocated over the entire appliance. Per compute node, the disk space allocated equals the DISTRIBUTED_SIZE divided by the number of compute nodes in the appliance. For example, if the DISTRIBUTED_SIZE is 800 GB and the appliance has eight compute nodes, each compute node has 100 GB allocated for distributed table storage (800 / 8 = 100).

LOG_SIZE
The amount of data in GB allocated for the transaction log. This is the amount of data allocated over the entire appliance. Like the DISTRIBUTED_SIZE parameter, the disk space allocated per compute node equals the LOG_SIZE divided by the number of compute nodes in the appliance.

AUTOGROW
An optional parameter that specifies if the REPLICATED_SIZE, DISTRIBUTED_SIZE, and LOG_SIZE are fixed or if the appliance will automatically increase the disk allocation for the database as needed until all of the physical disk space in the appliance is consumed. The default value for AUTOGROW is OFF.
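Putting these parameters together, the following is a minimal sketch of a CREATE DATABASE statement for an eight-node appliance. The database name and sizes are hypothetical, and the sizes are expressed in GB:

CREATE DATABASE SalesDW
WITH
(
    AUTOGROW = OFF,          -- fixed allocation, as recommended below
    REPLICATED_SIZE = 50,    -- 50 GB per compute node for replicated tables
    DISTRIBUTED_SIZE = 4000, -- 4 TB across the appliance for distributed tables
    LOG_SIZE = 400           -- 400 GB across the appliance for the transaction log
);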
Design Time Recommendations for Databases

Sizing
The size of the database should be based on the calculated sizes of the replicated and distributed tables. In addition, it is recommended that database growth be estimated for two years and included with the initial database size calculation. This is the baseline for the size of the database. Additional space is also needed for the following reasons:

• To make a copy of the largest distributed table if needed (e.g. for testing different distribution columns)
• To have space for at least a few extra partitions if you are using partitioning
• To store a copy of a table during an index rebuild
• To account for deleted rows in a clustered columnstore index (deleted rows are not physically deleted until the index is rebuilt)

These simple guidelines will provide sufficient insight into sizing the database.

Autogrow
It is recommended that the database is created large enough before loading data, and that autogrow is turned off. Frequent database resizing can impact data loading and query performance due to database fragmentation.

APS Table Types

SQL Server Parallel Data Warehouse (PDW) is a Massively Parallel Processing (MPP) appliance which follows the shared-nothing architecture. This means that the data is spread across the compute nodes in order to benefit from the storage and query processing of the MPP architecture (divide and conquer paradigm). SQL Server PDW provides two options to define how the data can be distributed: distributed and replicated tables.

Distributed Table

A distributed table is a table in which all rows have been spread across the SQL Server PDW compute nodes based upon a row hash function. Each row of the table is placed on a single distribution as assigned by a deterministic hash algorithm taking as input the value contained within the defined distribution column. The following diagram depicts how rows would typically be stored within a distributed table.

Figure 2: Distributed Table

Each SQL Server PDW compute node has eight distributions and each distribution is stored on its own set of disks, therefore ring-fencing the I/O resources. Distributed tables are what gives SQL Server PDW the ability to scale out the processing of a query across multiple compute nodes. Each distributed table has one column which is designated as the distribution column. This is the column that SQL Server PDW uses to assign a distributed table row to a distribution.

Design Time Recommendations for Distributed Tables

There are performance considerations for the selection of a distribution column:

• Data skew
• Distinctness
• Types of queries run on the appliance

When selecting a distribution column, one of the main considerations is data skew. Skew occurs when the rows of a distributed table are not spread uniformly across each distribution.
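To make this concrete, the following sketch shows how a distributed table is declared and how row counts per distribution can be checked after loading. The table and column names are hypothetical, and DBCC PDW_SHOWSPACEUSED is assumed to be available on the appliance for inspecting per-distribution row counts:

-- Distribute the fact table on a hash of CustomerId (hypothetical schema).
CREATE TABLE dbo.FactOrders
(
    OrderId      BIGINT NOT NULL,
    CustomerId   INT    NOT NULL,
    OrderDate    INT    NOT NULL,
    SalesAmount  MONEY  NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);

-- After loading, report space used and row counts per distribution to check for skew.
DBCC PDW_SHOWSPACEUSED ("dbo.FactOrders");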
When a query relies on a distributed table which is skewed, even if a smaller distribution completes quickly, you will still need to wait for the queries to finish on the larger distributions. Therefore, a parallel query is only as fast as its slowest distribution, and so it is essential to avoid data skew. Where data skew is unavoidable, PDW can endure 20 - 30% skew between the distributions with minimal impact to the queries. We recommend the following approach to find the optimal column for distribution.

Distribution Key

The first step in designing the data model is to identify the optimal distribution; otherwise, the benefits of the Massively Parallel Processing (MPP) architecture will not be realized. Therefore, always begin with designing distributed tables. Designing the data model with the optimal distribution key minimizes the movement of data between compute nodes and supports the divide and conquer paradigm.

Rules for Identifying the Distribution Column

Selecting a good distribution column is an important aspect of maximizing the benefits of the Massively Parallel Processing (MPP) architecture. This is possibly the most important design choice you will make in SQL Server Parallel Data Warehouse (PDW). The principal criteria you must consider when selecting a distribution column for a table are the following:

• Access
• Distribution
• Volatility

The ideal distribution key is one that meets all three criteria. The reality is that you are often faced with trading one off against the other.

Select a distribution column that is also used within query join conditions – Consider a distribution column that is also one of the join conditions between two distributed tables within the queries being executed. This will improve query performance by removing the need to move data, making the query join execution distribution local, fulfilling the divide and conquer paradigm.

Select a distribution column that is also aggregation compatible – Consider a distribution column that is also commonly used within the GROUP BY clause within the queries being executed. The order in which the GROUP BY statement is written doesn’t matter as long as the distribution column is used. This will improve query performance by removing the need to move data to the control node for a two-step aggregation. Instead, the GROUP BY operation will execute locally on each distribution, fulfilling the divide and conquer paradigm.

Select a distribution column that is frequently used in COUNT DISTINCTs – Consider a distribution column that is commonly used within a COUNT DISTINCT function. This will improve query performance by removing the need to redistribute data via a SHUFFLE MOVE operation, making the query COUNT DISTINCT function execute distribution local.

Select a distribution column that provides an even data distribution – Consider a distribution column that can provide an even number of rows per distribution, therefore balancing the resource requirements. To achieve this, look for a distribution column that provides a large number of distinct values, i.e. at least 10 times the number of table distributions. It is also important to check if the selected column leads to skew. Data skew, as already mentioned above, occurs when the rows of a distributed table are not spread uniformly across all of the distributions.
Select a distribution column that rarely, if ever, changes value – PDW will not allow you to change the value of a distribution column for a given row by using an UPDATE statement, because changing the value of a distribution column for a given row will almost certainly mean the row will be moved to a different distribution. It is therefore recommended that you select a distribution column which rarely requires the value to be modified.

On some occasions, the choice of a distribution column may not be immediately obvious, and certain query patterns may not perform well when the distribution column is chosen using the general guidelines above. In these one-off scenarios, exercising some creativity in the physical database design can lead to performance gains. The following sections illustrate a few examples of these techniques that may not be readily apparent.

Improving Performance with a Redundant Join Column

If you have queries with multiple joins, there is an easy way to increase performance by adding an additional join column. As an example, let’s assume there are three tables A, B, and C. Tables A and B are joined on column X, and tables B and C are joined on column Y. By adding column X to table C and distributing all three tables on column X, you increase performance by making the joins distribution compatible, because all tables are distributed on the same distribution column.

A common example is needing to cascade the distribution column to child tables. For instance, it may make sense in a customer-centric data warehouse to distribute fact tables by customer. There may be an Order Header table and a child Order Detail table. In a traditional normalized schema, the Order Header table would have a relationship to the Customer table, and the Order Detail table would join to the Order Header table using the order number. In this scenario, the join between the header and detail tables is not distribution local. The customer key should be added to the detail table. Even though this column is not necessary from a data modeling perspective, it allows both order tables to share the same distribution column. Joins between Order Header and Order Detail should be on both the order number and the customer key so the join is distribution local. It should be noted that even if two tables have the same distribution column, they explicitly need to be joined on that column for the join to be distribution compatible.

High Cardinality Columns and Distinct Count

If you have queries including aggregations (like distinct count) over different columns having high cardinality, i.e. lots of different values, performance will vary greatly depending on which column is queried. For example, if a table is distributed over column X, a distinct count of column X will execute locally on each distribution, because a given value for column X can only exist in one distribution. The results of these parallel distinct counts will then be summed to calculate the distinct count across the entire table. However, a distinct count on a different column, column Y for example, will perform poorly. In the case of column Y (or any column but the distribution column), significant data movement can occur, slowing query response time. Full parallelism and full performance will only work for the distributed column. As an alternative, the same table can be created twice, with each table adopting a different distribution column. For instance, Table A can be distributed over column X and Table B can be distributed over column Y.
A view consisting of two separate select statements in a union can then be created over these two tables. The view performs the distinct counts for each column based on the separate tables without any performance decrease. The following pseudocode illustrates how this could be implemented for a sample fact table. A set of common surrogate keys is included in each select statement, and the distinct count is performed against the distribution column of each table. This ensures that each select statement executes in parallel across distributions for maximum performance. The other column is set to zero in each select statement. Combining the results of each select in a union followed by an outer sum returns the complete result set, the equivalent of running multiple distinct counts against the original table.

SELECT <surrogate keys>,
       SUM(Count_X) AS Count_X,
       SUM(Count_Y) AS Count_Y
FROM
(
    SELECT <surrogate keys>, COUNT(DISTINCT X) AS Count_X, 0 AS Count_Y
    FROM Table_A
    GROUP BY <surrogate keys>
    UNION ALL
    SELECT <surrogate keys>, 0 AS Count_X, COUNT(DISTINCT Y) AS Count_Y
    FROM Table_B
    GROUP BY <surrogate keys>
) AS X
GROUP BY <surrogate keys>

This code could be implemented in a view that is consumed by a BI tool so the distinct counts can be performed across multiple tables optimized for performance. Distinct counts of additional columns can be calculated by adding more tables to the view as needed, with each table distributed on the column being aggregated. There will be a loading and storage penalty of needing to maintain multiple tables, but query performance will be optimized, which is the end goal of a reporting and analytics solution.

Distribution Columns with NULL Values

In some cases, the ideal distribution column for satisfying queries has null records. All of the rows with a null value for the distribution column would be placed on the same distribution. If the column has a high percentage of nulls, the table will be heavily skewed and cause poor query performance. While the best solution may be to choose a different distribution column, this may introduce other problems. For example, if multiple fact tables all share the same distribution key, joining fact tables will be distribution local. Choosing a different distribution column for one fact table because it contains null records would make joins no longer distribution local, causing data movement.

To address this issue, the table can be split into two physical tables. One table will have the same schema as the original table and be distributed over the nullable column. The second table will have the same schema as well, but will be distributed using a different column. The ETL process for loading the table will need to be modified. All rows having a non-null value for the distribution column will be loaded into the first table. All rows with a null distribution column will go into the second table, which is distributed on a different column. This will result in two tables that are both evenly distributed. Finally, these two separate tables can be reassembled for the user in a view having the name of the original table.

Replicated Table

A replicated table has a complete replica of all data stored on each of the SQL Server PDW compute nodes. Replicating a table onto each compute node provides an additional optimization by removing the need to shuffle data between distributions before performing a join operation against a distributed table. This can be a huge performance boost when joining many smaller tables to a much larger table, as would be the case with a dimensional data model.
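For illustration, a table is declared as replicated with the DISTRIBUTION = REPLICATE option. The following is a minimal sketch with hypothetical names; joins from a distributed fact table to a dimension declared this way can be resolved locally on each compute node without data movement:

-- Small dimension replicated to every compute node (hypothetical schema).
CREATE TABLE dbo.DimProduct
(
    ProductKey   INT           NOT NULL,
    ProductName  NVARCHAR(100) NOT NULL,
    Category     NVARCHAR(50)  NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (ProductKey)
);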
The fact table is distributed on a high cardinality column, while the dimensional tables are replicated onto each compute node.

Figure 3: Replicated Table

The above diagram depicts how a row would be stored within a replicated table. A replicated table is striped across all of the disks assigned to each of the distributions within a compute node. Because you are duplicating all data for each replicated table on each compute node, you will require extra storage, equivalent to the size of a single table multiplied by the number of compute nodes in the appliance. For example, a table containing 100 MB of data on a PDW appliance with 10 compute nodes will require 1 GB of storage.

Design Time Recommendations for Replicated Tables

Replicated tables should be viewed as an optimization technique, much in the same way you would consider indexing or partitioning.

Identifying Candidates for Replicated Tables

Replicating a table makes all joins with the table distribution compatible. The need to perform data movement is removed at the expense of data storage and load performance. The ideal candidate for replicating a table is one that is small in size, changes infrequently, and has been proven to be distribution incompatible. As a rough guideline, a “small” table is less than 5 GB. However, this value is just a starting point. Larger tables can be replicated if it improves query performance. Likewise, smaller tables are sometimes distributed if it helps with data processing.

Figure 4: Sample Dimensional Model

The dimensional model above depicts one fact table and four dimension tables. Within this example, the customer dimension is significantly larger in size than all other dimensions and would benefit from remaining distributed. The “Customer ID” column provides good data distribution (minimal data skew), is part of the join condition between the fact table and the customer dimension creating a distribution compatible join, and is a regular candidate to be within the GROUP BY clause during aggregation. Therefore, selecting “Customer ID” as the distribution column would be a good choice. All other dimensions are small in size, so we have made them replicated, eliminating the need to redistribute the fact table, and thus all data movement, when joining the fact table to any dimension table(s).

Run Time Recommendations for Replicated Tables

When loading new data or updating existing data, you will require far more resources to complete the task than if it were to be executed against a distributed table, because the operation will need to be executed on each compute node. Therefore, it is essential that you take into account the extra overhead when performing ETL/ELT style operations against a replicated table.

The most frequently asked question about replicated tables is “What is the maximum size we should consider for a replicated table?” It is recommended that you keep the use of replicated tables to a small set of data; tables less than 5 GB in size are a good rule of thumb. However, there are scenarios when you may want to consider much larger replicated tables. For these special scenarios, we would recommend you follow one of these approaches to reduce the overall batch resource requirements.

• Maintain two versions of the same table: one distributed, against which to perform all of the ETL/ELT operations, and the second replicated for use by the BI presentation layer.
Once all of the ETL/ELT operations have completed against the distributed version of the table, you would then execute a single CREATE TABLE AS SELECT (CTAS) statement to rebuild a new version of the final replicated table.

• Maintain two tables, both replicated. One “base” table persists all data, minus the current week’s set of changes. The second “delta” table persists the current week’s set of changes. A view is created over the “base” and the “delta” replicated tables which resolves the two sets of data into the final representation. A weekly schedule is required, executing a CTAS statement based upon the view to rebuild a new version of the “base” table having all of the data. Once complete, the “delta” table can be truncated and can begin collecting the following week’s changes.

Clustered Indexes

A clustered index physically orders the rows of the data in the table. If the table is distributed, then the physical ordering of the data pages is applied to each of the distributions individually. If the table is replicated, then the physical ordering of the rows is applied to each copy of the replicated table on each of the compute nodes. In other words, the rows in the physical structure are sorted according to the fields that correspond to the columns used in the index. You can only have one clustered index on a table, because the table cannot be ordered in more than one direction. The following diagram shows what a clustered index table might look like for a table containing city names:

Figure 5: Clustered Index

Design Time Recommendations for Clustered Indexes

When defining a clustered index on a distributed table within PDW, it is not necessary to include the distribution column within the cluster key. The distribution column can be defined to optimize the joins while the clustered index can be defined to optimize filtering. Clustered indexes should be selected over non-clustered indexes when queries commonly return large result sets.

Clustered indexes within PDW provide you with a way to optimize a number of standard online analytical processing (OLAP) type queries, including the following:

Predicate Queries – Data within the table is clustered on the clustering column(s) and an index tree is constructed for direct access to these data pages, therefore clustered indexes will provide the minimal amount of I/O required in which to satisfy the needs of a predicate query. Consider using a clustered index on column(s) commonly used as a valued predicate.

Range Queries – All data within the table is clustered and ordered on the clustering column(s), which provides an efficient method to retrieve data based upon a range query. Therefore, consider using a clustered index on column(s) commonly used within a range predicate, for example, where a given date is between two dates.

Aggregate Queries – All data within the table is clustered and ordered on the clustering column(s), which removes the need for a sort to be performed as part of an aggregation or a COUNT DISTINCT. Therefore, consider using a clustered index on column(s) commonly contained within the GROUP BY clause or COUNT DISTINCT function.

Non-Clustered Indexes

Non-clustered indexes are fully independent of the underlying table and up to 999 can be applied to both heap and clustered index tables. Unlike clustered indexes, a non-clustered index is a completely separate storage structure. On the index leaf page, there are pointers to the data pages.
Design Recommendations for Non-Clustered Indexes

Non-clustered indexes are generally not recommended for use with PDW. Because they are a separate structure from the underlying table, loading a table with non-clustered indexes will be slower because of the additional I/O required to update the indexes. While there are some situations that may benefit from a covering index, better overall performance can usually be obtained using a clustered columnstore index, described in the next section.

Clustered Columnstore Index

Clustered columnstore indexes (CCI) use a technology called xVelocity for the storage, retrieval, and management of data within a columnar data format, which is known as the columnstore. Data is compressed, stored, and managed as a collection of partial columns called segments. Some of the clustered columnstore index data is stored temporarily within a row-store table, known as a delta-store, until it is compressed and moved into the columnstore. The clustered columnstore index operates on both the columnstore and the delta-store when returning the query results.

Capabilities

The following CCI capabilities support an efficient distributed database design:

• Includes all columns in the table and is the method for storing the entire table.
• Can be partitioned.
• Uses the most efficient compression, which is not configurable.
• Does not physically store columns in a sorted order. Instead, it stores data in the order it is loaded to improve compression and performance. Pre-sorting of data can be achieved by creating the clustered columnstore index table from a clustered index table or by importing sorted data. Sorting the data may improve query performance if it reduces the number of segments needing to be scanned for certain query predicates.
• SQL Server PDW moves the data to the correct location (distribution, compute node) before adding it to the physical table structure. For a distributed table, there is one clustered columnstore index for every partition of every distribution. For a replicated table, there is one clustered columnstore index for every partition of the replicated table on every compute node.

Advantages of the Clustered Columnstore Index

SQL Server Parallel Data Warehouse (PDW) takes advantage of the column-based data layout to significantly improve compression rates and query execution time.

• Columns often have similar data across multiple rows, which results in high compression rates. Higher compression rates improve query performance by requiring less total I/O resources and using a smaller in-memory footprint. A smaller in-memory footprint allows SQL Server PDW to perform more query and data operations in-memory.
• Queries often select only a few columns from a table. Less I/O is needed because only the columns needed are read, not the entire row.
• Columnstore allows for more advanced query execution to be performed by processing the columns within batches, which reduces CPU usage.
• Segment elimination reduces I/O – Within each distribution, rows are broken into groups of roughly one million rows, and each column in a group is stored as a segment. Each segment has metadata that stores the minimum and maximum value of the column for that segment. The storage engine checks filter conditions against the metadata. If it can detect that no rows will qualify, then it skips the entire segment without even reading it from disk.

Technical Overview

The following are key terms and concepts that you will need to know in order to better understand how to use clustered columnstore indexes.
Rowgroup – A group of rows that are compressed into columnstore format at the same time. Each column in the rowgroup is compressed and stored separately on the physical media. A rowgroup can have a maximum of 2^20 (1,048,576) rows, nominally one million.

Segment – A segment is the basic storage unit for a columnstore index. It is a group of column values that are compressed and physically stored together on the physical media.

Columnstore – A columnstore is data that is logically organized as a table with rows and columns, physically stored in a columnar data format. The columns are divided into segments and stored as compressed column segments.

Rowstore – A rowstore is data that is organized as rows and columns, and then physically stored in a row-based data format.

Deltastore – A deltastore is a rowstore table that holds rows until the quantity is large enough to be moved into the columnstore. When you perform a bulk load, most of the rows will go directly to the columnstore without passing through the deltastore. Some rows at the end of the bulk load might be too few in number to meet the minimum size of a rowgroup. When this happens, the final rows go to the deltastore instead of the columnstore.

Tuple Mover – Background process which automatically moves data from the “CLOSED” deltastore into compressed column segments of the columnstore. A rowgroup is marked as closed when it contains the maximum number of rows allowed.

Figure 6: Clustered Column Store

Design Time Recommendations for Clustered Columnstore Indexes

The clustered columnstore index is the preferred index type for tables in PDW. There are rare cases where other index types make sense; those cases are out of scope for this document. As previously discussed, clustered columnstore indexes enhance query performance through compression, fetching only needed columns, and segment elimination. All of these factors reduce the I/O needed to answer a query. There are some design considerations for optimizing CCI performance.

The largest compression ratios are obtained when compressing columns of integer and numeric based data types. String data will compress, but not as well. Additionally, strings are stored in a dictionary with a 16 MB size limit. If the strings in a rowgroup exceed this limit, the number of rows in the rowgroup will be reduced. Small rowgroups can reduce the effectiveness of segment elimination, causing more I/O. Finally, joining on string columns is still not efficient. Therefore, strings should be designed out of large fact tables and moved into dimension tables. The CCI will perform better, and it is a good dimensional modeling practice when designing a data warehouse.

Under ideal conditions, a rowgroup will contain one million rows (actually 1,048,576). As data is inserted into a CCI table, it is accumulated in the deltastore portion of the table. When the deltastore reaches one million rows, it is moved into the columnstore by the tuple mover. During a bulk load, such as when using dwloader, the thresholds for a closed rowgroup are different. A batch of more than one hundred thousand rows will be loaded directly into the columnstore as one rowgroup. Batch sizes of less than 100K will be loaded to the deltastore. This means that if a CCI is routinely loaded in batches between 100 thousand and one million rows, the rowgroups will not be adequately sized. Rowgroups can be consolidated by rebuilding the index, but this can be time consuming and resource intensive.
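One way to see whether rowgroups are adequately sized is to query the columnstore metadata and, if consolidation is required, rebuild the affected partition. The following sketch assumes the sys.pdw_nodes_column_store_row_groups catalog view is available on the appliance and uses a hypothetical table name:

-- Inspect rowgroup state and size for a table (hypothetical name FactOrders).
SELECT t.name AS table_name,
       rg.pdw_node_id,
       rg.partition_number,
       rg.state_description,
       rg.total_rows,
       rg.deleted_rows
FROM sys.pdw_nodes_column_store_row_groups AS rg
JOIN sys.pdw_nodes_tables AS nt
    ON rg.object_id = nt.object_id
   AND rg.pdw_node_id = nt.pdw_node_id
JOIN sys.pdw_table_mappings AS tm
    ON nt.name = tm.physical_name
JOIN sys.tables AS t
    ON tm.object_id = t.object_id
WHERE t.name = 'FactOrders'
ORDER BY rg.total_rows;

-- Consolidate undersized rowgroups in a single partition by rebuilding it
-- (resource intensive; schedule during a maintenance window).
ALTER INDEX ALL ON dbo.FactOrders REBUILD PARTITION = 3;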
The problem can be avoided by loading data using a partition switching process as described later in this document. Bulk loading data into a clustered columnstore index can be resource intensive due to the compression that needs to be performed. If a large load or complex query is not granted enough memory or is subject to memory pressure, rowgroups will be trimmed. In this case, it may be necessary to grant the load process more memory by executing it using an account that is a member of a larger resource class.

Columnar storage formats improve performance by reading only the columns required to satisfy a query, which reduces system I/O significantly. A clustered columnstore index may not perform well for queries that select all columns or a large number of columns from a table. If all or most of the columns in a table are requested in a query, a traditional rowstore table may be more appropriate. The query may still benefit from the compression and segment elimination provided by a clustered columnstore index. As always, your specific workload should be tested. It is trivial to create a copy of a table using a CTAS statement to test different table storage formats.

Datatypes and Segment Elimination

To reduce I/O and increase query performance, it is important to know which data types support segment elimination.

Data Type        Min and Max   Predicate Pushdown   Segment Elimination
datetimeoffset   Y             N                    N
datetime2        Y             Y                    Y
datetime         Y             Y                    Y
smalldatetime    Y             Y                    Y
date             Y             Y                    Y
time             Y             Y                    Y
float            Y             Y                    Y
real             Y             Y                    Y
decimal          Y             Y                    Y
money            Y             Y                    Y
smallmoney       Y             Y                    Y
bigint           Y             Y                    Y
int              Y             Y                    Y
smallint         Y             Y                    Y
tinyint          Y             Y                    Y
bit              Y             Y                    Y
nvarchar         Y             N                    N
nchar            Y             N                    N
varchar          Y             N                    N
char             Y             N                    N
varbinary        N             N                    N
binary           N             N                    N

Table 1: Data Types and Elimination support

Partitioning

By using partitioning, tables and indexes are physically divided horizontally, so that groups of data are mapped into individual partitions. Even though the table has been physically divided, the partitioned table is treated as a single logical entity when queries or updates are performed on the data.

Capabilities

Partitioning large tables within SQL Server Parallel Data Warehouse (PDW) can have the following manageability and performance benefits.

• SQL Server PDW automatically manages the placement of data in the proper partitions.
• A partitioned table and its indexes appear as a normal database table with indexes, even though the table might have numerous partitions.
• Partitioned tables support easier and faster data loading, aging, and archiving, using the sliding window approach and partition switching.
• Application queries that are properly filtered on the partition column can perform better by making use of partition elimination and parallelism.
• You can perform maintenance operations on partitions, efficiently targeting a subset of data to defragment a clustered index or rebuild a clustered columnstore index.

SQL Server Parallel Data Warehouse (PDW) simplifies the creation of partitioned tables and indexes, as you no longer need to create a partition scheme and function. SQL Server PDW will automatically generate these and ensure that data is spread across the physical disks efficiently. For distributed tables, table partitions determine how rows are grouped and physically stored within each distribution. This means that data is first moved to the correct distribution before determining the partition in which the row will be physically stored.
SQL Server lets you create a table with a constraint on the partitioning column and then switch that table into a partition. Since PDW does not support constraints, it does not support this type of partition switching method, but instead requires the source table to be partitioned with matching ranges.

Recommendations for Partitioning

For clustered columnstore indexes, every partition will contain a separate columnstore and deltastore. To achieve the best possible performance and maximize compression, it is recommended that you ensure each partition for each distribution is sized so that it contains more than 1 million rows. Be aware that a distributed table is already split into chunks of data (distributions) depending on the number of compute nodes. The following query can be used to check the number of rows in each partition.

SELECT t.name, pnp.index_id, pnp.partition_id, pnp.rows,
       pnp.data_compression_desc, pnp.pdw_node_id
FROM sys.pdw_nodes_partitions AS pnp
JOIN sys.pdw_nodes_tables AS NTables
    ON pnp.object_id = NTables.object_id
   AND pnp.pdw_node_id = NTables.pdw_node_id
JOIN sys.pdw_table_mappings AS TMap
    ON NTables.name = TMap.physical_name
JOIN sys.tables AS t
    ON TMap.object_id = t.object_id
WHERE t.name = <TableName>
ORDER BY t.name, pnp.index_id, pnp.partition_id;

Partitioning Example

The most common choice for partitioning a table is to choose a column tied to a date. For example, a fact table of orders could be partitioned by the date of the order. This helps facilitate data loading using partition switching and improves query performance if the date is used in a filter. As previously mentioned, care should be given to making sure each partition is not too small. While it may be tempting to partition a large table by day, this may not be necessary after distributing it across all distributions in the appliance. Extremely large fact tables can be partitioned by day, but partitioning at a larger grain, such as by month, is much more common. Other common scenarios are to partition by week or year depending on the size of the table as well as the requirements around data loading and archiving.

To partition a table, the WITH clause is added to the CREATE TABLE statement along with the partitioning information. A large fact table would likely use a clustered columnstore index, be a distributed table, and be partitioned. For the orders fact table example, the WITH clause may resemble the following:

WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH (OrderId),
    PARTITION (OrderDate RANGE RIGHT FOR VALUES (20150101, 20150201, 20150301))
)

In the above example, OrderDate is an integer-based key referencing the date of an order. (Note: If the column is a date datatype, specify the dates in single quotes in the form 'YYYY-MM-DD'.) The partition statement creates four partitions, even though only three boundary values are specified. Because the partitioning is defined as RANGE RIGHT, the first partition contains all values less than 20150101. The next two partitions contain data for January and February of 2015 respectively. The last partition includes data from 20150301 and greater. This is illustrated in the following table.

Partition Number   Values
1                  OrderDate < 20150101
2                  20150101 <= OrderDate < 20150201
3                  20150201 <= OrderDate < 20150301
4                  OrderDate >= 20150301

When creating monthly partitions, it is easier to define the boundaries as RANGE RIGHT. If the above values were used with RANGE LEFT, the first day of the month would fall into the previous month’s partition.
This would require the boundary values to be defined using the last day of each month. Because each month has a different number of days, it is much simpler to use the first day of each month and RANGE RIGHT.

Data Loading with Partition Switching

Loading data using a partition switching process allows the data to be processed separately from the existing table. The new data is manipulated while the target table is available to serve user queries. Once the data is processed, it is quickly switched into the target table in a metadata operation. The partition switching process requires data to be staged in a heap table first. A CREATE TABLE AS SELECT (CTAS) statement is then used to rebuild the target partition(s) by combining new data in the stage table with existing data in the final target table that needs to be kept.

As an example, suppose an existing fact table is partitioned by transaction date. Each day, a daily file of new data is provided, which can include multiple days of transactions. In the following diagram, the existing fact table has files from June 5 and 6 that included data for both those transaction dates. A new file for June 7 is then staged into a heap table. This file includes data for transactions on June 7, as well as late arriving transactions from the previous two days. The new fact table data needs to be merged with the existing fact data.

Existing Fact Data
Transaction Date   File Load Date
6/5/2015           6/5/2015
6/5/2015           6/6/2015
6/6/2015           6/6/2015

New Data
Transaction Date   File Load Date
6/5/2015           6/7/2015
6/6/2015           6/7/2015
6/7/2015           6/7/2015

Staged Fact Data – Ready for Partition Switch
Transaction Date   File Load Date
6/5/2015           6/5/2015
6/5/2015           6/6/2015
6/5/2015           6/7/2015
6/6/2015           6/6/2015
6/6/2015           6/7/2015
6/7/2015           6/7/2015

Figure 7: Merging New Data with Existing Data

Merging these two datasets is accomplished in a CTAS statement by unioning the staged data with the existing fact data and specifying a WHERE clause for data that needs to be kept or excluded. Excluding data is done in scenarios where existing data would be updated for some reason. Any data transformations or surrogate key lookups for the staged data can be performed as part of this CTAS statement. Pseudocode for the CTAS process described above is shown here:

CREATE TABLE FACT_CTAS
<Table Geometry>
AS
SELECT * FROM FACT
WHERE <Exclusion Criteria>
UNION ALL
SELECT <columns> FROM FACT_STAGE

Note that if each load has its own separate partition, there is no need to union data, because new data is not being merged into an existing partition with data. Once this new dataset has been created, a partition switching operation can be performed to move the data into the final table. The partition switching process requires three tables: the existing fact table, the table created in the CTAS statement, and a third table to move the old data. The steps for performing the switch process are:

1. Truncate the switch out table.
2. Switch a partition from the existing fact table to the switch out table.
3. Switch the corresponding partition from the CTAS table to the target fact table.

Figure 8: Partition Switching Sequence

This process is then repeated for each partition that was rebuilt.
The code would resemble the following:

-- @SwitchOut and @SwitchIn hold the dynamic ALTER TABLE statements.
-- @PartitionNbr is assigned by the calling loop (as a character value
-- so it can be concatenated into the statements).
DECLARE @SwitchOut NVARCHAR(400);
DECLARE @SwitchIn  NVARCHAR(400);

TRUNCATE TABLE FACT_SwitchOut;

SET @SwitchOut = 'ALTER TABLE FACT SWITCH PARTITION ' + @PartitionNbr
               + ' TO FACT_SwitchOut';
EXEC (@SwitchOut);

SET @SwitchIn = 'ALTER TABLE FACT_CTAS SWITCH PARTITION ' + @PartitionNbr
              + ' TO FACT PARTITION ' + @PartitionNbr;
EXEC (@SwitchIn);

The above code can be executed in a loop by SSIS or a stored procedure. For each execution, a different partition number is assigned to the @PartitionNbr variable. Note that switching partitions uses the partition number as a parameter, not the value of the partitioning column. This means that the new data will need to be mapped to the appropriate partitions. For the typical case of date-based partitioning that uses a smart integer key of the form YYYYMMDD, the following code will accomplish this mapping:

SELECT P.partition_number,
       PF.boundary_value_on_right,
       CAST(LAG(PRV.value, 1, 19000101)
                OVER (ORDER BY ISNULL(PRV.value, 99991231)) AS INT) AS Lower_Boundary_Value,
       CAST(ISNULL(PRV.value, 99991231) AS INT) AS Upper_Boundary_Value
FROM sys.tables T
INNER JOIN sys.indexes I
    ON T.object_id = I.object_id
   AND I.index_id < 2
INNER JOIN sys.partitions P
    ON P.object_id = T.object_id
   AND P.index_id = I.index_id
INNER JOIN sys.partition_schemes PS
    ON PS.data_space_id = I.data_space_id
INNER JOIN sys.partition_functions PF
    ON PS.function_id = PF.function_id
LEFT OUTER JOIN sys.partition_range_values PRV
    ON PRV.function_id = PS.function_id
   AND PRV.boundary_id = P.partition_number
WHERE T.name = <table name>

For large amounts of data, dwloader, combined with this CTAS and partition switching methodology, is the preferred method for loading APS. It offers fast and predictable performance. Additionally, it eliminates the need to perform index maintenance, as the partitions are constantly being rebuilt as part of the loading process. A rebuild of the index to consolidate small rowgroups is not needed during a separate maintenance window.

There are some considerations for using this approach. While partition switching is a virtually instantaneous operation, it does require a brief schema lock on the table being altered. In environments with either long-running queries or very active query patterns, a schema lock may not be able to be obtained, causing the partition switching process to wait. If the ETL processes will not run during business hours, then this should not be a problem.

Statistics

SQL Server Parallel Data Warehouse (PDW) uses a cost-based query optimizer and statistics to generate query execution plans to improve query performance. Up-to-date statistics ensure the most accurate estimates when calculating the cost of data movement and query operations. It is important to create statistics and update the statistics after each data load.

Different Types of Statistics

SQL Server PDW stores two sets of statistics at different levels within the appliance. One set exists at the control node and the other set exists on each of the compute nodes.

Compute Node Statistics

SQL Server PDW stores statistics on each of the compute nodes and uses them to improve query performance for the queries which execute on the compute nodes. Statistics are objects that contain statistical information about the distribution of values in one or more columns of a table. The cost-based query optimizer uses these statistics to estimate the cardinality, or number of rows, in the query result. These cardinality estimates enable the query optimizer to create a high-quality query execution plan.
The query optimizer can use cardinality estimates to decide, for example, whether to select the index seek operator instead of the more resource-intensive index scan operator, improving query performance. Each compute node has the AUTO_CREATE_STATISTICS option set to ON, which causes the query optimizer to create statistics based upon a single column that is referenced within the WHERE or ON clause of a query. AUTO_CREATE_STATISTICS is not configurable; it cannot be disabled. Also note that multi-column statistics are not automatically created.

Control Node Statistics

SQL Server PDW stores statistics on the control node and uses them to minimize the data movement in the distributed query execution plan. These statistics are internal to PDW and are not made available to client applications. In addition, the DMVs report only statistics on the compute nodes and do not report statistics on the control node. To create the control node statistics, PDW merges the statistics from each compute node in the appliance and stores them as a single statistics object on the control node. For distributed tables, PDW creates a statistics object for each of the eight distributions on each of the compute nodes, and one statistics object on the control node for the entire table. For replicated tables, PDW creates a statistics object on each compute node. Because each compute node will contain the same statistics, the control node will only copy the statistics object from one compute node.

Statistics can also be created on an external table. The data for an external table exists outside PDW, either in Hadoop or Azure blob storage. PDW only stores metadata for the table definition. When you create statistics on an external table, SQL Server PDW will first import the required data into a temporary table on PDW so that it can then compute the statistics. The results are stored on the control node. For extremely large external tables, this process can take significant time. It will be much faster to create sampled statistics, which only requires the sampled rows to be imported.

Recommendations for Statistics

Collection of Statistics
Collecting statistics is performed based upon a default sample, custom sample percentage, filter selection, or a full table scan. Increasing the amount of data sampled improves the accuracy of the cardinality estimates at the expense of the amount of time required to calculate the statistics. For large tables, the default sampling can be used as a starting point, whereas FULLSCAN can be used for smaller tables. It is also recommended to perform a FULLSCAN for date (or smart date key) columns that are used in query predicates.

Collection of Multi-Column Statistics
Statistics on all join columns within a single join condition and for all aggregate columns within a GROUP BY clause will provide additional information for the optimizer to generate a better execution plan, such as density vectors. When creating multi-column statistics, the order of the columns in the statistics object does not matter.

Minimum Requirements for Statistics Collection
At a minimum, statistics should be created for each of the following:

• Distribution column
• Partition column
• Clustered index keys
• Non-clustered index keys
• Join columns
• Aggregate columns
• Commonly used predicate columns

Always ensure that statistics do not go stale by updating the statistics automatically as part of the transformation schedule or a separate maintenance schedule, depending on the volatility of the column or columns.
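As a rough sketch of what this looks like in practice, the statements below create single-column and multi-column statistics and then refresh them after a load. The table, column, and statistics names are hypothetical:

-- Single-column statistics on the distribution column, plus a FULLSCAN
-- on the smart date key used in query predicates.
CREATE STATISTICS stat_FactOrders_CustomerId ON dbo.FactOrders (CustomerId);
CREATE STATISTICS stat_FactOrders_OrderDate ON dbo.FactOrders (OrderDate) WITH FULLSCAN;

-- Multi-column statistics covering columns used together in a single join condition.
CREATE STATISTICS stat_FactOrders_CustomerId_OrderId ON dbo.FactOrders (CustomerId, OrderId);

-- Refresh statistics after each data load or on a maintenance schedule.
UPDATE STATISTICS dbo.FactOrders;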
Updating statistics is an important task for the DBA. Some data movement may be necessary to satisfy a query. It is worth mentioning that the PDW engine and Data Movement Service (DMS) do an excellent job of minimizing data movement operations using predicate pushdown, local/global aggregation, and table statistics. It is very important to keep PDW statistics up-to-date so the query optimizer can minimize data movement as much as possible.

If you are using external tables, it is also required to create and maintain statistics to take advantage of predicate pushdown. For example, when an external table is created over a Hadoop data source, statistics will be used to determine if all of the data should be imported from Hadoop and processed in PDW, or if map-reduce jobs should be created to process the data in Hadoop and return the results.

Conclusion

The reference architecture for optimizing distributed database design on the Microsoft Analytics Platform System is intended to provide guidance for designing a distributed database solution with Microsoft APS. While customers can benefit from the shorter time to market of a data warehouse solution implemented on APS, care should be given to make sure the solution is designed to take advantage of the MPP architecture.

A key point in designing for APS is how to spread the data over the appliance. Choosing the correct table type, replicated or distributed, and the correct distribution column for distributed tables is critical for unlocking the performance of the appliance. Additionally, indexes need to be chosen to satisfy query requirements, large tables should be partitioned to support loading and querying, and statistics must be created and maintained so the query optimizer can ensure good query plans are chosen. With careful consideration of the points in this document, APS can provide insight into huge volumes of data without sacrificing data loading or querying performance.

For more information:

http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter