Reference Architecture for a Distributed Database

Optimizing Distributed Database Design for Analytics Platform System
Summary: This reference architecture is intended to provide proven practices for designing and
implementing a distributed data warehouse with the Microsoft Analytics Platform System (APS). The
APS appliance is designed using a shared-nothing, massively parallel processing architecture,
capable of loading and processing very large data volumes in an efficient manner. The contents
of this paper are derived from a series of real-world deployments of the APS appliance for
Microsoft customers.
Writers: Michael Hlobil, Ryan Mich
Technical Reviewers: Charles Feddersen, Mary Long, Brian Mitchell
Published: June 2015
Applies to: Analytics Platform System
Copyright
This document is provided “as-is”. Information and views expressed in this document, including
URL and other Internet Web site references, may change without notice. You bear the risk of
using it.
Some examples depicted herein are provided for illustration only and are fictitious. No real
association or connection is intended or should be inferred.
This document does not provide you with any legal rights to any intellectual property in any
Microsoft product. You may copy and use this document for your internal, reference purposes.
© 2014 Microsoft. All rights reserved.
Contents
Introduction
Introduction to Analytics Platform System
    Overview of Analytics Platform System (APS)
    Shared Nothing Architecture of APS
    APS Components
        MPP Engine
        Data Movement Service (DMS)
        SQL Server Databases
    Value Proposition of APS
    Handled Scenarios
        Greenfield DWH Design
        Migration of an Existing DWH
Reference Architecture for a Distributed Database / Datamart Design
    Overview of the Design Components
        Database
    APS Table Types
    Clustered Indexes
    Non-Clustered Indexes
    Clustered Columnstore Index
    Partitioning
        Capabilities
        Recommendations for Partitioning
        Data Loading with Partition Switching
    Statistics
        Recommendations for Statistics
Conclusion
Introduction
In this whitepaper, we will describe a reference architecture for the design of the Analytics
Platform System (APS). There are important decisions which have to be made in the early
stages of solution design. This white paper provides guidance on these design decisions,
potentially reducing the time to market of your data warehouse (DWH) solution.
The following two implementation scenarios are described in this paper:
• Greenfield Data Warehouse implementation
  o Design and develop a logical data model and implement the resulting dimension models as tables in APS.
• Migration of an existing Data Warehouse to APS
  o Redesign of an existing symmetric multi-processor (SMP) solution, such as SQL Server, to a distributed architecture for APS.
This paper describes a number of design components and considerations for a distributed data
warehouse, and explains why certain design patterns are optimal for this scale-out platform.
Introduction to Analytics Platform System
Overview of Analytics Platform System (APS)
APS is a massively parallel processing (MPP) data warehousing and analytics platform
appliance built for processing large volumes of relational and semi-structured data. It provides
seamless integration with Hadoop and Azure Blob Storage via PolyBase (an integration layer
included in APS), which enables non-relational storage to interoperate with relational storage
through a unified language, SQL, and query optimizer.
APS ships in an appliance form factor comprised of factory built and configured hardware with
the APS software preinstalled. One benefit of this appliance model is the ability to quickly add
incremental resources that enable near linear performance gains. This is accomplished by
adding scale units to the appliance, which consist of servers and storage pre-tuned for optimal
performance.
The APS appliance is designed to support two different regions. The mandatory region is the
parallel data warehouse (PDW) region, which supports relational storage and is an MPP cluster
of SQL Server instances tuned for data warehouse workloads. The optional region supports the Microsoft
distribution of Hadoop, known as HDInsight. This whitepaper focuses exclusively on the design
considerations for the MPP-RDBMS, the PDW region.
APS is generally used as the central data store for a modern data warehouse architecture. It
parallelizes and distributes the processing across multiple SMP (Symmetric-Multi-Processing)
compute nodes, each running an instance of Microsoft SQL Server. SQL Server Parallel Data
Warehouse is only available as part of Microsoft’s Analytics Platform System (APS) appliance.
The implementation of Massively Parallel Processing (MPP) in APS is the coordinated
processing of a single task by multiple processors, each working on a different part of the task
and each using its own operating system (OS), memory, and disk. The nodes within the MPP
appliance communicate with each other over a high-speed InfiniBand network.
Symmetric Multi-Processing (SMP/NUMA) is the primary architecture employed in servers. An
SMP architecture is a tightly coupled multiprocessor system, where processors share a single
copy of the operating system (OS) and resources that often include a common bus, memory,
and an I/O system. The typical single server, with multi-core processors, locally attached
storage, and the Microsoft Windows OS, is an example of an SMP server.
Shared Nothing Architecture of APS
APS follows the shared-nothing architecture; each processor has its own set of disks. Data in a
table can be “distributed” across nodes, such that each node has a subset of the rows from the
table in the database. Each node is then responsible for processing only the rows on its own
disks.
In addition, every node maintains its own lock table and buffer pool, eliminating the need for
complicated locking and software or hardware consistency mechanisms. Because a shared-nothing
architecture does not typically suffer from severe bus or resource contention, it can be made to
scale massively.
Figure 1: Region PDW
APS Components
MPP Engine
The MPP Engine runs on the control node. It is the brain of the SQL Server Parallel Data
Warehouse (PDW) and delivers the Massively Parallel Processing (MPP) capabilities. It
generates the parallel query execution plan and coordinates the parallel query execution across
the compute nodes. It also stores metadata and configuration data for the PDW region.
Data Movement Service (DMS)
The data movement service (DMS) moves data between compute nodes and between the
compute nodes and the control node. It bridges the shared-nothing world of the compute nodes with the shared world, moving data between nodes when a query requires it.
SQL Server Databases
Each compute node runs an instance of SQL Server to process queries and manage user data.
Value Proposition of APS
An APS appliance consists of one or more physical server racks and is designed to be moved
into an existing customer datacenter as a complete unit.
Buying an APS appliance has the following advantages in the overall data warehouse life cycle:

Reduced project cost / time to market
• Project duration is significantly minimized due to the higher automation of processes in the appliance.
• The appliance is preconfigured and ready for use.

Reduced operational cost
• Tooling for monitoring, high availability, and failover is available out of the box.
• The scalability of the appliance is managed by adding resources.

Reduced database administration efforts
• For DBAs, APS provides a reduced and documented set of administrative tasks for keeping the solution at peak performance.
Handled Scenarios
The following scenarios are based on this definition of a data warehouse (DWH):

• A data warehouse is a copy of transaction data specifically structured for query and analysis.

The two most commonly discussed models when designing a DWH are the Kimball model and the
Inmon model. In Kimball’s dimensional design approach, the datamarts supporting reports and
analysis are created first. The data sources and rules to populate this dimensional model are
then identified. In contrast, Inmon’s approach defines a normalized data model first; the
dimensional datamarts to support reports and analysis are created from that data warehouse. In
this white paper, the focus will be on the Kimball dimensional model, as this layer exists in both
design approaches. The dimensional model is where the most query performance can be gained.
Greenfield DWH Design
For a greenfield data warehouse project, a logical data model is developed for the different
subject area data marts in accordance with the building blocks described by Kimball. The resulting
dimension model is implemented as tables in APS.
Migration of an Existing DWH
The migration of an existing DWH can be done in the following ways:

• 1:1 Migration – The existing data model and schema is moved directly to the APS PDW region with little or no change to the model.
• Redesign – The data model is redesigned and the application is re-architected, following the SQL Server PDW best practices. This can help reduce overall complexity and enable the data warehouse to be more flexible and answer any question at any time.
• Evolution – The evolutionary approach is an enhancement of the above two, allowing you to take advantage of the new capabilities of the target platform.
In this white paper, we will describe approaches applicable for all migration scenarios.
Reference Architecture for a Distributed Database / Datamart Design
Overview of the Design Components
When optimizing a design for PDW, various components need to be considered. These include
databases, tables, indexes, and partitions. We will describe design and runtime
recommendations for each of these.
Database
When creating a PDW database, the database is created on each compute node and on the
control node of the appliance. The database on the control node is called a shell database
because it only holds the metadata of the database and no user data. Data is only stored on the
compute nodes. The appliance automatically places the database files on the appropriate disks
for optimum performance, and therefore the file specification and other file related parameters
are not needed. The only parameters specified when creating a database are:
REPLICATED_SIZE
The amount of storage in GB allocated for replicated tables. This is the amount of storage
allocated on a per compute node basis. For example, if the REPLICATED_SIZE is 10 GB, 10
GB will be allocated on each compute node for replicated tables. If the appliance has eight
compute nodes, a total of 80 GB will be allocated for replicated tables (8 * 10 = 80).
DISTRIBUTED_SIZE
The amount of data in GB allocated for distributed tables. This is the amount of data allocated
over the entire appliance. Per compute node, the disk space allocated equals the
DISTRIBUTED_SIZE divided by the number of compute nodes in the appliance. For example, if
the DISTRIBUTED_SIZE is 800 GB and the appliance has eight compute nodes, each compute
node has 100 GB allocated for distributed table storage (800 / 8 = 100).
LOG_SIZE
The amount of data in GB allocated for the transaction log. This is the amount of data allocated
over the entire appliance. Like the DISTRIBUTED_SIZE parameter, the disk space allocated per
compute node equals the LOG_SIZE divided by the number of compute nodes in the appliance.
AUTOGROW
An optional parameter that specifies if the REPLICATED_SIZE, DISTRIBUTED_SIZE, and
LOG_SIZE is fixed or if the appliance will automatically increase the disk allocation for the
database as needed until all of the physical disk space in the appliance is consumed. The
default value for AUTOGROW is OFF.
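As an illustration, the following statement creates a database sized using the parameters above. This is a minimal sketch: the database name is hypothetical, the sizes (expressed in GB) are examples only, and the exact option syntax should be verified against the appliance documentation.

CREATE DATABASE SalesDW
WITH
(
    AUTOGROW = OFF,
    REPLICATED_SIZE = 10,      -- GB allocated per compute node for replicated tables
    DISTRIBUTED_SIZE = 800,    -- GB allocated across the appliance for distributed tables
    LOG_SIZE = 40              -- GB allocated across the appliance for the transaction log
);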
Design Time Recommendations for Databases
Sizing
The size of the database should be based on the calculated sizes of the replicated and
distributed tables. In addition, it is recommended that database growth be estimated for two
years and included with the initial database size calculation. This is the baseline for the size of
the database. Additional space is also needed for the following reasons:




To make a copy of the largest distributed table if needed (e.g. for testing different
distribution columns)
To have space for at least a few extra partitions if you are using partitioning
To store a copy of a table during an index rebuild
To account for deleted rows in a clustered columnstore index (deleted rows are not
physically deleted until the index is rebuilt)
These simple guidelines will provide sufficient insight into sizing the database.
Autogrow
It is recommended that the database is created large enough before loading data, and that
autogrow is turned off. Frequent database resizing can impact data loading and query
performance due to database fragmentation.
APS Table Types
SQL Server Parallel Data Warehouse (PDW) is a Massively Parallel Processing (MPP)
appliance which follows the Shared Nothing Architecture. This means that the data is spread
across the compute nodes in order to benefit from the storage and query processing of the MPP
architecture (divide and conquer paradigm). SQL Server PDW provides two options to define
how the data can be distributed: distributed and replicated tables.
Distributed Table
A distributed table is a table in which all rows have been spread across the SQL Server PDW
compute nodes based upon a row hash function. Each row of the table is placed on a single
distribution as assigned by a deterministic hash algorithm taking as input the value contained
within the defined distribution column. The following diagram depicts how rows would typically
be stored within a distributed table.
Figure 2: Distributed Table
Each SQL Server PDW compute node has eight distributions and each distribution is stored on
its own set of disks, therefore ring-fencing the I/O resources. Distributed tables are what gives
SQL Server PDW the ability to scale out the processing of a query across multiple compute
nodes. Each distributed table has one column which is designated as the distribution column.
This is the column that SQL Server PDW uses to assign a distributed table row to a distribution.
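A minimal sketch of a distributed table definition follows; the table and column names are hypothetical:

CREATE TABLE dbo.FactSales
(
    OrderId       BIGINT        NOT NULL,
    CustomerId    INT           NOT NULL,
    OrderDateKey  INT           NOT NULL,
    SalesAmount   DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (CustomerId),   -- rows are assigned to a distribution by hashing CustomerId
    CLUSTERED COLUMNSTORE INDEX
);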
Design Time Recommendations for Distributed Tables
There are performance considerations for the selection of a distribution column:

• Data Skew
• Distinctness
• Types of queries run on the appliance
One of the most important of these considerations is data skew. Skew occurs when the rows of a
distributed table are not spread uniformly across each distribution. When a query relies on a
skewed distributed table, the smaller distributions complete quickly but must wait for the query
to finish on the larger distributions. A parallel query therefore performs only as fast as its
slowest distribution, so it is essential to avoid data skew. Where some skew is unavoidable, PDW
can tolerate 20 - 30% skew between the distributions with minimal impact to the queries. We
recommend the following approach to find the optimal column for distribution.
Distribution Key
The first step in designing the data model is to identify the optimal distribution key; otherwise, the
benefits of the Massively Parallel Processing (MPP) architecture will not be realized. Therefore,
always begin by designing the distributed tables. Choosing the optimal distribution key minimizes
data movement between compute nodes and supports the divide and conquer paradigm.
Rules for Identifying the Distribution Column
Selecting a good distribution column is an important aspect of maximizing the benefits of the
Massively Parallel Processing (MPP) architecture. This is possibly the most important design
choice you will make in SQL Server Parallel Data Warehouse (PDW). The principal criteria you
must consider when selecting a distribution column for a table are the following:

• Access
• Distribution
• Volatility

The ideal distribution key is one that meets all three criteria. In reality, you are often faced with trading one off against the others.

• Select a distribution column that is also used within query join conditions – Consider a distribution column that is also one of the join conditions between two distributed tables within the queries being executed. This will improve query performance by removing the need to move data, making the query join execution distribution local and fulfilling the divide and conquer paradigm.
• Select a distribution column that is also aggregation compatible – Consider a distribution column that is also commonly used within the GROUP BY clause within the queries being executed. The order in which the GROUP BY columns are written doesn’t matter as long as the distribution column is among them. This will improve query performance by removing the need to move data to the control node for a two-step aggregation. Instead, the GROUP BY operation will execute locally on each distribution, fulfilling the divide and conquer paradigm.
• Select a distribution column that is frequently used in COUNT DISTINCTs – Consider a distribution column that is commonly used within a COUNT DISTINCT function. This will improve query performance by removing the need to redistribute data via a SHUFFLE MOVE operation, making the COUNT DISTINCT function execute distribution local.
• Select a distribution column that provides an even data distribution – Consider a distribution column that can provide an even number of rows per distribution, therefore balancing the resource requirements. To achieve this, look for a distribution column that provides a large number of distinct values, i.e. at least 10 times the number of table distributions. It is also important to check whether the selected column leads to skew. Data skew, as already mentioned above, occurs when the rows of a distributed table are not spread uniformly across all of the distributions.
• Select a distribution column that rarely, if ever, changes value – PDW will not allow you to change the value of a distribution column for a given row using an UPDATE statement, because changing the value of a distribution column for a given row will almost certainly mean the row has to be moved to a different distribution. It is therefore recommended that you select a distribution column which rarely requires the value to be modified.
On some occasions, the choice of a distribution column may not be immediately obvious, and
certain query patterns may not perform well when the distribution column is chosen using the
general guidelines above. In these one-off scenarios, exercising some creativity in the physical
database design can lead to performance gains. The following sections illustrate a few
examples of these techniques that may not be readily apparent.
Improving Performance with a Redundant Join Column
If you have queries with multiple joins, there is an easy way to increase performance by adding
an additional join column. As an example, let’s assume there are three tables A, B, and C. Tables
A and B are joined on column X, and tables B and C are joined on column Y. By adding column X
to table C and distributing all three tables on column X, you increase performance by making the
joins distribution compatible, because all tables are distributed over the same distribution column.
A common example is needing to cascade the distribution column to child tables. For instance,
it may make sense in a customer-centric data warehouse to distribute fact tables by customer.
There may be an Order Header table and a child Order Detail table. In a traditional normalized
schema, the Order Header table would have a relationship to the Customer table, and the Order
Detail table would join to the Order Header table using the order number. In this scenario, the
join between the header and detail tables is not distribution local. The customer key should be
added to the detail table. Even though this column is not necessary from a data modeling
perspective, it allows both order tables to share the same distribution column. Joins between
Order Header and Order Detail should be on both the order number and the customer key so
the join is distribution local. It should be noted that even if two tables have the same distribution
column, they explicitly need to be joined on that column for the join to be distribution compatible.
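The following sketch illustrates the pattern with hypothetical Order Header and Order Detail tables, both distributed on the redundant customer key:

SELECT oh.OrderNumber, SUM(od.LineAmount) AS OrderAmount
FROM OrderHeader AS oh
JOIN OrderDetail AS od
    ON od.OrderNumber = oh.OrderNumber
    AND od.CustomerKey = oh.CustomerKey   -- redundant join column keeps the join distribution local
GROUP BY oh.OrderNumber;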
High Cardinality Columns and Distinct Count
If you have queries including aggregations (like distinct count) over different columns having
high cardinality, i.e. lots of different values, performance will vary greatly depending on which
column is queried. For example, if a table is distributed over column X, a distinct count of
column X will execute locally on each distribution, because a given value for column X can only
exist in one distribution. The results of these parallel distinct counts will then be summed to
calculate the distinct count across the entire table. However, a distinct count on a different
column, column Y for example, will perform poorly. In the case of column Y (or any column but
the distribution column), significant data movement can occur, slowing query response time. Full
parallelism and full performance will only work for the distributed column.
As an alternative, the same table can be created twice, with each table adopting a different
distribution column. For instance, Table A can be distributed over column X and Table B can be
distributed over column Y. A view consisting of two separate select statements in a union can
then be created over these two tables. The view performs the distinct counts for each column
based on the separate tables without any performance decrease.
The following pseudocode illustrates how this could be implemented for a sample fact table. A
set of common surrogate keys is included in each select statement, and the distinct count is
performed against the distribution column of each table. This ensures that each select
statement executes in parallel across distributions for maximum performance. The other column
is set to zero in each select statement. Combining the results of each select in a union, followed
by an outer sum, returns the complete result set, the equivalent of running multiple distinct
counts against the original table.
SELECT SUM(Count_X) AS Count_X, SUM(Count_Y) AS Count_Y
FROM (
    SELECT <surrogate keys>, COUNT(DISTINCT X) AS Count_X, 0 AS Count_Y
    FROM Table_A
    GROUP BY <surrogate keys>
    UNION ALL
    SELECT <surrogate keys>, 0 AS Count_X, COUNT(DISTINCT Y) AS Count_Y
    FROM Table_B
    GROUP BY <surrogate keys>
) X
GROUP BY <surrogate keys>
This code could be implemented in a view that is consumed by a BI tool so the distinct counts
can be performed across multiple tables optimized for performance. Distinct counts of additional
columns can be calculated by adding more tables to the view as needed, with each table
distributed on the column being aggregated. There will be a loading and storage penalty of
needing to maintain multiple tables, but query performance will be optimized, which is the end
goal of a reporting and analytics solution.
Distribution Columns with NULL Values
In some cases, the ideal distribution column for satisfying queries has null records. All of the
rows with a null value for the distribution column would be placed on the same distribution. If the
column has a high percentage of nulls, the table will be heavily skewed and cause poor query
performance. While the best solution may be to choose a different distribution column, this may
introduce other problems. For example, if multiple fact tables all share the same distribution key,
joining fact tables will be distribution local. Choosing a different distribution column for one fact
table because it contains null records would make joins no longer distribution local, causing data
movement.
To address this issue, the table can be split into two physical tables. One table will have the
same schema as the original table and be distributed over the nullable column. The second
table will have the same schema as well, but will be distributed using a different column. The
ETL process for loading the table will need to be modified. All rows having a non-null value for
the distribution column will be loaded into the first table. All rows with a null distribution column
will go into the second table, which is distributed on a different column. This will result in two
tables that are both evenly distributed. Finally, these two separate tables can be reassembled
for the user in a view having the name of the original table.
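The sketch below shows how the two physical tables could be reassembled behind a view carrying the original table name; all object names are hypothetical:

-- dbo.FactSales_NonNull holds rows with a non-null CustomerId and is distributed on CustomerId.
-- dbo.FactSales_Null holds rows with a null CustomerId and is distributed on a different column.
CREATE VIEW dbo.FactSales
AS
SELECT * FROM dbo.FactSales_NonNull
UNION ALL
SELECT * FROM dbo.FactSales_Null;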
Replicated Table
A replicated table has a complete replica of all data stored on each of the SQL Server PDW
compute nodes. Replicating a table onto each compute node provides an additional optimization
by removing the need to shuffle data between distributions before performing a join operation
against a distributed table. This can be a huge performance boost when joining many smaller
tables to a much larger table, as would be the case with a dimensional data model. The fact
table is distributed on a high cardinality column, while the dimensional tables are replicated onto
each compute node.
Figure 3: Replicated Table
The above diagram depicts how a row would be stored within a replicated table. A replicated
table is striped across all of the disks assigned to each of the distributions within a compute
node.
Because you are duplicating all data for each replicated table on each compute node, you will
require extra storage, equivalent to the size of a single table multiplied by the number of
compute nodes in the appliance. For example, a table containing 100MB of data on a PDW
appliance with 10 compute nodes will require 1GB of storage.
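A minimal sketch of a replicated dimension table follows; the table and column names are hypothetical:

CREATE TABLE dbo.DimProduct
(
    ProductKey   INT           NOT NULL,
    ProductName  NVARCHAR(100) NOT NULL,
    Category     NVARCHAR(50)  NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,        -- a full copy of the table is stored on every compute node
    CLUSTERED INDEX (ProductKey)
);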
Design Time Recommendations for Replicated Tables
Replicated tables should be viewed as an optimization technique, much in the same way you
would consider indexing or partitioning.
Identifying Candidates for Replicated Tables
Replicating a table makes all joins with the table distribution compatible. The need to perform
data movement is removed at the expense of data storage and load performance.
The ideal candidate for replicating a table is one that is small in size, changes infrequently, and
has been proven to be distribution incompatible. As a rough guideline, a “small” table is less
than 5GB. However, this value is just a starting point. Larger tables can be replicated if it
improves query performance. Likewise, smaller tables are sometimes distributed if it helps with
data processing.
Figure 4: Sample Dimensional Model
The dimensional model above depicts one fact table and four dimension tables. Within this
example, the customer dimension is significantly larger in size than all other dimensions and
would benefit from remaining distributed. The “Customer ID” column provides good data
distribution (minimal data skew), is part of the join condition between the fact table and the
customer dimension creating a distribution compatible join, and is a regular candidate to be
within the GROUP BY clause during aggregation. Therefore, selecting “Customer ID” as the
distribution column would be a good choice. All other dimensions are small in size, so we have
made them replicated, eliminating the need to redistribute the fact table, and thus all data
movement, when joining the fact table to any dimension table(s).
Run Time Recommendations for Replicated Tables
When loading new data or updating existing data, you will require far more resources to
complete the task than if it were to be executed against a distributed table because the
operation will need to be executed on each compute node. Therefore, it is essential that you
take into account the extra overhead when performing ETL/ELT style operations against a
replicated table.
The most frequently asked question about replicated tables is “What is the maximum size we
should consider for a replicated table?” It is recommended that you limit replicated tables to
small data sets; less than 5GB per table is a good rule of thumb. However,
there are scenarios when you may want to consider much larger replicated tables. For these
special scenarios, we would recommend you follow one of these approaches to reduce the
overall batch resource requirements.
• Maintain two versions of the same table: one distributed, against which all of the ETL/ELT operations are performed, and a second replicated version used by the BI presentation layer. Once all of the ETL/ELT operations have completed against the distributed version of the table, execute a single CREATE TABLE AS SELECT (CTAS) statement to rebuild a new version of the final replicated table (see the sketch after this list).
• Maintain two tables, both replicated. One “base” table persists all data, minus the current week’s set of changes. The second “delta” table persists the current week’s set of changes. A view is created over the “base” and the “delta” replicated tables which resolves the two sets of data into the final representation. A weekly schedule is required, executing a CTAS statement based upon the view to rebuild a new version of the “base” table having all of the data. Once complete, the “delta” table can be truncated and can begin collecting the following week’s changes.
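The first approach could be sketched as follows; the object names are hypothetical, and the rename/drop swap is one possible way to publish the rebuilt table:

-- Rebuild the replicated presentation table from the distributed table used for ETL/ELT.
CREATE TABLE dbo.DimCustomer_New
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (CustomerKey)
)
AS
SELECT * FROM dbo.DimCustomer_Stage;   -- distributed table that received the ETL/ELT changes

-- Swap the new copy in and remove the old one.
RENAME OBJECT dbo.DimCustomer TO DimCustomer_Old;
RENAME OBJECT dbo.DimCustomer_New TO DimCustomer;
DROP TABLE dbo.DimCustomer_Old;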
Clustered Indexes
A clustered index physically orders the rows of the data in the table. If the table is distributed,
then the physical ordering of the data pages is applied to each of the distributions individually. If
the table is replicated, then the physical ordering of the rows is applied to each copy of the
replicated table on each of the compute nodes. In other words, the rows in the physical structure
are sorted according to the fields that correspond to the columns used in the index.
You can only have one clustered index on a table, because the table cannot be ordered in more
than one direction. The following diagram shows what a Clustered Index table might look like for
a table containing city names:
Figure 5: Clustered Index
Design Time Recommendations for Clustered Indexes
When defining a clustered index on a distributed table within PDW, it is not necessary to include
the distribution column within the cluster key. The distribution column can be defined to optimize
the joins while the clustered index can be defined to optimize filtering. Clustered indexes should
be selected over non-clustered indexes when queries commonly return large result sets.
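For example (hypothetical table and columns), the distribution column can target the common join while the clustered index targets a common filter or range predicate:

CREATE TABLE dbo.FactOrders
(
    OrderId       BIGINT NOT NULL,
    CustomerId    INT    NOT NULL,
    OrderDateKey  INT    NOT NULL,
    Amount        DECIMAL(18,2) NULL
)
WITH
(
    DISTRIBUTION = HASH (CustomerId),   -- optimizes joins to tables distributed on CustomerId
    CLUSTERED INDEX (OrderDateKey)      -- optimizes filters and range predicates on the date key
);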
Clustered indexes within PDW provide a way to optimize a number of standard online
analytical processing (OLAP) type queries, including the following:
Predicate Queries
Data within the table is clustered on the clustering column(s) and an index tree is constructed
for direct access to these data pages; therefore, clustered indexes provide the minimal
amount of I/O required to satisfy a predicate query. Consider using a
clustered index on column(s) commonly used as a valued predicate.
Range Queries
All data within the table is clustered and ordered on the clustering column(s), which provides an
efficient method to retrieve data based upon a range query. Therefore, consider using a
clustered index on column(s) commonly used within a range predicate, for example, where a
given date is between two dates.
Aggregate Queries
All data within the table is clustered and ordered on the clustering column(s), which removes the
need for a sort to be performed as part of an aggregation or a COUNT DISTINCT. Therefore,
consider using a clustered index on column(s) commonly contained within the GROUP BY
clause or COUNT DISTINCT function.
Non-Clustered Indexes
Non-clustered indexes are fully independent of the underlying table and up to 999 can be
applied to both heap and clustered index tables. Unlike clustered indexes, a non-clustered index
is a completely separate storage structure. On the index leaf page, there are pointers to the
data pages.
Design Recommendations for Non-Clustered Indexes
Non-clustered indexes are generally not recommended for use with PDW. Because they are a
separate structure from the underlying table, loading a table with non-clustered indexes will be
slower because of the additional I/O required to update the indexes. While there are some
situations that may benefit from a covering index, better overall performance can usually be
obtained using a clustered columnstore index, described in the next section.
Clustered Columnstore Index
Clustered columnstore indexes (CCI) use a technology called xVelocity for the storage, retrieval,
and management of data within a columnar data format, which is known as the columnstore.
Data is compressed, stored, and managed as a collection of partial columns called segments.
Some of the clustered columnstore index data is stored temporarily within a row-store table,
known as a delta-store, until it is compressed and moved into the columnstore. The clustered
columnstore index operates on both the columnstore and the delta-store when returning the
query results.
Capabilities
The following CCI capabilities support an efficient distributed database design:

• Includes all columns in the table and is the method for storing the entire table.
• Can be partitioned.
• Uses the most efficient compression, which is not configurable.
• Does not physically store columns in a sorted order. Instead, it stores data in the order it is loaded to improve compression and performance. Pre-sorting of data can be achieved by creating the clustered columnstore index table from a clustered index table or by importing sorted data. Sorting the data may improve query performance if it reduces the number of segments needing to be scanned for certain query predicates.
• SQL Server PDW moves the data to the correct location (distribution, compute node) before adding it to the physical table structure.
• For a distributed table, there is one clustered columnstore index for every partition of every distribution.
• For a replicated table, there is one clustered columnstore index for every partition of the replicated table on every compute node.
Advantages of the Clustered Columnstore Index
SQL Server Parallel Data Warehouse (PDW) takes advantage of the column based data layout
to significantly improve compression rates and query execution time.

• Columns often have similar data across multiple rows, which results in high compression rates.
• Higher compression rates improve query performance by requiring less total I/O and using a smaller in-memory footprint.
• A smaller in-memory footprint allows SQL Server PDW to perform more query and data operations in-memory.
• Queries often select only a few columns from a table. Less I/O is needed because only the columns needed are read, not the entire row.
• Columnstore allows for more advanced query execution to be performed by processing the columns within batches, which reduces CPU usage.
• Segment elimination reduces I/O – each distribution is broken into rowgroups of up to one million rows, and each column within a rowgroup is stored as a segment. Each segment has metadata that stores the minimum and maximum value of its column. The storage engine checks filter conditions against this metadata; if it can detect that no rows will qualify, it skips the entire segment without even reading it from disk.
Technical Overview
The following are key terms and concepts that you will need to know in order to better
understand how to use clustered columnstore indexes.

• Rowgroup – Group of rows that are compressed into columnstore format at the same time. Each column in the rowgroup is compressed and stored separately on the physical media. A rowgroup can have a maximum of 2^20 (1,048,576) rows, nominally one million.
• Segment – A segment is the basic storage unit for a columnstore index. It is a group of column values that are compressed and physically stored together on the physical media.
• Columnstore – A columnstore is data that is logically organized as a table with rows and columns, and physically stored in a columnar data format. The columns are divided into segments and stored as compressed column segments.
• Rowstore – A rowstore is data that is organized as rows and columns, and then physically stored in a row based data format.
• Deltastore – A deltastore is a rowstore table that holds rows until the quantity is large enough to be moved into the columnstore. When you perform a bulk load, most of the rows will go directly to the columnstore without passing through the deltastore. Some rows at the end of the bulk load might be too few in number to meet the minimum size of a rowgroup. When this happens, the final rows go to the deltastore instead of the columnstore.
• Tuple Mover – Background process which automatically moves data from the “CLOSED” deltastore into compressed column segments of the columnstore. A rowgroup is marked as closed when it contains the maximum number of rows allowed.
Figure 6: Clustered Column Store
Design Time Recommendations for Clustered Columnstore Indexes
The clustered columnstore index is the preferred index type for tables in PDW. There are rare
cases where other index types make sense; those cases are out of scope for this document.
As previously discussed, clustered columnstore indexes enhance query performance through
compression, fetching only needed columns, and segment elimination. All of these factors
reduce the I/O needed to answer a query. There are some design considerations for optimizing
CCI performance.
The largest compression ratios are obtained when compressing columns of integer and numeric
based data types. String data will compress, but not as well. Additionally, strings are stored in a
dictionary with a 16MB size limit. If the strings in a rowgroup exceed this limit, the number of
rows in the rowgroup will be reduced. Small rowgroups can reduce the effectiveness of segment
elimination, causing more I/O. Finally, joining on string columns is still not efficient. Therefore,
strings should be designed out of large fact tables and moved into dimension tables. The CCI
will perform better, and it is a good dimensional modeling practice when designing a data
warehouse.
Under ideal conditions, a rowgroup will contain one million rows (actually 1,048,576). As data is
inserted into a CCI table, it is accumulated in the deltastore portion of the table. When the
deltastore reaches one million rows, it is moved into the columnstore by the tuple mover. During
a bulk load, such as when using dwloader, the thresholds for a closed rowgroup are different. A
batch of more than one hundred thousand rows will be loaded directly into the columnstore as
one rowgroup. Batch sizes of less than 100K will be loaded to the deltastore. This means that if
a CCI is routinely loaded in batches between 100 thousand and one million rows, the rowgroups
will not be adequately sized. Rowgroups can be consolidated by rebuilding the index, but this
can be time consuming and resource intensive. The problem can be avoided by loading data
using a partition switching process as described later in this document.
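Rowgroup sizes can be checked to decide whether such a rebuild is warranted. The following query is a sketch that assumes the sys.pdw_nodes_column_store_row_groups catalog view is available on the appliance; the table name is hypothetical:

SELECT t.name AS table_name, rg.state_description,
    COUNT(*) AS rowgroup_count, AVG(rg.total_rows) AS avg_rows_per_rowgroup
FROM sys.pdw_nodes_column_store_row_groups AS rg
JOIN sys.pdw_nodes_tables AS nt
    ON rg.object_id = nt.object_id
    AND rg.pdw_node_id = nt.pdw_node_id
JOIN sys.pdw_table_mappings AS tm
    ON nt.name = tm.physical_name
JOIN sys.tables AS t
    ON tm.object_id = t.object_id
WHERE t.name = 'FactOrders'
GROUP BY t.name, rg.state_description;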
Bulk loading data into a clustered columnstore index can be resource intensive due to the
compression that needs to be performed. If a large load or complex query is not granted enough
memory or is subject to memory pressure, this will cause rowgroups to be trimmed. In this case,
it may be necessary to grant the load process more memory by executing it using an account
that is a member of a larger resource class.
Columnar storage formats improve performance by reading only the columns required to satisfy
a query, which reduces system I/O significantly. A clustered columnstore index may not perform
well for queries that select all columns or a large number of columns from a table. If all or most
of the columns in a table are requested in a query, a traditional rowstore table may be more
appropriate. The query may still benefit from the compression and segment elimination provided
by a clustered columnstore index. As always, your specific workload should be tested. It is trivial
to create a copy of a table using a CTAS statement to test different table storage formats.
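For example, a rowstore copy of a hypothetical CCI fact table could be created for a side-by-side comparison:

CREATE TABLE dbo.FactOrders_RowStore
WITH
(
    DISTRIBUTION = HASH (CustomerId),
    CLUSTERED INDEX (OrderDateKey)   -- rowstore alternative to the clustered columnstore index
)
AS
SELECT * FROM dbo.FactOrders;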
Datatypes and Segment Elimination
To reduce I/O and increase query performance, it is important to know which data types support
segment elimination. The following table shows, for each data type, whether min/max metadata is
stored, whether predicate pushdown is supported, and whether segment elimination is supported.
Data Type        Min and Max   Predicate Pushdown   Segment Elimination
datetimeoffset   Y             N                    N
datetime2        Y             Y                    Y
datetime         Y             Y                    Y
smalldatetime    Y             Y                    Y
date             Y             Y                    Y
time             Y             Y                    Y
float            Y             Y                    Y
real             Y             Y                    Y
decimal          Y             Y                    Y
money            Y             Y                    Y
smallmoney       Y             Y                    Y
bigint           Y             Y                    Y
int              Y             Y                    Y
smallint         Y             Y                    Y
tinyint          Y             Y                    Y
bit              Y             Y                    Y
nvarchar         Y             N                    N
nchar            Y             N                    N
varchar          Y             N                    N
char             Y             N                    N
varbinary        N             N                    N
binary           N             N                    N

Table 1: Data Types and Elimination support
Partitioning
By using partitioning, tables and indexes are physically divided horizontally, so that groups of
data are mapped into individual partitions. Even though the table has been physically divided,
the partitioned table is treated as a single logical entity when queries or updates are performed
on the data.
Capabilities
Partitioning large tables within SQL Server Parallel Data Warehouse (PDW) can have the
following manageability and performance benefits:

• SQL Server PDW automatically manages the placement of data in the proper partitions.
• A partitioned table and its indexes appear as a normal database table with indexes, even though the table might have numerous partitions.
• Partitioned tables support easier and faster data loading, aging, and archiving, using the sliding window approach and partition switching.
• Application queries that are properly filtered on the partition column can perform better by making use of partition elimination and parallelism.
• You can perform maintenance operations on partitions, efficiently targeting a subset of data to defragment a clustered index or rebuild a clustered columnstore index.
SQL Server Parallel Data Warehouse (PDW) simplifies the creation of partitioned tables and
indexes, as you no longer need to create a partition scheme and function. SQL Server PDW
will automatically generate these and ensure that data is spread across the physical disks
efficiently.
For distributed tables, table partitions determine how rows are grouped and physically stored
within each distribution. This means that a row is first moved to the correct distribution before
determining the partition in which it will be physically stored.
SQL Server lets you create a table with a constraint on the partitioning column and then switch
that table into a partition. Since PDW does not support constraints, this type of partition
switching is not supported; instead, the source table must be partitioned with ranges that match
the target table.
Recommendations for Partitioning
For clustered columnstore indexes, every partition will contain a separate columnstore and
deltastore. To achieve the best possible performance and maximize compression, it is
recommended that you ensure each partition for each distribution is sized so that it contains
more than 1 million rows. Be aware that a distributed table is already split into chunks of data
(distributions), the number of which depends on the number of compute nodes. The following query can be used to
check the number of rows in each partition.
SELECT t.name, pnp.index_id, pnp.partition_id, pnp.rows,
    pnp.data_compression_desc, pnp.pdw_node_id
FROM sys.pdw_nodes_partitions AS pnp
JOIN sys.pdw_nodes_tables AS NTables
    ON pnp.object_id = NTables.object_id
    AND pnp.pdw_node_id = NTables.pdw_node_id
JOIN sys.pdw_table_mappings AS TMap
    ON NTables.name = TMap.physical_name
JOIN sys.tables AS t
    ON TMap.object_id = t.object_id
WHERE t.name = <TableName>
ORDER BY t.name, pnp.index_id, pnp.partition_id;
Partitioning Example
The most common choice for partitioning a table is to choose a column tied to a date. For
example, a fact table of orders could be partitioned by the date of the order. This helps facilitate
data loading using partition switching and improves query performance if the date is used in a
filter. As previously mentioned, care should be given to making sure each partition is not too
small. While it may be tempting to partition a large table by day, this may not be necessary after
distributing it across all distributions in the appliance. Extremely large fact tables can be
partitioned by day, but partitioning at a larger grain, such as by month, is much more common.
Other common scenarios are to partition by week or year depending on the size of the table as
well as the requirements around data loading and archiving.
To partition a table, the WITH clause is added to the CREATE TABLE statement along with the
partitioning information. A large fact table would likely use a clustered columnstore index, be a
distributed table, and be partitioned. For the orders fact table example, the WITH clause may
resemble the following:
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH (OrderId),
    PARTITION (OrderDate RANGE RIGHT FOR VALUES
        (20150101, 20150201, 20150301))
)
In the above example, OrderDate is an integer based key referencing the date of an order.
(Note: If the column is a date datatype, specify the dates in single quotes in the form
'YYYY-MM-DD'.) The partition statement creates four partitions, even though only three boundary
values are specified. Because the partitioning is defined as RANGE RIGHT, the first partition
contains all values less than 20150101. The next two partitions contain data for January and
February of 2015, respectively. The last partition includes data from 20150301 and greater. This
is illustrated in the following table.
Partition Number    Values
1                   OrderDate < 20150101
2                   20150101 <= OrderDate < 20150201
3                   20150201 <= OrderDate < 20150301
4                   OrderDate >= 20150301
When creating monthly partitions, it is easier to define the boundaries as RANGE RIGHT. If the
above values were used with RANGE LEFT, the first day of the month would fall into the
previous month’s partition. This would require the boundary values to be defined using the last
day of each month. Because each month has a different number of days, it is much simpler to
use the first day of each month and RANGE RIGHT.
Data Loading with Partition Switching
Loading data using a partition switching process allows the data to be processed separately
from the existing table. The new data is manipulated while the target table is available to serve
user queries. Once the data is processed, it is quickly switched into the target table in a
metadata operation.
The partition switching process requires data to be staged in a heap table first. A CREATE
TABLE AS SELECT (CTAS) statement is then used to rebuild the target partition(s) by
combining new data in the stage table with existing data in the final target table that needs to be
kept. As an example, suppose an existing fact table is partitioned by transaction date. Each day,
a daily file of new data is provided, which can include multiple days of transactions. In the
following diagram, the existing fact table has files from June 5 and 6 that included data for both
those transaction dates. A new file for June 7 is then staged into a heap table. This file includes
data for transactions on June 7, as well as late arriving transactions from the previous two days.
The new fact table data needs to be merged with the existing fact data.
Existing Fact Data

Transaction Date    File Load Date
6/5/2015            6/5/2015
6/5/2015            6/6/2015
6/6/2015            6/6/2015

New Data

Transaction Date    File Load Date
6/5/2015            6/7/2015
6/6/2015            6/7/2015
6/7/2015            6/7/2015

Staged Fact Data – Ready for Partition Switch

Transaction Date    File Load Date
6/5/2015            6/5/2015
6/5/2015            6/6/2015
6/5/2015            6/7/2015
6/6/2015            6/6/2015
6/6/2015            6/7/2015
6/7/2015            6/7/2015
Figure 7: Merging New Data with Existing Data
Merging these two datasets is accomplished in a CTAS statement by unioning the staged data
with the existing fact data and specifying a WHERE clause for data that needs to be kept or
excluded. Excluding data is done in scenarios where existing data would be updated for some
reason. Any data transformations or surrogate key lookups for the staged data can be
performed as part of this CTAS statement. Pseudocode for the CTAS process described above
is shown here:
CREATE TABLE FACT_CTAS
WITH (<Table Geometry>)
AS
SELECT * FROM FACT
WHERE <Exclusion Criteria>
UNION ALL
SELECT <columns>
FROM FACT_STAGE
Note that if each load has its own separate partition, there is no need to union data because
new data is not being merged into an existing partition with data.
Once this new dataset has been created, a partition switching operation can be performed to
move the data into the final table. The partition switching process requires three tables: the
existing fact table, the table created in the CTAS statement, and a third table to move the old
data. The steps for performing the switch process are:
1. Truncate the switch out table.
2. Switch a partition from the existing fact table to the switch out table.
3. Switch the corresponding partition from the CTAS table to the target fact table.
Figure 8: Partition Switching Sequence
This process is then repeated for each partition that was rebuilt. The code would resemble the
following:
DECLARE @PartitionNbr INT = 2;   -- a different partition number is assigned on each iteration
DECLARE @SwitchOut NVARCHAR(400), @SwitchIn NVARCHAR(400);
TRUNCATE TABLE FACT_SwitchOut;
SET @SwitchOut = 'ALTER TABLE FACT SWITCH PARTITION '
    + CAST(@PartitionNbr AS NVARCHAR(10)) + ' TO FACT_SwitchOut';
EXEC (@SwitchOut);
SET @SwitchIn = 'ALTER TABLE FACT_CTAS SWITCH PARTITION '
    + CAST(@PartitionNbr AS NVARCHAR(10)) + ' TO FACT PARTITION '
    + CAST(@PartitionNbr AS NVARCHAR(10));
EXEC (@SwitchIn);
The above code can be executed in a loop by SSIS or a stored procedure. For each execution,
a different partition number is assigned to the @PartitionNbr variable. Note that switching
partitions uses the partition number as a parameter, not the value of the partitioning column.
This means that the new data will need to be mapped to the appropriate partitions. For the
typical case of date based partitioning that uses a smart integer key of the form YYYYMMDD,
the following code will accomplish this mapping:
SELECT
    P.partition_number,
    PF.boundary_value_on_right,
    CAST(LAG(PRV.value, 1, 19000101) OVER (ORDER BY ISNULL(PRV.value, 99991231)) AS INT) AS Lower_Boundary_Value,
    CAST(ISNULL(PRV.value, 99991231) AS INT) AS Upper_Boundary_Value
FROM sys.tables T
INNER JOIN sys.indexes I
    ON T.object_id = I.object_id
    AND I.index_id < 2
INNER JOIN sys.partitions P
    ON P.object_id = T.object_id
    AND P.index_id = I.index_id
INNER JOIN sys.partition_schemes PS
    ON PS.data_space_id = I.data_space_id
INNER JOIN sys.partition_functions PF
    ON PS.function_id = PF.function_id
LEFT OUTER JOIN sys.partition_range_values PRV
    ON PRV.function_id = PS.function_id
    AND PRV.boundary_id = P.partition_number
WHERE T.name = <table name>
For large amounts of data, dwloader, combined with this CTAS and partition switching
methodology, is the preferred method for loading APS. It offers fast and predictable
performance. Additionally, it eliminates the need to perform index maintenance, as the partitions
are constantly being rebuilt as part of the loading process. A rebuild of the index to consolidate
small rowgroups is not needed during a separate maintenance window.
There are some considerations for using this approach. While partition switching is a virtually
instantaneous operation, it does require a brief schema lock on the table being altered. In
environments with either long-running queries or very active query patterns, a schema lock may
not be able to be obtained, causing the partition switching process to wait. If the ETL processes
will not run during business hours, then this should not be a problem.
Statistics
SQL Server Parallel Data Warehouse (PDW) uses a cost based query optimizer and statistics to
generate query execution plans to improve query performance. Up-to-date statistics ensure the
most accurate estimates when calculating the cost of data movement and query operations. It is
important to create statistics and update the statistics after each data load.
Different Types of Statistics
SQL Server PDW stores two sets of statistics at different levels within the appliance. One set
exists at the control node and the other set exists on each of the compute nodes.
Compute Node Statistics
SQL Server PDW stores statistics on each of the compute nodes and uses them to improve
query performance for the queries which execute on the compute nodes. Statistics are objects
that contain statistical information about the distribution of values in one or more columns of a
table. The cost based query optimizer uses these statistics to estimate the cardinality, or
number of rows, in the query result. These cardinality estimates enable the query optimizer to
create a high-quality query execution plan.
The query optimizer could use cardinality estimates to decide on whether to select the index
seek operator instead of the more resource-intensive index scan operator, improving query
performance. Each compute node has the AUTO_CREATE_STATISTICS set to ON, which
causes the query optimizer to create statistics based upon a single column that is referenced
within the WHERE or ON clause of a query. AUTO_CREATE_STATISTICS is not configurable;
it cannot be disabled. Also note that multi-column statistics are not automatically created.
Control Node Statistics
SQL Server PDW stores statistics on the control node and uses them to minimize the data
movement in the distributed query execution plan. These statistics are internal to PDW and are
not made available to client applications. In addition, the DMVs report only statistics on the
compute nodes and do not report statistics on the control node. To create the control node
statistics, PDW merges the statistics from each compute node in the appliance and stores them
as a single statistics object on the control node. For distributed tables, PDW creates a statistics
object for each of the eight distributions on each of the compute nodes, and one statistics object
on the control node for the entire table. For replicated tables, PDW creates a statistics object on
each compute node. Because each compute node will contain the same statistics, the control
node will only copy the statistics object from one compute node.
Statistics can also be created on an external table. The data for an external table exists outside
PDW, either in Hadoop or Azure blob storage. PDW only stores metadata for the table
definition. When you create statistics on an external table, SQL Server PDW will first import the
required data into a temporary table on PDW so that it can then compute the statistics. The
results are stored on the control node. For extremely large external tables, this process can take
significant time. It will be much faster to create sampled statistics, which only requires the
sampled rows to be imported.
Recommendations for Statistics
Collection of Statistics
Collecting statistics is performed based upon a default sample, custom sample percentage, filter
selection, or a full table scan. Increasing the amount of data sampled improves the accuracy of
the cardinality estimates at the expense of the amount of time required in which to calculate the
statistics. For large tables, the default sampling can be used as a starting point, whereas
FULLSCAN can be used for smaller tables. It is also recommended to perform a FULLSCAN for
date (or smart date key) columns that are used in query predicates.
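For example (hypothetical table and column names), the default sample can be used for most columns of a large table, while a FULLSCAN is worthwhile on a date key used in query predicates:

CREATE STATISTICS stat_FactOrders_CustomerId ON dbo.FactOrders (CustomerId);                      -- default sample
CREATE STATISTICS stat_FactOrders_OrderDateKey ON dbo.FactOrders (OrderDateKey) WITH FULLSCAN;    -- full scan
UPDATE STATISTICS dbo.FactOrders;   -- refresh all statistics on the table after a data load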
Collection of Multi-Column Statistics
Statistics on all join columns within a single join condition and for all aggregate columns within a
GROUP BY clause will provide additional information for the optimizer to generate a better
execution plan, such as density vectors. When creating multi-column statistics, the order of the
columns in the statistics object does not matter.
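As a sketch, a multi-column statistics object covering a two-column join condition might look like this (names hypothetical):

CREATE STATISTICS stat_FactOrders_CustomerId_OrderDateKey
ON dbo.FactOrders (CustomerId, OrderDateKey);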
Minimum Requirements for Statistics Collection
At a minimum, statistics should be created for each of the following:

• Distribution Column
• Partition Column
• Clustered Index Keys
• Non-Clustered Index Keys
• Join Columns
• Aggregate Columns
• Commonly used predicate columns
Always ensure that statistics do not go stale by updating the statistics automatically as part of
the transformation schedule or a separate maintenance schedule, depending on the volatility of
the column or columns. Updating statistics is an important task for the DBA.
Some data movement may be necessary to satisfy a query. It is worth mentioning that the PDW
engine and Data Movement Service (DMS) does an excellent job of minimizing data movement
operations using predicate pushdown, local/global aggregation, and table statistics. It is very
important to keep PDW statistics up-to-date so the query optimizer can minimize data
movement as much as possible.
If you are using external tables, it is also required to create and maintain statistics to take
advantage of predicate pushdown. For example, when an external table is created over a
Hadoop data source, statistics will be used to determine if all of the data should be imported
from Hadoop and processed in PDW, or if map-reduce jobs should be created to process the
data in Hadoop and return the results.
Conclusion
The reference architecture for optimizing distributed database design on the Microsoft Analytics
Platform System is intended to provide guidance for designing a distributed database solution
with Microsoft APS. While customers can benefit from the shorter time to market of a data
warehouse solution implemented on APS, care should be given to make sure the solution is
designed to take advantage of the MPP architecture. A key point in designing for APS is how to
spread the data over the appliance. Choosing the correct table type, replicated or distributed,
and the correct distribution column for distributed tables is critical for unlocking the performance
of the appliance. Additionally, indexes need to be chosen to satisfy query requirements, large
tables should be partitioned to support loading and querying, and statistics must be created and
maintained so the query optimizer can ensure good query plans are chosen. With careful
consideration of the points in this document, APS can provide insight into huge volumes of data
without sacrificing data loading or querying performance.
For more information:
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter