Scaling Up Your Data Warehouse with
SQL Server 2008 R2
SQL Server Technical Article
Writers: Eric N. Hanson, Kevin Cox, Alejandro Hernandez Saenz, Boris Baryshnikov,
Joachim Hammer, Roman Schindlauer, and Grant Dickinson of Microsoft Corporation.
Gunter Zink of HP.
Technical Reviewer: Eric N. Hanson, Microsoft Corporation
Editor: Diana Steinmetz
Published: June, 2008; Updated: December, 2010
Applies To: SQL Server 2008, SQL Server 2008 R2
Summary:
SQL Server 2008 introduced many new functional and performance improvements for
data warehousing, and SQL Server 2008 R2 includes all these and more. This paper
discusses how to use SQL Server 2008 R2 to get great performance as your data
warehouse scales up. We present lessons learned during extensive internal data
warehouse testing on a 64-core HP Integrity Superdome during the development of the
SQL Server 2008 release, and via production experience with large-scale SQL Server
customers. Our testing indicates that many customers can expect their performance to
nearly double on the same hardware they are currently using, merely by upgrading to
SQL Server 2008 R2 from SQL Server 2005 or earlier, and compressing their fact
tables. We cover techniques to improve manageability and performance at high-scale,
encompassing data loading (extract, transform, load), query processing, partitioning,
index maintenance, indexed view (aggregate) management, and backup and restore.
Audience: Data Warehouse architects, developers, and DBAs.
Copyright
This is a preliminary document and may be changed substantially prior to final commercial release of the
software described herein.
The information contained in this document represents the current view of Microsoft Corporation on the issues
discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it
should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the
accuracy of any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS,
IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under
copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or
transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or
for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights
covering subject matter in this document. Except as expressly provided in any written license agreement from
Microsoft, the furnishing of this document does not give you any license to these patents, trademarks,
copyrights, or other intellectual property.
Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses,
logos, people, places and events depicted herein are fictitious, and no association with any real company,
organization, product, domain name, email address, logo, person, place or event is intended or should be
inferred.
© 2010 Microsoft Corporation. All rights reserved.
Microsoft, C#, SQL Server, Visual Basic, Windows, and Windows Server are either registered trademarks or
trademarks of Microsoft Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective
owners.
Table of Contents
Introduction
Benefits of Scale-Up Solutions for Data Warehousing
    Reliability and Availability Reduces Unplanned Downtime
Best Practices for Building a Scale-Up Data Warehouse Solution
    Database Design
        Schema Design
        Physical Database Design
    Hardware Selection
        Processors
        Memory
    Software Selection
    SQL Server Settings
    Query Design
    Query Tuning
ETL
    Pipeline Performance
    Lookup Performance
    ETL Features in the SQL Server Engine
    Change Data Capture
    Merge
    Minimally Logged Insert
Periodic Maintenance
    Index Maintenance
    Statistics Maintenance
    Backup
Internal Scale Testing During SQL Server 2008 Development
    Workload
    Hardware Configuration
    Performance Results
Conclusion
References
Introduction
For high performance, simplicity of management, and reliability, scale-up-based data
warehouse configurations based on Microsoft® SQL Server® 2008 R2 offer a very
attractive solution for your enterprise data warehouse needs. Scale-up based data
warehouse solutions are based on single shared-memory multiprocessor (SMP)
computer systems. This paper describes how you can effectively create, manage, and
query your enterprise data warehouse with SQL Server 2008 R2 when using a large
SMP.
The goals for a data warehouse solution are typically the following, although they may
vary somewhat in different environments:
• Support rapid response time for queries that summarize large volumes of data.
• Support efficient drill-through queries that extract moderate volumes of detail data for further analysis.
• Support rapid initial data loading.
• Support efficient incremental loading (for example, load 24 hours of data in a one-hour window).
• Support fast maintenance operations (backup, restore, loading, indexing, compression, bulk removal of old data).
• Given that the above goals are met, limit total cost of ownership (TCO):
    • Minimize the database administrator (DBA) time required during setup and for normal operations.
    • Keep hardware costs reasonable.
Data warehouse designs are most commonly based on a star or snowflake schema, also
known as a dimensional model. This approach has advantages for query performance
and makes it easy for business users to understand the data. Some parts of this paper
assume basic knowledge of star-schema-based data modeling. Ralph Kimball and Margy
Ross provide an excellent reference on dimensional modelingi. Mundy et al. give
another good reference on applying dimensional modeling specifically with Microsoft
productsii.
Data warehouses are also sometimes based on normalized schemas. An example of this
is a schema containing one-to-many relationships between two large tables, such as
Orders and LineItems from an order-entry system. One reason people use this type of
schema is that they sometimes mirror the schema of the online transaction processing
(OLTP) system that is the primary source of the data. This makes it easier to load the
data warehouse simply by accumulating more history from the OLTP system, with few
or no other transformations. If historical data needs to be updated, a normalized
schema can make this easier. A common approach is to build star-schema-based
subject area data marts for reporting from a normalized enterprise data warehouse
used for integrating data and storing large volumes of fine-grained historical data.
Whether you are using a star schema or a normalized schema for your data warehouse,
the ideas covered in this white paper apply. Because the star schema approach is
common and well-defined, we assume it is the approach being used unless stated
otherwise.
Benefits of Scale-Up Solutions for Data
Warehousing
For purposes of this discussion, we consider a “scale up” system to be any data
warehouse machine based on a single SMP with 4 or more processors. These include 4
and 8 processor SQL Server Fast Track Data Warehouseiii configurations, as well as
larger SMP systems such as the HP Integrity SuperDome. Scale-up based solutions
running SQL Server 2008 R2 can provide an excellent combination of rich
programmability, broad application compatibility, high performance, scalability,
management simplicity, high reliability, and reasonable total hardware purchase cost.
An alternative to scaling up is to scale out using a SQL Server Parallel Data Warehouse iv
(PDW) configuration. PDW can support tremendous scalability, high performance
loading and updates, and fast query processing, all at lower hardware cost than the
highest-end scale-up systems. However, the other SQL Server Editions currently
support a broader set of programming capabilities and language features than PDW V1,
and integrate with a broader set of third-party applications. For those customers who
want to use these features, and otherwise want full compatibility with SQL Server,
scaling up is attractive.
Reliability and Availability Reduces Unplanned
Downtime
High-end scale-up systems typically offer very high reliability. For example, the
HP Integrity Superdome offers added RAS (Reliability, Availability, and Serviceability)
capabilities. And the HP Integrity product line supports the ability to logically remove a
single failed processor or memory module (DIMM) from the system automatically, until
it can be replaced by a technician. The system can operate even with the failure of up
to half of the processors. Similarly, automatic error detection, error correction, error
isolation, and recovery from memory failures are supported. The memory systems of
computers such as the HP Integrity Superdome offer sophisticated error correction,
making them tolerant of transient memory faults and hard faults in individual memory
components. High-end scale-up systems also support hot-add of memory. Added
memory becomes immediately visible to all the processors of the scale-up system—it is
not necessary to restart the server or SQL Server.
Scale-up based systems are often used together with Storage Area Networks (SANs),
such as the HP XP 240000. SANs also simplify some management operations compared
to the direct-attached storage used in scale-out approaches. For example, SANs
typically support operations such as:
• Very fast copying of database files, which is useful as part of the extract, transform, load (ETL) process.
• Block-level replication of entire volumes for disaster recovery.
• Copy-on-write snapshot of a database from the production environment to use for many purposes.
These operations can make it much simpler to manage your data warehouse and they
work well in a scale-up computing system.
Best Practices for Building a Scale-Up Data
Warehouse Solution
A number of best practices will help you get the best performance and the simplest and
most cost-effective system management using SQL Server 2008 R2 in a scale-up
configuration. These best practices cover the following:
• Database design
• Schema design
• Physical design
• Hardware selection and configuration
• Software selection and configuration
• Data loading
• Query specification and tuning
Some of these best practices are independent of the type of hardware being used.
Others are more closely related to the use of a scale-up configuration. We focus here on
the best practices that are most relevant to scale up, although we touch on other best
practices briefly as well. This paper focuses mainly on software techniques associated
with SQL Server 2008 R2. For other best practices for using SQL Server on large SMPs
from HP with an emphasis on hardware capabilities and configuration, you may wish to
consult one of several detailed references on the subject from HP v.
Database Design
This section covers best practices for designing schemas and for the physical design of a
database, such as disk layout, indexes, compression, and summary aggregates.
Schema Design
Database schema design (logical design) can have a major impact on the performance
of queries and updates. For best performance, and ease of understanding the data and
writing queries, we recommend using a star schema (also known as dimensional
modelingvi) if that is feasible in your data warehouse environment. SQL Server 2008
Enterprise Edition has special star join query optimizations that take advantage of this
type of schemavii. If you prefer a normalized schema (typically third normal form viii),
you can still get excellent query performance, but certain types of star join
optimizations may not be possible for your queries.
The data types you choose for columns have a performance impact. Use integer
(4-byte) surrogate key columns to join your fact table to your dimension tables in a star
schema. These compress well and support fast comparisons during joins. For decimal
measure values, use the money data type if possible because operations on it are
faster than for general numeric or decimal types. Avoid the use of long fields of any
type in your fact table unless it is essential to the behavior of your application. This
helps keep your fact table smaller and usually speeds up query processing and data
loading.
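For illustration, a fact table following these guidelines might be declared as shown below. The FactSales table and its columns are hypothetical placeholders, not objects used elsewhere in this paper:

CREATE TABLE dbo.FactSales
(
    DateKey       int   NOT NULL,  -- 4-byte integer surrogate key (YYYYMMDD), joins to the date dimension
    ProductKey    int   NOT NULL,  -- 4-byte integer surrogate key, joins to the product dimension
    CustomerKey   int   NOT NULL,  -- 4-byte integer surrogate key, joins to the customer dimension
    SalesAmount   money NOT NULL,  -- money aggregates faster than general numeric or decimal types
    OrderQuantity int   NOT NULL
);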
Physical Database Design
After you choose your logical database design, how you physically structure your data
warehouse data has a major performance impact. Physical design considerations
include partitioning, indexing, creation of summary aggregates, compression, layout of
tables and partitions on storage devices, and configuration of tempdb. We highly
recommend that you follow the guidelines for creating LUNs, files, filegroups, tables,
indexes, and loading data, given in the most current Fast Track architecture guideix. The
guide goes into significantly more detail than we give here.
Table Partitioning
With the use of large SMPs for scale up, it becomes feasible to manage and query very
large tables, even tables with billions of rows. With tables of this size, it is important to
be able to break them down into manageable-sized chunks for system management
operations so that these operations complete in a reasonable time. The table and index
partitioning feature in SQL Server Enterprise Edition fills this need. For a data
warehouse fact table beyond 50 GB, it is recommended that you partition it by time,
such as by week. This allows you to bulk purge a week of data at a time with a
metadata-only operation. It also can improve performance of loading, and management
of summary aggregates maintained using indexed views, as discussed later.
Note: SQL Server table partitioning is separate and distinct from the hardware
server partitioning supported by the HP Integrity Superdome.
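As a minimal sketch of weekly partitioning on an integer YYYYMMDD date key, the statements below create a partition function, a partition scheme, and a partitioned fact table. The function, scheme, boundary values, and table are hypothetical; a real implementation would generate boundaries for the full date range and usually spread partitions across multiple filegroups:

-- Weekly boundary values on an integer YYYYMMDD date key (only a few shown here)
CREATE PARTITION FUNCTION pfWeekly (int)
AS RANGE RIGHT FOR VALUES (20110103, 20110110, 20110117, 20110124);

-- Map all partitions to one filegroup for brevity
CREATE PARTITION SCHEME psWeekly
AS PARTITION pfWeekly ALL TO ([PRIMARY]);

-- Recreate the hypothetical FactSales table from the earlier sketch, this time on the partition scheme
CREATE TABLE dbo.FactSales
(
    DateKey     int   NOT NULL,
    ProductKey  int   NOT NULL,
    SalesAmount money NOT NULL
) ON psWeekly (DateKey);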
Index Design
A basic index design that is often effective for a star schema in SQL Server is to:
1. Create an integer date key in the format YYYYMMDD and use it both as the surrogate key
of the date dimension and as the foreign key in the fact table1. Create a clustered
index on the date key column on the fact table. With the date key in this format,
you can express date range filters conveniently and explicitly on the fact table. As
we will discuss later, this in turn can give you better parallel query speed in some
cases.
2. Create nonclustered indexes on the fact table on the surrogate keys of the most
frequently used dimensions (this is optional – see the Fast Track guide for more
details).
3. Create nonclustered indexes on large dimension tables (such as a Customer
dimension with millions of rows) on the columns you search on frequently in those
tables.
If you use partitioning to support a sliding-window scenario, partition the fact table and
all its indexes on the date key. If your table partition width is one day, there is no need
for the clustered index on the date key of the fact table in this situation. Instead, create
a clustered index on the most frequently used dimension key column of the fact table
besides the date key.
1 Alternatively, you may wish to use the new DATE built-in data type as the surrogate key type for the
date dimension. However, be aware that seeks into the fact table are not normally supported for
operations that use date functions. For best performance, use <, >, <=, >=, BETWEEN, or IN ( … list
of keys … ) to filter fact table rows by date, regardless of the type of the column used as the surrogate
key for date.
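A minimal sketch of the basic index design described above (without the sliding-window variation) follows, using the hypothetical FactSales table from the earlier sketches and a hypothetical DimCustomer dimension:

-- 1. Clustered index on the integer YYYYMMDD date key of the fact table
CREATE CLUSTERED INDEX cix_FactSales_DateKey ON dbo.FactSales (DateKey);

-- 2. Nonclustered indexes on the most frequently used dimension keys (optional)
CREATE NONCLUSTERED INDEX ix_FactSales_ProductKey ON dbo.FactSales (ProductKey);

-- 3. Nonclustered index on a frequently searched column of a large dimension
CREATE NONCLUSTERED INDEX ix_DimCustomer_PostalCode ON dbo.DimCustomer (PostalCode);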
Design of Summary Aggregates
Experienced data warehouse developers have long known that one of the best ways to
get good data warehouse query performance is to use summary aggregates. These
summarize data in the database into sets of rows that are typically much smaller than
the fact table. If most of your queries can be answered from these aggregates, which
are much smaller than the raw event data, query performance can be improved by an
order of magnitude. This in turn can make it possible to meet your users’ performance
requirements on a smaller hardware platform, or provide better quality service to your
users, or both.
The primary forms of summary aggregates you can use with SQL Server are:
• SQL Server indexed views
• User-defined summary tables
Consider using indexed viewsx or user-defined summary tablesxi to accelerate your
common aggregate queries. Indexed views can be used to maintain summary
aggregates automatically. In general, the SQL Server 2008 R2 query processor can
automatically use an indexed view to solve an aggregate query matching the structure
of the view. If the SQL Server query processor is not able to use an indexed view to
solve a query that actually could be solved using the indexed view, you can manually
rewrite the query to reference the indexed view by name, and use the NOEXPAND hint
on the reference to the indexed view in your query.
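As an illustration of this technique, the sketch below defines and indexes a view that pre-aggregates sales by product, and then references it with NOEXPAND. The view name and its use of AdventureWorksDW-style tables are our own, not an example from the original tests:

CREATE VIEW dbo.vSalesByProduct
WITH SCHEMABINDING
AS
SELECT f.ProductKey,
       SUM(f.SalesAmount) AS TotalSales,
       COUNT_BIG(*) AS RowCnt            -- COUNT_BIG(*) is required in an indexed view with GROUP BY
FROM dbo.FactInternetSales AS f
GROUP BY f.ProductKey;
GO
-- The unique clustered index is what materializes (indexes) the view
CREATE UNIQUE CLUSTERED INDEX cix_vSalesByProduct ON dbo.vSalesByProduct (ProductKey);
GO
-- If the optimizer does not match the view automatically, reference it by name with NOEXPAND
SELECT ProductKey, TotalSales
FROM dbo.vSalesByProduct WITH (NOEXPAND)
WHERE ProductKey BETWEEN 300 AND 350;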
Indexed views are automatically maintained by SQL Server when you update your
database. If you need to be able to incrementally update your database, such as to
correct errors, and have the aggregates be updated automatically, this makes indexed
views a good choice. A significant improvement introduced in SQL Server 2008 is
partition aligned indexed viewsxii. This feature enables you to switch partitions of tables
in and out even if those tables have indexed views defined on them. This eliminates the
need to drop and rebuild indexed views during your ETL process, and can greatly speed
up ETL, or make it feasible to use indexed views on partitioned tables when it was not
feasible before.
Indexed views do have restrictions regarding what views can be indexed. If you want
full flexibility and are able to maintain aggregates yourself in your ETL process, you
may wish to use summary tables as described by Adamsonxiii, and modify your SQL
statements to refer to the appropriate summary tables for best performance.
Compression
Data compression, a feature introduced in SQL Server 2008 Enterprise
Editionxiv, is extremely valuable in a scale-up environment. It can dramatically reduce
I/O requirements by (a) reducing the amount of data that must be read from disk, and
(b) increasing the percentage of the data that resides in main memory during normal
operation. The second effect can be dramatic, speeding up some queries by an order of
magnitude or more. We have observed compression factors ranging from 2X to 7X, with
3.5X being typical. SQL Server 2008 R2 improves character data compression by
introducing Unicode compression, roughly halving the storage requirement for nchar
and nvarchar data types for many character sets.
Microsoft SQL Server Customer Advisory Team engineers have determined that for
most large data warehouses, once about 20 percent of the data is in main memory,
little or no I/O needs to be done to process queries. In practice, this 20 percent
represents the most active portion of the database. As a best practice, a rule of thumb
is to:
• Compress your fact tables by using the PAGE compression featurexv.
• Add enough memory to your system so that at least 10-20% of the compressed data will fit.
Compression is not free. Its main cost is the increased CPU time required to load data.
With page compression (the form of compression with the highest compression factor),
CPU usage during loading can increase by about a factor of 2.5. Another form of
compression, called row compression, results in smaller compression factors, but the
cost to compress is significantly lessxvi. The actual benefits and costs of compression
depend on your data, hardware, and workload. However, in large-scale internal tests on
a 600-GB and 6-terabyte data warehouse and a workload of 123 different queries,
when using page compression, we observed a 30-40% improvement in query response
time with a 10-15% CPU time penalty.
SQL Server 2008 also introduced a backup compression featurexvii. Backing up a
compressed database is faster than backing up a non-compressed one because reading
the source data is faster. When backup compression is enabled, backing up a
compressed database normally results in a backup that is 10-15% smaller than when
backing up a non-compressed database, depending on the data.
Compression can be applied to individual partitions. If you have a large table that is
already partitioned, but the partitions are not compressed, and you do not have a large
time window available in which you can compress the full table, consider compressing
only the new data during ETL. Then over time, as you age out old partitions and add
new ones, the entire data set will gradually become compressed, without the need to
extensively re-organize the data all at once.
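For example (reusing the hypothetical partitioned FactSales table from the earlier sketches, with an illustrative partition number), PAGE compression can be applied to the whole table, applied to a single newly loaded partition during ETL, or estimated before committing to it:

-- Compress the entire fact table
ALTER TABLE dbo.FactSales REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);

-- Or compress only one partition, such as the newly loaded one (partition 4 here is illustrative)
ALTER TABLE dbo.FactSales REBUILD PARTITION = 4 WITH (DATA_COMPRESSION = PAGE);

-- Estimate the space savings before compressing
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo', @object_name = 'FactSales',
    @index_id = NULL, @partition_number = NULL, @data_compression = 'PAGE';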
Disk Layout
According to the Gartner Group, storage throughput is the most critical resource for a
good data warehouse system. It is also one of the most expensive. A rule of thumb for
any database system is that multiple small drives are better than one big one. The
number of spindles you can apply to your storage system is probably the single most
important factor in creating a high throughput IO subsystem.
Read-and-write caching in the disk devices or the SAN can be a big benefit up to a
certain point. However, very large table scans can quickly fill this hardware-level cache
and cause it to not give much benefit.
While a proper configuration of direct attached storage systems can equal or exceed the
throughput ratings of the SANs, SAN systems have features that are highly beneficial
for very large databases. The SAN snapshots, copy on write, and other similar features
can make administrative tasks much easier.
To help you determine how good your disk layout is, download SQLIO.exe or
IOMETER.exe. Run one of these tools to get a rating for both the number of IO/sec and
throughput in bytes/sec. SQL Server uses a variety of block sizes for both reads and
writes so it is important to run a variety of tests that will match your usage patterns.
There are three basic parameters to alter when running the I/O tests: block size,
read/write, and serial/random. For SQL Server, the minimum tests to run are 8-KB and
64-KB block sizes, both serial and random, and both read and write. Most of the ratings
you find on the Web are stated in throughput of bytes/sec for the 64-KB serial read test
(which gives the highest throughput rating). A value of over 250 MB/sec per core for
today’s computers is typically sufficient.
A number of configuration settings need to be set to work with SQL Server depending
on the storage subsystem in use. The best advice is to ask your hardware vendor for
the recommended settings for use in SQL Server. There is one configuration setting
that is common to most disk subsystems or host bus adapters (HBAs) that can be
beneficial to change. The Queue Depth setting is normally 4, but for SQL Server data
warehouse systems it should be set to a minimum of 32. If you have many
Fibre Channel connections, use the maximum of 254. For instructions on how to adjust
the Queue Depth setting, consult your storage hardware documentation about how to
manage the disk controller configurations.
It is also critical to do sector alignment so that one I/O does not cross physical sectors.
Use the diskpar.exe program to change the disk alignment.
After you establish that your I/O system can provide the necessary raw throughput,
make sure that the filegroups underneath each physical index or heap have adequate
I/O bandwidth under them. A data warehouse fact table (or fact table partition if the
table is partitioned) ideally should have at least 250 MB/sec I/O capacity per processor
core if a significant part of your workload is I/O bound (limited by I/O performance).
There are many explanations published about the proper disk layout for SQL Server.
See the SQL Server Customer Advisory Team materialxviii. In addition, whether you are
using SANs or direct attached storage, refer to the disk layout guidelines in the most
recent Fast Track Data Warehouse guideiii.
tempdb Configuration
A common problem is having tempdb in a small file on the C: drive. You need to add
tempdb files on a SAN or set of direct-attached disks with sufficient I/O throughput,
and remove tempdb from the C: drive, because extent allocations are written on the
first database file. These allocations may not be fast enough if you leave that file on a
relatively slow C: drive. Similarly, log files should also not be on C:.
tempdb should be set to autogrow in case of emergencies, but in normal operation it
should not need to grow beyond its original size. It is considered a best practice to
create tempdb at an appropriate size up front.
tempdb should also contain one file per CPU core. For example, if you have four dual
core CPUs, you should create eight files in tempdb. These files should be of equal size
to achieve round-robin loading. This is considered a general best practice but may not
fit all projects. And if you do not have heavy tempdb usage, it really doesn’t matter
how many files you create.
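A minimal sketch of adding equally sized tempdb files is shown below; the file names, sizes, and drive paths are placeholders to adapt to your own core count and storage layout:

ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb\tempdev2.ndf', SIZE = 8GB, FILEGROWTH = 1GB);
ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev3, FILENAME = 'T:\tempdb\tempdev3.ndf', SIZE = 8GB, FILEGROWTH = 1GB);
-- Resize the primary tempdb data file to match so that round-robin allocation stays balanced
ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, SIZE = 8GB, FILEGROWTH = 1GB);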
Whether to place your tempdb files on the same disk fabric as the main database
depends on your usage patterns. For some projects it is okay to put the database and
tempdb on the same spindles. If your tempdb activity is mostly on in-memory pages,
there will be very little physical I/O and it would be okay for tempdb to reside on the
same disks. For other projects this would be a mistake. Knowing your usage patterns is
essential to making this decision. Refer to the Fast Track Data Warehouse guideiii for
additional information about how to configure tempdb.
Partitioning Server Resources
HP Integrity Servers allow for more efficient operations at lower cost. This may be
accomplished through virtualization. Virtualization may be implemented several
different ways:
• nPars (hard partitions - NUMA)
• Soft partitions (soft NUMA)
• Pay as you go
• Manage a single server with Windows System Resource Manager (WSRM) and/or SQL Server 2008 R2 Resource Governor
• Virtuoso for virtualized containers
The above partitioning strategies allow various databases and/or applications to share a scale-up server.
Hardware Selection
Best practices for hardware selection include considerations for assessing processor and
memory resources. Refer to the Fast Track Data Warehouse guideiii for additional
information about how to select hardware.
Processors
The number of processors needed is usually the most difficult hardware characteristic to
choose. Several things generally determine how many processors are needed:
1. The number of queries executed in a specific period of time
2. Query complexity (simple, medium, complex or very complex)
3. Desired response time for a single query. This can determine the amount of
parallelism you want to achieve per query.
With today's multi-core processors, the estimation gets slightly more complicated.
Each additional core does not add the same processing power as the original. A
practical estimate is to assume each additional core is 50% of the original, although this
can vary depending on your usage.
Taking your workload to a lab for testing may be the best way to determine the number
of processors needed. Hewlett Packard and Microsoft both have a number of facilities
with high-end computers that can be scheduled. When creating your suite of tests to
run, do not just test the normal part of the application—you may end up with a smaller
computer than you need. Add the maintenance tasks to your list of tests. These include
index maintenance, backups, and DBCC programs. These maintenance tasks parallelize
very well and may require more CPUs than the other parts of your application.
Both Windows® Server 2003 and Windows Server 2008 have a maximum limit of 64 cores. Windows
Server 2008 R2 extends this limit to 256 cores. (Note that hyperthreading simulates
processor cores and this limit applies to these simulated cores if hyperthreading is
enabled.)
To estimate the amount of hardware required for your application, HP recommends
using capacity planning models or, if you have a new application, use either HP’s
published Reference Configurations or the sizing tools on HP’s ActiveAnswers Web site.
Memory
Having data already in memory can make query run time more predictable. Having a
fast disk subsystem is essential for quickly getting data into memory. It is also best if
the data in memory can be used by other queries.
Empirical evidence has shown that 4 GB per core is sufficient for most data warehouse
applications. This estimate can be influenced by the number of simultaneous queries or
transactions as well as the number of large table/index scans needed.
Software Selection
For the best results in scaling up your computer to its largest configuration, it is best to
use the following:
• Windows Server® 2008 R2 Enterprise or Datacenter Editions
• SQL Server 2008 R2 Enterprise Edition (SQL Server 2005 and 2008 Enterprise editions are both available in versions for the Intel Itanium processor). If you are using more than 8 physical processors, SQL Server 2008 R2 Datacenter edition is required.
This will give you the following maximum limits:
• 256 cores
• 2 terabytes of RAM
Many of the data warehouse scale enhancements in SQL Server 2008 R2 are only
available in the Enterprise edition and higher.
SQL Server Settings
There are several issues to consider when configuring the SQL Server 2008 R2
Database Engine for data warehouse workloads.
Configuring “max server memory” xix
The SQL Server 2008 R2 Database Engine manages memory dynamically and by default
is configured to use as much memory as it can get from the operating system. While
this is desired on a dedicated system where nothing else is running, it may starve other
applications from getting memory if they are sharing the computer. This is especially
important if multiple instances of the SQL Server 2008 R2 Database Engine or other
components, such as Analysis Services, Reporting Services, and Integration Services
are running on the same system. We recommend that you configure max server
memory to a value that accounts for other memory-consuming applications and the
operating system itself2.
2 Enterprise Edition of 64-bit SQL Server 2008 R2 uses so-called Large Memory Pages (see Large Page Support, MSDN Library), which are not reflected in the memory usage of the SQL Server process. In this case, it may appear that SQL Server is not using memory while the system may be starved for memory. This is similar in concept to using the AWE mechanism on a 32-bit system. Information about memory usage by SQL Server should be obtained from either SQL Server-provided performance counters or DBCC MEMORYSTATUS command.
Configuring “max degree of parallelism”xx
We recommend setting max degree of parallelism for the server to its default value
0 to be able to fully use the parallelism enhancements in SQL Server 2008 R2 unless
you experience parallelism performance problems. If you experience parallel query
performance problems in a multi-user parallel query environment, reduce the max
degree of parallelism setting to ½ or ¼ of the number of logical processors on the
machine (but not more than 64). See the paper by Don Pinto et al. on SQL Server
parallel query executionxxi for more details. The default maximum degree of parallelism
gives an effective degree of parallelism of 64 in SQL Server 2008 R2 if the machine has
more than 64 logical processors. Note that you can control the degree of parallelism of
a particular workload with Resource Governor as outlined in the following sections. You
can also cap the degree of parallelism for a specific query by using the MAXDOP query
hint.
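The following sketch shows how these two settings can be adjusted with sp_configure; the 28 GB max server memory value is purely illustrative and should be sized for your own system:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Leave headroom for the operating system and any other services on the computer
EXEC sp_configure 'max server memory (MB)', 28672;   -- 28 GB, illustrative only
RECONFIGURE;
-- Keep the default max degree of parallelism unless you see parallel query problems
EXEC sp_configure 'max degree of parallelism', 0;
RECONFIGURE;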
Configuring affinity masksxxii
Affinity masks (CPU and I/O) are generally used when consolidating multiple instances
on the same system or reserving CPUs for other applications but eventually their use
may result in CPU underutilization. There are other means of controlling CPU usage in
such environments, including Resource Governor and Windows System Resource
Manager (WSRM).
Using Resource Governorxxiii
Resource Governor is a feature introduced in SQL Server 2008 that helps improve the
predictability of workload execution on the server for multiple combined workloads (for
example, ad-hoc queries with pre-canned reports). Separating workloads into distinct
workload groups helps monitor their resource usage. Furthermore, Resource Governor
provides maximum/minimum CPU bandwidth and maximum/minimum aggregate
memory usage controls. Each individual workload can be configured for a particular
maximum degree of parallelism depending on the style and type of queries being run in
the workloadxxiv. Setting maximum resource usage helps contain large workloads, while
setting minimums guarantees that mission-critical workloads have predictable execution
times on a loaded server. For details on parameters controlled by Resource Governor,
see SQL Server Books Onlinexxv and the Resource Governor best practices
whitepaperxxvi.
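As a simplified sketch (the pool, workload group, login name, and limits below are hypothetical and illustrative only), an ad-hoc workload could be placed in its own group with capped CPU and a capped degree of parallelism:

USE master;
GO
CREATE RESOURCE POOL AdhocPool WITH (MAX_CPU_PERCENT = 50);
CREATE WORKLOAD GROUP AdhocGroup WITH (MAX_DOP = 8) USING AdhocPool;
GO
-- Classifier function routes sessions to the appropriate workload group
CREATE FUNCTION dbo.fnClassifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @grp sysname = N'default';
    IF SUSER_SNAME() = N'adhoc_login'      -- hypothetical login used for ad-hoc queries
        SET @grp = N'AdhocGroup';
    RETURN @grp;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;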
Using Windows System Resource Manager (WSRM) xxvii
SQL Server Resource Governor provides controls to manage CPU and memory
consumption within the SQL Server process (sqlservr.exe). WSRM is a recommended
way to manage CPU bandwidth allocation between different processes on the server,
such as different database instances or other components such as Analysis Services.
WSRM is best used when more than one service that consumes a significant amount of
resources is installed on the same computer. These services could include any
combination of SQL Server, Analysis Services, Reporting Services, and Integration
Services, your own services, and third-party services. In this scenario WSRM would be
used to manage the processor affinity for Analysis Services, Reporting Services, and
Integration Services. It is best to use the built in controls for processor affinity for
SQL Server. WSRM would also be used to control the memory consumption for
Reporting Services and Integration Services since Analysis Services and SQL Server
have their own controls.
Query Design
Designing your queries well can have a big positive impact on how they perform. Query
design is, of course, closely related to schema design. Again, we recommend using a
star schema if that is workable for your application. Important things to remember
when you design your queries are:
Keep your queries simple.
Specifically, try to use a single star join with grouping and aggregation when possible.
SQL Server Enterprise Edition has specific optimizations for these types of queries. In
addition, avoid using expressions, functions, or local variables in the WHERE clause, as
these will cause SQL Server to “guess” the selectivity of a condition. Avoiding these
constructs makes the query optimizer much more likely to choose a good query plan. See the white paper on
statistics in SQL Server for additional details on what constructs to avoid in queries to
help the optimizer get a good query planxxviii.
In addition, avoid including parameters or variables (such as @ParameterName or
@VariableName) in your data warehouse queries unless you know you need to save on
compile time. Use literals (that is, actual numbers, dates, or strings) in your query
instead of parameters. This will help you get a better query plan. For data warehouse
queries, the cost of compiling the plan is usually dwarfed by the cost of execution, so
plan reuse is not important. If you want to parameterize your queries to make it easier
to program your application or to improve security as a way to prevent SQL injectionxxix,
use the OPTION(RECOMPILE) hint on your query to make sure you get the best plan for
the specific parameter values you are passing in to your query.
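For example, a parameterized version of the kind of report query used later in this paper can keep its parameters and still get a plan tailored to the actual values. This particular query is our own illustration against the AdventureWorksDW-style schema:

DECLARE @StartDate int = 20110101, @EndDate int = 20110107;

SELECT p.ProductKey, SUM(f.SalesAmount) AS SalesAmount
FROM FactInternetSales f
JOIN DimProduct p ON f.ProductKey = p.ProductKey
WHERE f.OrderDateKey BETWEEN @StartDate AND @EndDate
GROUP BY p.ProductKey
OPTION (RECOMPILE);   -- compile the plan for the actual parameter values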
Consider using temp tables instead of CTEs.
Although common table expressions (CTEs) are convenient for programming, they do
not force SQL Server to generate a temporary partial query result. Heavy use of CTEs in
your queries may lead to redundant computation of the CTE expression results by the
query plan in some cases. For better performance, consider using temp tables instead
of CTEs, in order to break down a large business question into manageable chunks. This
may of course mean you have to split a single query with several CTEs into multiple
queries that retrieve results into temp tables. Temp tables have statistics while table
variables do not, which is why we do not recommend table variables here.
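A minimal sketch of this pattern, using the AdventureWorksDW-style tables that appear in the examples below, materializes an intermediate result once and reuses it:

-- Materialize the intermediate result once; unlike a CTE, the temp table has statistics
SELECT f.ProductKey, SUM(f.SalesAmount) AS SalesAmount
INTO #ProductSales
FROM FactInternetSales f
WHERE f.OrderDateKey BETWEEN 20110101 AND 20110107
GROUP BY f.ProductKey;

-- Later queries reuse the temp table instead of recomputing the expression
SELECT TOP 10 ProductKey, SalesAmount
FROM #ProductSales
ORDER BY SalesAmount DESC;

DROP TABLE #ProductSales;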
Put an explicit date filter on the fact table.
Avoid implying a range filter on the fact table by using a join with a “Date,” “Time” or
“Period” dimension unless it is truly necessary. This will help avoid some situations with
partitioned fact tables where partitions are touched even though they cannot contain
relevant data for the query. Even more importantly for scale up, this also allows pure
hash-join plans to be chosen more frequently. These pure hash-join plans parallelize
well and enable you to benefit from the power of all the processor cores on your
shared-memory multiprocessor server.
In order to be able to explicitly filter on date on the fact table easily, encode the
surrogate keys for date as integers in the form YYYYMMDD. For example, April 25, 2011
would be represented as 20110425.
For example, suppose we have a database with the same schema as the
AdventureWorksDW sample database available with SQL Serverxxx, except that it has a
very large partitioned fact table for Internet sales. Furthermore, we modified the
DimTime dimension and this fact table to use surrogate keys for the date, which is
encoded as YYYYMMDD. To find the top 10 products in total sales between January 1
and 7, 2011, the following would be a good query to use:
select top 10 p.ProductKey, sum(f.SalesAmount)
from FactInternetSales f, DimProduct p
where f.ProductKey=p.ProductKey and p.ProductAlternateKey like N'BK-%'
and f.OrderDateKey between 20110101 and 20110107
group by p.ProductKey
order by sum(f.SalesAmount) desc
Avoid a query like the following when possible because it filters on date via join:
select top 10 p.ProductKey, sum(f.SalesAmount)
from FactInternetSales f, DimProduct p, DimTime t
where f.ProductKey=p.ProductKey and p.ProductAlternateKey like N'BK-%'
and f.OrderDateKey = t.TimeKey
and t.MonthNumberOfYear = 1
and t.CalendarYear = 2011
and t.DayNumberOfMonth between 1 and 7
group by p.ProductKey
order by sum(f.SalesAmount) desc
If you must include values from a period dimension in the output of your query to
include it in a report or to control the grouping level for aggregation, you can join it to
the fact table but still not use it to filter. You would still filter explicitly on the fact table.
For example, the following query joins on DimTime but does not filter using it, and pulls
in the year and month for display:
select top 10 p.ProductKey, t.CalendarYear, t.EnglishMonthName,
sum(f.SalesAmount)
from FactInternetSales f, DimProduct p, DimTime t
where f.ProductKey=p.ProductKey and p.ProductAlternateKey like N'BK-%'
and OrderDateKey between 20110101 and 20110107
and f.OrderDateKey=t.TimeKey
group by p.ProductKey, t.CalendarYear, t.EnglishMonthName
order by sum(f.SalesAmount) desc
Avoid complex expressions inside aggregates in your query.
For example, instead of SUM(1.1*x) use 1.1*SUM(X). This can reduce the amount of
time spent evaluating expressions at the innermost level of query execution.
Avoid mixing DISTINCT and non-DISTINCT aggregates in the same query.
SQL Server sometimes generates multi-consumer spool operators as part of a query
plan. The portion of the query plan above a multi-consumer spool typically is not
parallelized. For this reason, if you have a performance problem with a query that has a
multi-consumer spool, consider rewriting the query to eliminate the multi-consumer
spool. Multi-consumer spools typically arise when you mix DISTINCT and non-DISTINCT
aggregates in the same query. In the following query, we use AGG1 and AGG2 to
represent any aggregates, such as SUM, COUNT, MIN, and MAX. For example, a query
or subquery of this form may generate a multi-consumer spool:
SELECT a, AGG1(DISTINCT b), AGG2(c)
FROM table
GROUP BY a
You can rewrite this to eliminate the use of DISTINCT and thus remove the multi-consumer spool from the plan, as follows:
SELECT a, AGG1(b), AGG2(partial_c)
FROM
(
SELECT a, b, AGG2(c) AS partial_c
FROM table
GROUP BY a, b
) AS T
GROUP BY a
This rewrite is only correct if the aggregates AGG1 and AGG2 can be computed
incrementally from their input, regardless of how the input has been ordered, or
partially grouped and aggregated. SUM, COUNT, MIN, and MAX all have these required
characteristics, although when AGG2 is COUNT, the outer aggregate over the partial
counts must be SUM rather than COUNT. However, user-defined CLR aggregates might not
have the necessary properties.
Use GROUPING SETS.
The new GROUPING SETS feature introduced in SQL Server 2008 can enable you to
avoid redundant computation and shorten your queries. For example, suppose you want to
create a report that shows sales totals for a year, quarter, and country, in addition to
country totals, period totals, and a grand total. A typical way to display this information
would be a report with the format shown in the following figure.
Figure 1: Pivot table report layout suitable for displaying data retrieved by
GROUPING SETS query
The GROUP BY CUBE notation in SQL is not sufficient to express this query because the
period of interest is represented by two columns, not one. The new GROUPING SETS
feature enables you to specify all sets of grouping columns you are interested in, in a
single relatively simple query as follows:
USE AdventureWorksDW;
SELECT
D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry,
SUM(F.SalesAmount) AS SalesAmount
FROM
dbo.FactResellerSales F
INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey
INNER JOIN dbo.DimSalesTerritory T ON
F.SalesTerritoryKey = T.SalesTerritoryKey
WHERE D.CalendarYear IN ('2010', '2011')
GROUP BY GROUPING SETS (
(D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry),
(D.CalendarYear, D.CalendarQuarter),
(T.SalesTerritoryCountry),
()
);
Without using GROUPING SETS, the simplest way to write this query is as the union of
several SELECTs, one for each grouping set, similar to the following:
SELECT
D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry,
SUM(F.SalesAmount) AS SalesAmount
FROM
dbo.FactResellerSales F
INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey
INNER JOIN dbo.DimSalesTerritory T ON
F.SalesTerritoryKey = T.SalesTerritoryKey
WHERE D.CalendarYear IN ('2010', '2011')
GROUP BY D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry
UNION ALL
SELECT
D.CalendarYear, D.CalendarQuarter, NULL,
SUM(F.SalesAmount) AS SalesAmount
FROM
dbo.FactResellerSales F
INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey
INNER JOIN dbo.DimSalesTerritory T ON
F.SalesTerritoryKey = T.SalesTerritoryKey
WHERE D.CalendarYear IN ('2010', '2011')
GROUP BY D.CalendarYear, D.CalendarQuarter
UNION ALL
SELECT
NULL, NULL, T.SalesTerritoryCountry,
SUM(F.SalesAmount) AS SalesAmount
FROM
dbo.FactResellerSales F
INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey
INNER JOIN dbo.DimSalesTerritory T ON
F.SalesTerritoryKey = T.SalesTerritoryKey
WHERE D.CalendarYear IN ('2010', '2011')
GROUP BY T.SalesTerritoryCountry
UNION ALL
SELECT
NULL, NULL, NULL,
SUM(F.SalesAmount) AS SalesAmount
FROM
dbo.FactResellerSales F
INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey
INNER JOIN dbo.DimSalesTerritory T ON
F.SalesTerritoryKey = T.SalesTerritoryKey
WHERE D.CalendarYear IN ('2010', '2011')
;
This second form of the query is of course much longer, and the query plan generated
for it is less efficient, with a run time about 2.5 times longer than the GROUPING SETS
form.
Query Tuning
If you are having a performance problem with a query in your data warehouse on a
large SMP system, you can often improve performance by tuning the query by
modifying it so that it runs faster. Before you resort to tuning a query, make sure you
have appropriate indexes and your statistics are up to date. Also, make sure you have
chosen your hardware (processors, memory, and storage) and configured it according
to the guidelines outlined earlier in this paper. If you have good index structures and
fresh statistics, and your hardware is up to the task of running the query in the desired
response time, first look at the query and make sure you are using the guidelines
outlined in the preceding section, Query Design.
If you still have a performance problem, you may need to tune the query. There are
two main classes of query plans for star join queries typical in data warehouses: scan
plans and seek plans. Scan plans scan a range of the fact table and join the resulting
rows with one or more dimensions. Seek plans find rows from dimensions that qualify,
and then seek into the fact table to find joining rows, via clustered or nonclustered
indexes on the dimension key columns of the fact table.
Seek plans are typically the most efficient for highly selective queries where a tiny
percentage of rows qualify. Scan plans typically are the most efficient when a relatively
large fraction of the rows qualify.
If you are getting poor performance for a query even though everything else checks
out, the main categories of problems to consider are these:
• Plan choice: The query plan is not efficient enough and you need a better plan (independent of parallelism).
• Parallel performance: The query plan is a parallel plan, but the full parallel power of the computer is not being utilized effectively.
You can tell that parallel performance is a problem if you observe the following:
1. You are getting a serial plan but runtime is too long. You can tell the plan is
serial if the plan does not have any parallelism (exchange) operatorsxxxi.
2. You run the query with degree of parallelism K and during a significant part of
execution, fewer than K cores are shown to be busy in Windows Task Manager
under the Performance tab.
3. In the STATISTICS XML plan or runtime .sqlplan file generated from graphical
showplan, the RunTimeCountersPerThread information shows that the
ActualRows count for each thread is skewed so that a small number of threads
are doing most of the work.
For an example that illustrates both symptoms 2 and 3 above, if you run a query that
does a nested loop join between a Date dimension and a partitioned fact table, and only
three different Date rows qualify, you may see RunTimeCountersPerThread
information that looks similar to this:
<RunTimeInformation>
<RunTimeCountersPerThread Thread="7" ActualRows="0"
ActualEndOfScans="1" ActualExecutions="1" />
<RunTimeCountersPerThread Thread="5" ActualRows="0"
ActualEndOfScans="1" ActualExecutions="1" />
<RunTimeCountersPerThread Thread="6" ActualRows="0"
ActualEndOfScans="1" ActualExecutions="1" />
<RunTimeCountersPerThread Thread="4" ActualRows="0"
ActualEndOfScans="1" ActualExecutions="1" />
<RunTimeCountersPerThread Thread="3" ActualRows="0"
ActualEndOfScans="1" ActualExecutions="1" />
<RunTimeCountersPerThread Thread="2" ActualRows="49153977"
ActualEndOfScans="1" ActualExecutions="1" />
<RunTimeCountersPerThread Thread="1" ActualRows="51723098"
ActualEndOfScans="1" ActualExecutions="1" />
<RunTimeCountersPerThread Thread="8" ActualRows="56013568"
ActualEndOfScans="1" ActualExecutions="1" />
<RunTimeCountersPerThread Thread="0" ActualRows="0"
ActualEndOfScans="0" ActualExecutions="0" />
</RunTimeInformation>
It is normal for thread 0 to have zero actual rows. The other threads ideally will have
roughly the same number of rows. In this case, five threads have zero ActualRows
counts. This pattern usually arises from a nested loop join that has a small number of
outer rows qualifying (equal to the number of threads doing work). You can work
around the problem by modifying the query to force the use of a hash join.
For example consider a query like the following (based on the Project REAL schema xxxii):
select COUNT(*)
from Tbl_Fact_Store_Sales f, Tbl_Dim_Date d
where f.SK_Date_ID = d.SK_Date_ID
and d.Date_Value in
(N'December 1, 2011', N'December 2, 2011', N'December 3, 2011')
With a large fact table, this query typically gets a parallel nested loop join query plan.
That plan will use at most three threads because there are only three qualifying outer
rows. To improve parallelism, you can rewrite the query to the following equivalent
form:
select Date_Value, COUNT(*)
from Tbl_Fact_Store_Sales f, Tbl_Dim_Date d
where f.SK_Date_ID = d.SK_Date_ID
and f.SK_Date_ID between 20111201 and 20111203
group by Date_Value
option (hash join)
This query gets the same answer as the previous one but it eliminates the nested loop
join and can be fully parallelized—it can use all the processor cores on the computer.
The important changes from the original query are that the range predicate is now
expressed directly on the fact table and not on the dimension table, and that the query
hint option(hash join) has been added. The BETWEEN predicate on the fact table
allows a parallel range seek to be done against the fact table to find the rows for the
specified days. The use of a hash join avoids the limit of one thread per outer row faced
with a loop join.
If the number of days of data that qualify is greater than or equal to the maximum
degree of parallelism at which the query executes, filtering the fact table by using a join
with the date dimension typically will perform well, and this type of query tuning is not
needed.
Forcing Seek Plans When Scan Is Chosen and Vice Versa
As mentioned earlier, in a data warehouse star join query two main types of plans are
often possible: scan plans and seeks plans. Depending on the selectivity of the filters on
dimensions, there is a crossover point where a scan plan’s execution time equals that of
an equivalent seek plan, as illustrated below:
Figure 2. Scan-seek cost crossover as a function of join selectivity (fraction of
rows from fact table that qualify after joining with filtered dimension)
To the left of where the curves cross, a seek plan is best. To the right, a scan plan is
best. The query optimizer’s internal cost estimates for plans are quite good, but they
are not perfect. If the cost characteristics of a query place it near the crossover point
for actual scan versus seek execution time, the plan chosen by the optimizer could be a
scan when a seek would execute faster or vice versa.
The following example based on the AdventureWorksDW database illustrates this:
use AdventureWorksDW;
select d.ModelName, SUM(f.SalesAmount)
from FactInternetSales f
, DimProduct d
where d.ProductKey = f.ProductKey
and d.ModelName like N'B%'
group by d.ModelName;
The plan for this query obtained by default by the optimizer looked like this in one of
our tests (it is a scan plan because it scans the fact table rather than using the
nonclustered index on ProductKey on the fact table):
Figure 3. Scan plan for sample query (scans fact table and performs hash join
with dimension(s))
The following query has been modified to include hints to force a plan that finds the
dimension table rows that qualify, then seeks into the nonclustered index on the key of
that dimension on the fact table, and finally looks up the qualifying rows of the fact
table:
select d.ModelName, SUM(f.SalesAmount)
from FactInternetSales f with(forceseek)
, DimProduct d
where d.ProductKey = f.ProductKey
and d.ModelName like N'B%'
group by d.ModelName;
Following is a fragment of this plan (the aggregation part is excluded):
Figure 4. Seek plan for sample query
This seek plan actually executes several times faster than the equivalent scan plan
above. If you think you may need a seek plan but you see a scan plan, you can force a
seek plan by using a forceseek hint similar to that shown above. Optionally, you may
also wish to use an index hint to force seeking into a specific index.
Similarly, if you are getting a seek-style plan (joining a dimension to a nonclustered
fact table index, then looking up the associated fact table rows), you can convert it into
a plan that scans all or a range of the fact table with hints. For example, consider this
query:
select d.ModelName, SUM(Fact.SalesAmount)
from Fact /*with(index=1)*/, #tempProd d
where Fact.ProductKey = d.ProductKey
and OrderDateKey between 10 and 200
group by d.ModelName;
In this example, Fact is a large fact table with a clustered index on OrderDateKey and a
nonclustered index on ProductKey. #tempProd is a small table with two Product
dimension rows in it. The default plan for this, without hints, is a seek plan that seeks
the ProductKey index on Fact. Adding the index=1 hint (commented in the example)
changes the plan so a range seek is performed on the Fact table’s clustered
OrderDateKey index and the results are hash-joined with #tempProd. In general,
consider using index=1 hint on the fact table, optionally with an option(hash join)
query hint, to force a scan plan if that is beneficial.
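For reference, here is the query above with the hints applied rather than commented out; this sketch forces a range seek on the clustered OrderDateKey index (index=1 refers to the clustered index) followed by a hash join with #tempProd:
select d.ModelName, SUM(Fact.SalesAmount)
from Fact with(index=1), #tempProd d
where Fact.ProductKey = d.ProductKey
and OrderDateKey between 10 and 200
group by d.ModelName
option(hash join);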
One important consideration regarding using scan versus seek plans on a large SMP is
related to the costing of seek operations. Because it assumes that a seek requires a
random I/O (the worst case), the query optimizer tends to over-cost seeks both (a) when
the I/O system is very fast, and (b) when all relevant data fits in the main memory buffer pool.
In other words, scan plans may be chosen instead of seeks in these situations, even
when seeks would be better. If you are seeing scan plans and you are not getting the
performance you need, and (a) or (b) apply to you, consider forcing a seek plan with
hints.
ETL
Scale-up systems such as the HP Integrity Superdome can provide very fast index
construction capability due to their high I/O performance and parallel processing
capability. An important part of ETL is index maintenance. Index creation and
maintenance both can be time consuming. Depending on the amount of data you will
change or add during your ETL, you may wish to modify the existing indexes in place
(by using DML statements) or drop the indexes, do your load, and then rebuild the
indexes. If you are changing a very large number of rows, the drop-and-rebuild
approach is typically better. Otherwise, maintaining the indexes in place is preferred.
Consider experimenting with both approaches to find the best one for your situation.
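As a minimal sketch of the drop-and-rebuild approach (the table and index names are illustrative), you can disable the nonclustered indexes on the fact table, run the load, and then rebuild them; rebuilding a disabled index re-enables it:
-- Disable nonclustered indexes before a very large load
ALTER INDEX IX_FactSales_ProductKey ON dbo.FactSales DISABLE;
ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales DISABLE;

-- ... perform the data load here ...

-- Rebuild (and thereby re-enable) the indexes after the load
ALTER INDEX IX_FactSales_ProductKey ON dbo.FactSales REBUILD;
ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales REBUILD;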
When it comes to loading and maintaining a data warehouse by using the ETL
approach, SQL Server 2008 R2 and SQL Server 2008 R2 Integration Services provide
some definite advantages over the SQL Server 2005 release. In this section we briefly
highlight some of the new features and how to take advantage of them to speed up
your ETL processes.
Pipeline Performance
SQL Server Integration Services (SSIS) enables you to construct ETL solutions by using
an interactive development environment with the required business functionality
represented visually as a directed graph composed of linked data sources,
transformations, and destinations. At run time, the SSIS engine converts this
abstraction into a series of pipelines that move the data down the graph, while keeping
the data in memory buffers (to avoid staging) and minimizing the number of data-copy
operations. Because of these optimizations, the SSIS engine can efficiently process data
even within an arbitrarily complex graph.
In SQL Server 2005 the pipeline execution engine allocates worker threads in an
affinitive manner, meaning that once a thread has finished its work it is not able to take
on any other pending work. This means that ETL solutions often reach a plateau of
scalability, beyond which further performance gains are difficult to achieve.
In SQL Server 2008 R2, the resource assignment uses a thread-pooling mechanism
whereby a thread can be dynamically assigned to a different task as soon as it has
completed its current work. This feature is known as Pipeline Performance since the
net result is that the SSIS pipeline engine makes more optimal use of the available
hardware, leading to predictable performance, increased parallelism, faster load times,
and a higher return on investment. The performance improvement from this
enhancement can be impressive: speedups of a factor of four are not uncommon for the
same package running on the same hardware under SQL Server 2008 R2 versus
SQL Server 2005. A second benefit is that you need to
spend less time manually tuning the solution, which means higher productivity.
Lookup Performance
The Lookup component in SSIS tests whether each row in a stream of rows has a
matching row in another dataset. A lookup is akin to a database join operation
(specifically a relational hash join); the key difference is that the comparison takes
place outside the context of the relational engine. Data from one source is hashed into
an in-memory cache within a Lookup component and data from the other source is
streamed past the Lookup so that specified keys can be compared. This is useful
because the two datasets may come from completely different (and possibly nonrelational) data sources.
Lookup in SQL Server 2008 R2 Integration Services has several major advantages over
the previous version:

The in-memory cache can be loaded from any source, unlike Integration Services in
SQL Server 2005, which only permitted OleDb sources. For instance, the cache can
now be loaded from a flat file, ADO.NET, OleDb, ODBC, a Visual Basic®.Net or C#®
script, or even a separate pipeline. This lessens the amount of staging in a solution,
and increases the ultimate flexibility of the solution.

The cache content can be located in virtual memory or persisted to permanent file
storage. This means that within the same package, multiple lookup components can
share the same cache. If the cache is saved to disk, it can be shared across
different packages. The cache file format is optimized for speed and access to it can
be orders of magnitude faster than reloading the reference dataset from the original
relational source.

Lookup introduces the miss-cache feature, which saves time by loading into cache
the key values that have no matching entries in the reference dataset. This avoids
redundant and expensive trips to the database. For instance, if a value is requested
from the Lookup transformation but it is not found, the next time that value is
requested the Lookup does not try to locate the value; instead it immediately
returns a ‘not found’ result. The miss-cache feature alone can contribute up to a
40 percent performance improvement in some scenarios. In addition to the partial
caching previously described, Lookup also supports full (all data values are loaded
into the cache upfront) and no caching of key values. Full caching requires that the
entire key value set fit into memory. The no caching option is only recommended if
it is evident that there will hardly ever be any repeated lookups.
Other enhancements to the lookup component include optimized I/O routines leading to
faster cache loading and lookup operations. Lookup is also easier to use and program
thanks to its more intuitive user interface, improved error handling and reporting, and
the ability to make changes to query statements at run time. Compared to Lookup in
the previous version of SSIS, there are now more options that provide better control
and hence the potential for improved performance. Users are encouraged to take
advantage of the ability to reuse the lookup cache across packages and choose a
specific caching option (full, partial, no caching) for best results.
ETL Features in the SQL Server Engine
Though the following features are not provided directly by SSIS—but rather by the SQL
Server 2008 R2 Database Engine—they are directly usable by SSIS and thus help
deliver high performance ETL solutions with the programming convenience and
flexibility of SSIS. They are also useful with custom ETL solutions.
Change Data Capture
One of the biggest performance bottlenecks in an ETL solution can be the identification
and application of the set of rows that changed during some time period, known as the
change set or delta. In many operational systems, there is no direct way to identify
which rows were changed within the last 24 hours (or however long the batch window
is) and so the question of which rows comprise the last day’s operations is difficult to
answer. Even if the rows are tagged with a change timestamp, they are often not
tagged with the operation that changed them—whether a specific change was an
update, insert or delete. The ETL developer is forced to come up with imaginative ways
of dealing with this issue. Often the source system is augmented with timestamps,
triggers, and other database objects to track changes, but these changes can be
intrusive and affect the operational behavior and performance of the source.
SQL Server 2008 introduced a compelling feature to mitigate this problem. Change
data capture uses an asynchronous log reader mechanism to propagate changes at
the table grain to capture instance tables in a transactionally consistent manner,
through windowed API functions:

Asynchronous Log Reader: Whenever a change is made to the database, the
engine first writes an entry to the database log before writing it to the database
store. This is an existing SQL Server feature that change data capture uses to good
effect. A change data capture log reader job can be scheduled to run at regular
intervals or during server idle time. The log reader trawls the logs and copies the
changes into a configured set of tables along with metadata to help identify when
and how the original changes were made. Because the reader operates
asynchronously, the cost on the source system can be amortized over a longer
period (or moved to non-operational hours such as 3 A.M.). Also, the mechanism
requires no changes to the schema of the source system (though it needs to be
hosted on SQL Server 2008 or SQL Server 2008 R2).

Table grain: Change data capture functionality can be switched on or off at the
table level. This means the solution can be implemented incrementally—change data
capture is not an all-or-nothing proposition. When a table is enabled for change data
capture, the database automatically generates a capture instance table (also called
a shadow table) that contains the data changes along with extra metadata to audit
the operations. If the shape of the source table changes (for instance a column is
added or deleted) the technology can continue to deliver a stable shape in order to
not break any downstream processes while they are updated.

Transactionally-consistent: Since the change data capture feature relies on the
database log, it is fully aware of transactions and thus can provide related changes
across multiple tables correctly.

Windowed API functions: A host of functions are available to query and manage
the capture instance tables. For instance, the APIs can be used to select all the data
changes from the last batch window (such as 24 hours), with each row tagged with
its operation (insert, update, or delete) as well as a mask identifying which specific
columns changed.
By using the change data capture feature, ETL solutions can be designed to be much
more robust and deliver increased performance and scalability. Lowering the cost of
change capture by using change data capture can lower ETL processing time
significantly, shortening your batch window, increasing availability of your data
warehouse, and reducing ETL hardware costs. Change data capture can also enable
“trickle feed” scenarios without the need for changes to your operational applications.
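As a minimal sketch (the dbo.Orders source table and its default capture instance name dbo_Orders are illustrative), change data capture is enabled for the database and for each table with system stored procedures, and a window of changes is then read through the generated table-valued function:
-- Enable change data capture for the database and for one source table
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'Orders',
    @role_name = NULL;  -- no gating role in this sketch

-- Later, read all changes made to dbo.Orders within an LSN window
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_Orders');
DECLARE @to_lsn binary(10) = sys.fn_cdc_get_max_lsn();
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');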
Merge
The MERGE statement (also known as upsert) enables the database to apply a source
rowset containing a mixture of inserted, updated, and deleted rows against a
destination table within a single operation. The merge clause enables you to specify
how each operation type should occur—including advanced semantics such as SCD-2
(slowly changing dimension Type 2) behavior.
The common problem that this operator solves is that when the source system provides
a batch of rows, some of which are new (inserts) and some of which already exist in the
destination but have different values (updates), there is no way to distinguish the
respective types. One solution would be to use an SSIS Lookup operation to figure out
which rows were new and which already existed. The easiest option to accomplish this
is by using the Slowly Changing Dimensions wizard available in SSIS 2008 R2, which
automatically configures the proper packages for performing the merge operation.
However, this approach may not scale well since the Lookup component requires that
one side of the join is explicitly cached in memory. Although that sounds fine for simple
cases, if the data to be cached is large and/or wide, the overhead and time of physically
loading the data into memory as well as the hardware requirements might make this
approach untenable. The other solution would be to join the source and destination
within the context of a relational update operation (thereby updating the existing rows)
followed by an insert operation that uses a left outer join to determine which rows to
insert (to insert the new rows). The bottleneck in this solution is that the two operations
must be done serially and you incur the expense of two joins.
In SQL Server 2008 R2, the merge operator solves this type of issue. The merge
operator enables you to specify how each row from the source should be treated in
relation to the destination. Since it uses only one (internal) join, it offers increased
performance, and coupled with granular syntax options it delivers a powerful solution.
For example, the following Transact-SQL script based on the AdventureWorks and
AdventureWorksDW sample databases shows how a Customer dimension table can be
updated or have rows inserted into it by using a single MERGE statement.
USE AdventureWorksDW;
GO
MERGE dbo.DimCustomer AS [Dest]
USING (
SELECT
ContactID,
N'AW' + RIGHT(N'0000000' + CONVERT(NVARCHAR(10), ContactID), 8) AS [Key],
FirstName,
LastName,
Phone
FROM AdventureWorks.Person.Contact
) AS [Source]
ON [Dest].CustomerAlternateKey = [Source].[Key]
WHEN MATCHED THEN UPDATE SET
[Dest].FirstName = [Source].FirstName,
[Dest].LastName = [Source].LastName,
[Dest].Phone = [Source].Phone
WHEN NOT MATCHED BY TARGET THEN INSERT (
GeographyKey, CustomerAlternateKey, FirstName,
LastName, Phone, DateFirstPurchase
) VALUES (
1, [Source].[Key], [Source].FirstName,
[Source].LastName, [Source].Phone, GETDATE()
);
Minimally Logged Insert
Minimally logged insert can have a large impact on the load performance of the ETL
process. Rather than writing data to the log and then to disk as is required for recovery
under write-ahead logging, it is possible to write the data to disk only once if full
recovery is not desired, for example, when inserting large amounts of data into a table,
such as during ETL.
Minimal logging was introduced in SQL Server 2005 and refers to logging only the
information that is required to roll back the transaction without supporting point-in-time
recovery. Minimal logging is only available under the bulk logged and simple recovery
models. Operations that can be minimally logged in SQL Server 2005 include bulk
import operations, SELECT INTO, and index creation and rebuild. SQL Server 2008
extended the optimization to INSERT INTO…SELECT FROM Transact-SQL operations
that insert a large number of rows into an existing table under either of the following
conditions:

Inserting into an empty table that has a clustered index and no non-clustered
indexes

Inserting into a heap that has no indexes but that can be non-empty
Minimal logging greatly improves large-scale INSERT operations by increasing
performance and reducing the amount of log space required. For a discussion of table
structure and DML statement requirements for minimal logging, see SQL Server Books
Online.
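For example, the following sketch (table names are illustrative) inserts into a heap that has no indexes under the simple or bulk-logged recovery model; the TABLOCK hint on the target is typically also required for INSERT INTO…SELECT FROM to be minimally logged:
-- Assumes the database uses the SIMPLE or BULK_LOGGED recovery model
-- and that dbo.FactSalesStage is a heap with no indexes
INSERT INTO dbo.FactSalesStage WITH (TABLOCK)
SELECT *
FROM dbo.FactSalesLoadBuffer;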
Periodic Maintenance
This section covers the maintenance tasks that need to be done periodically in order to
keep a SQL Server installation running smoothly. This includes, but is not limited to, the following
tasks:

Index maintenance

Statistics maintenance

Database consistency checks
For the largest databases, the maintenance tasks are often the most challenging part of
operations; they can be more difficult than tuning the queries. And because maintenance
tasks can parallelize over many CPUs, they may require more CPUs than you originally
estimated for your user workload.
Index Maintenance
Indexes on tables that are frequently modified (via insert, update, and delete
statements) need to be rebuilt occasionally. It is recommended that you rebuild your
indexes before they reach 50 percent fragmentation. If your table is too big to
constantly rebuild your indexes, consider partitioning the table and indexes so you can
rebuild one index partition at a time. If that is not feasible, at least reorganize the indexes;
this does not reclaim the unused space but it does reorder the pages. Use the
sys.dm_db_index_physical_stats dynamic management function to check for the
amount of fragmentation in a table or partition.
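The following sketch (the table name, index name, and partition number are illustrative) checks fragmentation and then either rebuilds one partition of an index or reorganizes the index in place:
-- Check fragmentation for the indexes on a fact table
SELECT index_id, partition_number, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID(N'dbo.FactSales'), NULL, NULL, 'LIMITED');

-- Rebuild a single partition of an index on a partitioned table
ALTER INDEX IX_FactSales_ProductKey ON dbo.FactSales REBUILD PARTITION = 3;

-- Or reorganize the index in place if a full rebuild is not feasible
ALTER INDEX IX_FactSales_ProductKey ON dbo.FactSales REORGANIZE;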
Index maintenance is a resource-intensive process if you have one or more indexes on
a large fact table. Scale-up systems such as the HP Integrity Superdome can be very
useful for this process because they can speed up the index maintenance operations,
shortening your maintenance window.
Statistics Maintenance
Statistics maintenance is critical for getting the best query plans for SQL Server 2008
R2. For most databases, it is best to leave the default of having Auto Create
Statistics and Auto Update Statistics turned on. Even with these on, you still need
to periodically update the statistics on your most frequently modified tables. Create
jobs that will run periodically, perhaps once a week.
Statistics are still by table, not by partition. SQL Server does not yet have the ability to
manage statistics by partition.
Use the sampling rate that will give you the best plans. Most databases will be fine with
the default sampling rate. You may want to change the sampling rate if you start
getting bad query plans and discover that increasing the sampling rate gives better
query plans. In this case, you need to experiment with higher rates until you start
getting better plans. Start with FULLSCAN as the sampling rate and use it unless
statistics gathering takes too much time. If you do not have time to do FULLSCAN,
reduce the sampling rate.
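For example (the table name is illustrative), a weekly job might refresh statistics on your most heavily modified fact table with a full scan, falling back to a fixed sampling rate if FULLSCAN takes too long:
-- Preferred: refresh all statistics on the table with a full scan
UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;

-- Alternative if FULLSCAN takes too long: sample a fixed percentage of rows
UPDATE STATISTICS dbo.FactSales WITH SAMPLE 25 PERCENT;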
Backup
A good backup strategy is essential for a well-managed data center. Use hardware
snapshot backups whenever possible to get near instantaneous backups. If your
hardware does not have that feature, use the compressed backups from SQL
Server 2008 R2 or any of the third-party vendors that supply compressed backup
software. The compression will save you valuable disk space and is faster than the
regular backup.
If your database is very large, striped backups to multiple files and tapes are best for
increasing speed. The number of files you use depends on the saturation rate of your
disk subsystem.
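A sketch of a compressed backup striped across multiple files (the database name and file paths are illustrative):
BACKUP DATABASE SalesDW
TO DISK = N'G:\Backup\SalesDW_1.bak',
   DISK = N'H:\Backup\SalesDW_2.bak',
   DISK = N'I:\Backup\SalesDW_3.bak'
WITH COMPRESSION, CHECKSUM, STATS = 5;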
Although a complete discussion of backup and disaster recovery strategies is beyond
the scope of this paper, it is best to use Microsoft Data Protection Manager in your
overall data center protection plans. See the product details at
http://www.microsoft.com/systemcenter/dpm/default.mspx.
For more discussion on these topics, see the following sites:

SQL Server Customer Advisory Team Best Practices

SQL Server Customer Advisory Team Blog

SQL Server Customer Advisory Team Articles
Internal Scale Testing During SQL Server 2008
Development
During the development of the SQL Server 2008 release, Microsoft conducted extensive
internal performance testing on a star schema data warehouse. This testing, and the
feedback it provided to the product development team, helped us to achieve sizeable
performance gains on large HP SMP systems compared with SQL Server 2005. HP
worked closely with Microsoft to make these tests possible, providing a 64-core HP
Integrity Superdome for the tests. The results observed also apply to SQL Server 2008
R2.
Workload
Microsoft developed a set of 123 queries to run against the test database. The queries
include:
1. A modified subset of queries from a draft proposal for the TPC-DS benchmarkxxxiii, a
realistic, modern data warehouse benchmark, modified to run against the schema of
the test database,
2. Modified customer queries, and
3. A collection of queries to examine specific additional scenarios (e.g. table scans, and
GROUPING SETS queries).
The experiments were conducted at two different scale points: 600 GB and 6 terabytes.
The data was artificially scaled to these two different sizes. The largest fact table in
each configuration contains 2.5 billion and 25 billion rows, respectively.
Hardware Configuration
The primary hardware we used for internal scale testing for SQL Server 2008
development centered around a 64-core HP Integrity Superdome furnished by HP to
assist with development of the software release. The Superdome was split into three
partitions as follows:
Name    | Testing purpose                                                          | Cores | Memory | Total available storage
ASDune  | Relational data warehousing with SQL Server Database Engine             | 32    | 128 GB | 26-terabyte RAID 5 (Data/Temp); 7-terabyte RAID 10 (Backup); 2-terabyte RAID 10 (Log)
ASDune1 | Analysis Services                                                        | 16    | 256 GB | 7-terabyte RAID 10
ASDune2 | OLTP with SQL Server Database Engine, and ETL with Integration Services | 16    | 64 GB  | 7-terabyte RAID 5
We used direct-attached storage for the tests, although for ease of management, you
may prefer to use a SAN in your environment. The I/O throughput on the partition of
the system named ASDune has the following characteristics, as measured by using
SQLIO with a 64-KB block size:
                 | Sequential reads | Random reads | Sequential writes | Random writes
Data (26 LUNs)   | 4100 MB/sec      | 2170 MB/sec  | 1280 MB/sec       | 647 MB/sec
Backup (13 LUNs) | 2675 MB/sec      | 1750 MB/sec  | 1130 MB/sec       | 800 MB/sec
For additional test runs on the 600GB database, we also used the following computer, a
four-processor, quad-core HP DL585:
Name     | Testing purpose                             | Cores | Memory | Total available storage
DWSDL585 | Relational data warehousing with SQL engine | 16    | 64 GB  | 2 terabyte, direct attached
The I/O capability of DWSDL585 is as follows:
                    | Sequential reads | Random reads | Sequential writes | Random writes
Data (HP SA P800)   | 320 MB/sec       | 35 MB/sec    | 38 MB/sec         | 31 MB/sec
Tempdb (HP SA P400) | 185 MB/sec       | 10 MB/sec    | 11 MB/sec         | 8 MB/sec
Performance Results
We divided the 123 queries into five groups, from fastest to slowest (Group 1 was the
fastest and Group 5 the slowest), based on their runtimes on a baseline test on SQL
Server 2005 SP2. Following is a description of performance results. All queries were run
“cold start”—the buffer pool was emptied by using DBCC DROPCLEANBUFFERS before
each query. The performance improvement for SQL Server 2008 is likely to
be even greater in the case of “warm start” tests that do not clear the buffer pool,
because compression allows a larger percentage of the working set to remain in
memory and reduces I/O.
Computer name: ASDune
Database size (uncompressed): 6 terabytes
Baseline version: SQL Server 2005 SP2
Compared version: SQL Server 2008 internal build, January 2008 (close to RC0 version), with page compression enabled on fact tables
Geometric mean of query response time improvement over all queries (no change = 100%; higher is better): 254%
Geometric mean of query response time improvement for shortest to longest running groups of queries:
  Group 1 (shortest): 108%
  Group 2: 309%
  Group 3: 469%
  Group 4: 268%
  Group 5 (longest): 174%
Sum of total query runtime for the compared version as a percentage of the baseline version, excluding queries that did not complete within a timeout period on either version: 60% (i.e. on average SQL Server 2008 query response times were only 60% of those on SQL Server 2005)
Computer name: DWSDL585
Database size (uncompressed): 600 GB
Baseline version: SQL Server 2005 SP2
Compared version: SQL Server 2008 internal build, January 2008 (close to RC0 version), with page compression enabled on fact tables
Geometric mean of query response time improvement over all queries (no change = 100%; higher is better): 202%
Geometric mean of query response time improvement for shortest to longest running groups of queries:
  Group 1 (shortest): 122%
  Group 2: 192%
  Group 3: 252%
  Group 4: 335%
  Group 5 (longest): 176%
Sum of total query runtime for the compared version as a percentage of the baseline version, excluding queries that did not complete within a timeout period on either version: 37% (on average SQL Server 2008 query response times were only 37% of those on SQL Server 2005)
The geometric mean of the query runtime change for each group was computed as the
geometric mean of the ratio of query runtimes, 𝑅(𝑖), between the baseline version and
the compared version. Specifically, for a query number 𝑖,
𝑅(𝑖) = BaselineRuntime(𝑖)/VersionRuntime(𝑖)
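The figure reported for a group of n queries is then the geometric mean of these ratios:
GeometricMean = (R(1) × R(2) × … × R(n))^(1/n)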
The geometric mean is a special type of average that tends to dampen the effect of
very large and small valuesxxxiv.
What does this mean for you? SQL Server 2008 and SQL Server 2008 R2 are about
twice as fast as SQL Server 2005 for a demanding I/O-bound data warehouse query
workload on the same hardware. The only different feature explicitly used was page
compression. No application changes were required. For a system that was I/O bound in
an earlier version of SQL Server and is no longer I/O bound with SQL Server 2008 or
SQL Server 2008 R2 due to the effects of compression, even greater improvement may
be observed.
We have described repeated, routine scale tests on 600-GB and 6-terabyte data
warehouse configurations. By using the scaling guidelines described in this paper, and
large-scale SMP systems like the HP Integrity Superdome with a high-performance I/O
subsystem and large main memory, or an 8-Processor HP Fast Track Data Warehouse
configuration, SQL Server 2008 and 2008 R2 can readily scale to handle data
warehouse installations of 30 terabytes and beyond.
Conclusion
Scaling up your data warehouse on a large SMP with SQL Server 2008 R2 provides
simple system management, high performance, and resilience to hardware failure. New
features introduced in SQL Server 2008, including partitioned table parallelism, nested
loop join parallelism enhancements, enhanced star joins, and data compression can
give impressive speedups on large SMPs by better using the available CPUs, memory,
and I/O capacity. Many customers can expect their performance to nearly double on the
same hardware, merely by upgrading to SQL Server 2008 R2 from SQL Server 2005 or
earlier, and compressing their fact tables. Given the increased number of cores and
main memory available at the same price point since SQL Server 2005 was released,
those customers setting up a complete new hardware/software system running SQL
Server 2008 R2 can expect major performance gains. What could SQL Server 2008 R2
and a state-of-the-art SMP computer like an HP Integrity Superdome or an 8-processor
HP Fast Track Data Warehouse system mean for you, your data warehouse installation,
and your business users?
For more information:

SQL Server Web site

SQL Server TechCenter

SQL Server Developer Center
Did this paper help you? Please give us your feedback. Tell us, on a scale of 1 (poor) to
5 (excellent), how you would rate this paper and why you have given it this rating. For
example:

Are you rating it high due to having good examples, excellent screenshots, clear
writing, or another reason?

Are you rating it low due to poor examples, fuzzy screenshots, unclear writing?
This feedback will help us improve the quality of the white papers we release. Send
feedback.
References
i. Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), Wiley Publishing, Inc., April 26, 2002.
ii. Joy Mundy, Warren Thornthwaite, and Ralph Kimball, The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset, Wiley Publishing, Inc., Feb 13, 2006.
iii. SQL Server Fast Track Data Warehouse, http://www.microsoft.com/sqlserver/2008/en/us/fasttrack.aspx.
iv. SQL Server Parallel Data Warehouse, http://www.microsoft.com/sqlserver/2008/en/us/parallel-data-warehouse.aspx.
v. Tuning SQL Server 2005 on HP Integrity Servers, Hewlett Packard Co., http://docs.hp.com/en/8875/WIE-SQLTuning-1006-00.pdf, 2005.
   HP reference configuration for Business Intelligence: HP Integrity Superdome servers with 64 Intel Itanium 2 processors and Microsoft SQL Server 2005, http://h71028.www7.hp.com/ERC/downloads/4AA0-8979ENW.pdf, 2005.
   Best practices for Microsoft SQL Server 2005 on HP Integrity Superdome servers for Very Large Database (VLDB) BI solutions, http://h71028.www7.hp.com/ERC/downloads/4AA0-8981ENW.pdf, 2005.
vi. Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), Wiley Publishing, Inc., April 26, 2002.
vii. Eric N. Hanson et al., An Introduction to New Data Warehouse Scalability Features in SQL Server 2008, December 2007.
viii. Third Normal Form, Wikipedia, http://en.wikipedia.org/wiki/Third_normal_form, 2008.
ix. Fast Track Data Warehouse 2.0 Architecture, http://technet.microsoft.com/en-us/library/dd459178(SQL.100).aspx.
x. Eric N. Hanson, Improving Performance with SQL Server 2005 Indexed Views, May, 2005.
xi. Christopher Adamson, Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance, Wiley Publishing, Inc., 2006.
xii. Eric N. Hanson et al., An Introduction to New Data Warehouse Scalability Features in SQL Server 2008, December 2007.
xiii. Christopher Adamson, Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance, Wiley Publishing, Inc., 2006.
xiv. Sunil Agarwal, Data Compression in SQL Server 2008, white paper in progress, 2008.
xv. Ibid.
xvi. Ibid.
xvii. Eric N. Hanson et al., An Introduction to New Data Warehouse Scalability Features in SQL Server 2008, December 2007.
xviii. SQL Server Customer Advisory Team Best Practices, 2008.
   SQL Server Customer Advisory Team blogs.
   SQL Server Customer Advisory Team main site.
xix. Setting Server Configuration Options, SQL Server 2008 Books Online.
xx. Ibid.
xxi. Don Pinto et al., Understanding and Controlling Parallel Query Processing in SQL Server, http://download.microsoft.com/download%2FB%2FE%2F1%2FBE1AABB3-6ED8-4C3C-AF91448AB733B1AF%2FSQL2008R2_Parallel_QP_Understanding_and_Controlling.docx, 2010.
xxii. Ibid.
xxiii. SQL Server 2008 Resource Governor Best Practices, white paper in progress, 2008.
   Resource Governor Concepts, SQL Server 2008 Books Online.
xxiv. Resource Governor Concepts, SQL Server 2008 Books Online.
xxv. Ibid.
xxvi. SQL Server 2008 Resource Governor Best Practices, white paper in progress, 2008.
xxvii. Windows System Resource Manager, 2003.
xxviii. Eric N. Hanson and Lubor Kollar, Statistics Used by the Query Optimizer in SQL Server 2005, 2005.
xxix. Michael Howard and David LeBlanc, Writing Secure Code, 2nd edition, Microsoft Press, 2003, pg. 399.
xxx. Microsoft SQL Server Product Samples: SQL Server Sample Databases, http://www.codeplex.com/MSFTDBProdSamples, 2008.
xxxi. Parallel Query Processing, SQL Server 2008 Books Online, 2008.
xxxii. Len Wyatt, Project REAL: Technical Overview, 2005.
xxxiii. TPC Benchmark DS (Decision Support), Draft Specification, Revision 32, Transaction Processing Council (TPC), http://www.tpc.org/tpcds/default.asp, 2008.
xxxiv. Geometric mean, http://en.wikipedia.org/wiki/Geometric_mean.