SQL Server 2012 AlwaysOn: Multisite Failover Cluster Instance
SQL Server Technical Article
Writers: Mike Weiner, Sanjay Mishra, Min He
Contributors: Lingwei Li, Mike Anderson (EMC Corporation)
Technical Reviewers: Shaun Tinline-Jones, Steve Howard, Prem Mehra, Paul Burpo, Mike
Ruthruff, Jimmy May, Matt Neerincx, Dan Benediktson, Michael Steineke (Edgenet Inc.), David
P. Smith (ServiceU Corporation)
Published: December 2011
Applies to: SQL Server 2012
Summary: SQL Server Failover Clustering, which includes support for both local and multisite
failover configurations, is part of the SQL Server 2012 AlwaysOn implementation suite,
designed to provide high availability and disaster recovery for SQL Server. The multisite failover
clustering technology has been enhanced significantly in SQL Server 2012. This paper focuses
on the multisite failover cluster architecture, the enhancements to this technology in SQL
Server 2012, and best practices to help with its deployment.
Copyright
This document is provided “as-is”. Information and views expressed in this document, including URL and
other Internet Web site references, may change without notice. You bear the risk of using it.
Some examples depicted herein are provided for illustration only and are fictitious. No real association
or connection is intended or should be inferred.
This document does not provide you with any legal rights to any intellectual property in any Microsoft
product. You may copy and use this document for your internal, reference purposes.
© 2011 Microsoft. All rights reserved.
Contents
Introduction
SQL Server 2012 Multisite Failover Cluster - Architecture
Components Needed to Build a Multisite Failover Cluster
  Server Hardware and Operating System
  Storage
  Network
  Windows Server Failover Cluster (WSFC) Quorum Model
SQL Server Customer Lab Testing with Multisite Failover Cluster Improvements
Challenges, Mitigations, and Lessons Learned
  Storage Validation Check Requirement
  IP Address Configuration in Failover Cluster Manager with an OR Dependency
  Appropriate Quorum Model
  Network Registration and Client Connectivity after Multi-Subnet SQL Server FCI Failover
Conclusion
Appendix
  Lab Hardware and Software Environment
    Servers
    SQL Server
    Storage
    Storage Software
Introduction
This white paper discusses the enhancements made to multisite failover cluster technology in SQL
Server 2012 and considerations for deploying it. The flow of the content is as follows:
• A discussion of multisite failover clustering from an architectural perspective
• The components involved in deploying a multisite failover cluster
• A description of our lab tests. Lab testing was conducted with a prerelease build of SQL Server 2012, which provided an opportunity to observe failover scenarios and behavior in a multisite configuration.
• A discussion of the challenges, mitigations, and lessons learned to help with real-world deployments of this technology.
This testing was done with prerelease software. However, the functionality tested in the lab was close to
complete in this build; no major changes are expected in the final production release.
SQL Server 2012 Multisite Failover Cluster - Architecture
When you evaluate high availability options for a Microsoft SQL Server environment, you will find a
number of features within SQL Server that help applications meet your organization’s availability goals.
SQL Server failover clustering technology has been available in the product as a high availability strategy
for more than a decade. With SQL Server failover clustering, an instance of SQL Server runs on a
single node within the cluster at any point in time. If the SQL Server instance is unable to run on a node
for some reason (such as hardware failure), it can fail over to another node, providing high availability at
the SQL Server instance level.
Many businesses operate their data centers out of multiple locations, or they may have a secondary
data center in order to provide redundancy across sites as a disaster recovery mechanism. A primary
reason for doing this is protection from site failure, be it network, power, infrastructure, or another site
disaster. Many solutions have implemented Windows Server and SQL Server failover clustering with
this multisite model. A multisite failover cluster includes nodes that are dispersed across multiple
physical sites or data centers, with the aim of providing availability across data centers in the case of a
disaster at a site. Sometimes multisite failover clusters are referred to as geographically dispersed
failover clusters, stretch clusters, or multi-subnet clusters.
Currently, to deploy SQL Server 2008 R2 multisite failover clusters, you need to deploy the following
technologies in addition to SQL Server failover clustering:
• SAN replication and failover technology – to provide data replication and failover capabilities across sites.
• Stretch virtual LAN (VLAN) technology – to expose a single IP address that can fail over between sites if multiple subnets exist in the environment.
In Windows Server 2003, all cluster resource dependencies were AND dependencies. (Note: In Windows
Server 2003, failover clusters were known as server clusters.) For example, if the “SQL Server” resource
depended on the “IP Address” and “Disk 1” resources, the Windows cluster could bring the “SQL Server”
resource online only if both the “IP Address” and “Disk 1” resources were online first.
Windows Server 2008 introduced the ability to specify OR dependencies between resources; for more
information, see the blog entry Cluster Resource Dependency Expressions
(http://blogs.msdn.com/b/clustering/archive/2008/01/28/7293705.aspx). This addition means that you
can specify that the “SQL Server” resource depends on “Disk 1” AND (“IP Address1” OR “IP Address2”).
This configuration enables each site in a multi-subnet cluster to be registered with a different IP address,
while allowing the “SQL Server” resource to come online as long as at least one IP address is available to
bind to.
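The dependency expression the cluster evaluates is simply a Boolean condition over resource states. As a toy illustration only (not cluster code), the example above can be rendered in Python as follows:

    def can_bring_sql_online(disk1_online, ip1_online, ip2_online):
        # "SQL Server" depends on "Disk 1" AND ("IP Address1" OR "IP Address2")
        return disk1_online and (ip1_online or ip2_online)

    print(can_bring_sql_online(True, True, False))   # True: one online IP suffices
    print(can_bring_sql_online(True, False, False))  # False: no IP available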
SQL Server 2008 R2, however, does not take advantage of the OR dependencies between IP addresses
that Windows Server 2008 supports. In SQL Server 2008 R2 and previous versions, SQL Server iterates
through all IP addresses in the failover cluster resource group and attempts to bind to all of them during
startup. If any binding fails, SQL Server startup fails. Therefore, in SQL Server 2008 R2 and previous
versions, stretch VLANs were used to enable SQL Server multisite failover clustering.
Many customers, though, are reluctant to deploy stretch VLANs due to considerations such as security,
cost, complexity, or incompatibility with a corporate standard. This has been a key constraint limiting
the deployment of SQL Server multisite and multi-subnet clustering.
With SQL Server 2012, improvements have been made to the multisite, and specifically the multi-subnet,
failover clustering implementation. Two major enhancements were made to support multi-subnet
clustering:
• Cluster Setup support – Both AddNode (for integrated installation) and CompleteFailoverCluster (for advanced installation) can now intelligently detect a multi-subnet environment and automatically set the IP address resource dependency to OR.
• SQL Server Engine support – To bring the SQL Server resource online, the SQL Server Engine startup logic skips binding to any IP address that is not in an online state. The state of the IP addresses and the OR dependency configuration is shown in a figure in the “Challenges, Mitigations, and Lessons Learned” section, and a sketch of this startup logic follows this list.
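To make the engine-side behavior concrete, the following is a minimal Python sketch of the startup logic described above, under the assumption that the cluster exposes each IP Address resource’s state. It is an illustration of the logic only, not SQL Server’s actual implementation, and the addresses shown are hypothetical.

    import socket

    def bind_listeners(ip_resources, port=1433):
        """Sketch of SQL Server 2012 startup binding in a multi-subnet FCI.

        ip_resources: list of (ip_address, is_online) pairs, a hypothetical
        view of the cluster's IP Address resources and their current states.
        Pre-2012 logic tried to bind every address and failed if any bind
        failed; the 2012 logic sketched here skips offline addresses and
        fails only if no address at all can be bound.
        """
        bound = []
        for ip, online in ip_resources:
            if not online:
                continue  # skip the IP for the subnet this node is not on
            listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            listener.bind((ip, port))
            listener.listen(5)
            bound.append(listener)
        if not bound:
            raise RuntimeError("Startup fails: no online IP address could be bound.")
        return bound

    # Example: on a Site A node, only the Site A address is online.
    # bind_listeners([("10.0.1.50", True), ("10.0.2.50", False)])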
In the SQL Server Customer Lab, testing of this new feature was completed with a prerelease version of
SQL Server 2012. The rest of this document provides further context about what is required to configure
a multisite SQL Server failover cluster in SQL Server 2012 and documents the lab setup, testing, and
lessons learned from the engagement.
Components Needed to Build a Multisite Failover Cluster
When you build a multisite SQL Server failover cluster, there are a number of components to consider.
These components and other considerations are discussed here.
Server Hardware and Operating System
Hardware Configuration: The cluster hardware must be a supported configuration (Windows Server
2008 R2 or later) according to the guidelines listed in The Microsoft Support Policy for Windows
Server 2008 or Windows Server 2008 R2 Failover Clusters (http://support.microsoft.com/kb/943984).
These guidelines require you to run a validation test on the cluster, which you can do by running the
Cluster Validation Wizard through the Failover Cluster Manager snap-in.
Microsoft Software: Windows Server and SQL Server. Each edition of Windows Server and SQL Server
supports a different number of nodes for a failover cluster (instance), and different versions support
different failover clustering functionality. For more information, see
What’s New in Failover Clusters for Windows Server 2008 R2 (http://technet.microsoft.com/en-us/library/dd621586(WS.10).aspx). This paper covers some of the changes in SQL Server 2012; a full
discussion of all changes is planned for SQL Server Books Online and other articles that will be published
closer to the SQL Server 2012 final release.
Note: One requirement specific to the implementation of the Windows Server Failover Cluster (WSFC) is
that all nodes within the cluster must be a part of the same domain.
Storage
For the storage there are a few components to consider:
• The first consideration is the connectivity with the storage:
  o Local connectivity is usually provided through Fibre Channel switched connections, where a single node has exclusive ownership of LUNs and drives at any given time. On failover, another node can take exclusive ownership of the storage.
  o In the multisite cluster scenario, separate storage devices are commonly located at both sites. While the local nodes need access to the storage, there is also connectivity between the storage units used to link them together. The type and performance of the connectivity mechanism between storage arrays is an important part of the solution in terms of failover and I/O performance.
• Second is the storage replication technology that is used to replicate the I/Os between storage devices across the sites. This technology is provided by the storage vendor.
• Finally, the storage vendor also provides a software component to automate the failover between the storage units and determine which disks are accessible and mounted, within the cluster, on a failover.
Network
The networking component is also important in a multisite (and multi-subnet) environment. Configuring
the SQL Server instance with a valid IP address in each subnet is a critical step.
There are a few differences between SQL Server 2012 and previous versions to consider. First, although
SQL Server 2012 has integrated support for a multi-subnet configuration, configuring SQL Server to use a
stretch VLAN or a single network is still valid and supported. Next, in SQL Server 2008 and SQL
Server 2008 R2, the Time-To-Live (TTL) and other DNS replication configurations were of critical
importance to consider for failover scenarios and client connectivity. These configurations no longer
need to be addressed in SQL Server 2012 failover clustering, because some network configuration and
client driver enhancements are delivered with the SQL Server 2012 release. For more information, see
“Challenges, Mitigations, and Lessons Learned” later in this paper.
Finally, there are other networking considerations, such as the heartbeat network for the
Windows cluster, that are important but out of scope for this paper.
Windows Server Failover Cluster (WSFC) Quorum Model
With Windows Server 2008 and Windows Server 2008 R2, four types of quorum configurations are
supported. These quorum models are discussed in Failover Cluster Step-by-Step Guide: Configuring the
Quorum in a Failover Cluster (http://technet.microsoft.com/en-us/library/cc770620(WS.10).aspx). There
are also specific considerations for the quorum model in a multisite failover cluster. The primary
documentation on this is Requirements and Recommendations for a Multi-site Failover Cluster
(http://technet.microsoft.com/en-us/library/dd197575(WS.10).aspx), specifically the “Number of nodes
and corresponding quorum configuration” bullet point.
To summarize the information covered in those links: For multisite failover clustering with an even number
of nodes, the Node and File Share Majority quorum configuration is a recommended option. Some
tie-breaker, be it a disk, node, or file share witness, should be utilized. A file share witness is commonly
recommended, because it is usually easier to keep the file share accessible to both sites. For an odd
number of nodes, consider Node Majority. However, in this configuration, if the site with more nodes
(usually the primary) fails, manual intervention is required to force the cluster to start at the secondary
site, because quorum is lost. The sketch below works through the vote arithmetic for both cases.
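The quorum arithmetic behind these recommendations is simple: the cluster keeps running while more than half of all votes remain. A minimal Python sketch, assuming one vote per node and one vote for a file share witness, works through the two configurations described above.

    def majority_survives(total_votes, votes_lost):
        """Quorum holds while more than half of all votes remain reachable."""
        return (total_votes - votes_lost) > total_votes // 2

    # Four nodes plus a file share witness = 5 votes. Losing a 2-node site
    # leaves 3 of 5 votes, so the cluster stays up automatically.
    print(majority_survives(5, 2))   # True

    # Five nodes split 3/2 with Node Majority. Losing the 3-node primary
    # site leaves 2 of 5 votes; quorum is lost and manual intervention is
    # required to force the secondary site online.
    print(majority_survives(5, 3))   # False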
SQL Server Customer Lab Testing with Multisite Failover Cluster
Improvements
To observe some of the new multi-subnet functionality, we conducted tests in the SQL Server Customer
Advisory Lab (SQLCAT) in Redmond, Washington, United States. Our primary goals were to configure a
multi-subnet failover cluster between two sites and to run a customer workload against the
configuration in a number of tests.
The lab configuration was as follows.
Hardware and Software:
• Two Windows Server 2008 R2 servers at ‘Site A’ and two Windows Server 2008 R2 servers at ‘Site B’
• SQL Server 2012 prerelease software configured as a single multisite failover cluster instance (FCI)
Storage:
There were two EMC Symmetrix VMAX enterprise storage arrays, one at each site. The arrays were both
configured with two VMAX storage engines and 240 disk drives. The drives were a mixture of Enterprise
Flash Drives (EFD), Fibre Channel, and SATA. For the purposes of the test, a portion of the Fibre Channel
drives was presented to the Windows Server 2008 R2 failover cluster in a mirrored configuration. Nine
112-GB volumes were used for data and log storage. One 300-GB volume was used to hold data and log
backups. Each array was connected to the test servers using dual 8-Gbps Fibre Channel connections.
The storage arrays utilized Symmetrix Remote Data Facility (SRDF; links are provided in the
appendix) to send data from the source array to the target array. The source storage devices, called R1
volumes, send data to the target storage devices, called R2 volumes. When a site failover occurs,
SRDF/CE (Cluster Enabler) detects the array replication state as it relates to the WSFC active node.
SRDF/CE also handles all replication state changes.
The arrays communicated using dual 1-Gbps Ethernet connections. The use of Ethernet links allowed the
test team to utilize network latency generation equipment to inject delays into the test, thus simulating
communication over distance.
Figure 1: Diagram of the multisite configuration, with storage replication between sites and storage
units
Network:
To simulate a multisite network, three logical sites were created. ‘Site A’ hosted two failover cluster
nodes and one of the storage arrays. ‘Site A’ was configured on its own subnet. ‘Site B’ was on a
different subnet and hosted the other storage array and cluster nodes. A third site/subnet
hosted the Active Directory structure, the fileshare for the Windows Server quorum configuration, and a
single DNS server. Although a third site may not match the architecture of all real-world
implementations, the test results and information taken from the lab should still provide valid
information that you can apply to your organization’s environment.
For more information about considerations for client connectivity and network registration on SQL
Server FCI failover, see “Challenges, Mitigations, and Lessons Learned” later in this paper.
Quorum Model:
In our testing, we used the Node and File Share Majority quorum model. We placed the file share on a
third subnet that was accessible to each of the other subnets. This is just one of multiple options for the
quorum model in a multi-subnet failover cluster scenario. You should pick the model that makes the
most sense for your organization’s overall implementation. For more information about quorum
models, see “Windows Server Failover Cluster (WSFC) Quorum Model” earlier in this paper.
Workload:
In order to provide a more realistic testing scenario, a customer workload was utilized that was
significantly write-oriented (well over 90 percent) and executed approximately 2,000 batches per
second, placing an I/O load on the failover cluster environment. The I/Os were fairly small in size,
simulating a high-throughput OLTP application.
We tested numerous failover scenarios, including manual failover (Move Group) and, through various
mechanisms, power shutdown of the server hosting the running SQL Server FCI. The failover behavior,
both with and without the workload running, was as expected. The testing yielded some key takeaways
and considerations, which we share in the following section.
Challenges, Mitigations, and Lessons Learned
In our tests and other experiences with multisite failover clustering in the prerelease versions of SQL
Server 2012, we identified a few key observations that we think will be important for you as you
begin to build and deploy your own failover clustering solutions with SQL Server 2012.
Storage Validation Check Requirement
In a multisite cluster environment with SAN replication, it is expected that storage volumes on one site
are visible only to the nodes on the same site, and that the storage volumes on the other site are visible
only to the nodes on that site. Therefore, not all of the storage is visible to all of the nodes at once, and
hence the storage validation checks may not pass or may produce warnings. If you skip the storage
validation tests, a message related to lack of support is displayed, such as:
“No, I do not require support from Microsoft for this cluster, and therefore do not want to run the
validation tests. When I click Next, continue creating the cluster.”
In this environment, it is expected that the storage validation tests can be skipped, because a multisite
cluster solution does not require passing the storage validation tests to be a fully supported solution. For
more information, see “Geographically dispersed clusters” in the KB article The Microsoft Support Policy
for Windows Server 2008 or Windows Server 2008 R2 Failover Clusters
(http://support.microsoft.com/kb/943984).
Note: Only the storage validation check can be skipped. If all validation is skipped or if there are
warnings or failures in the validation report, SQL Server setup detects this and blocks the installation.
IP Address Configuration in Failover Cluster Manager with an OR Dependency
When you configure a multi-subnet failover cluster, only one IP address needs to be online. The others
can remain offline until a failover to their subnet occurs. Because this may appear incorrect or
misconfigured, we’ve provided an example to show how Failover Cluster Manager displays this
configuration. Notice that, depending on which subnet currently hosts the FCI, the Status column shows
one IP address as Offline and the other as Online.
Figure 2: An example in Failover Cluster Manager of the OR dependency setup for IP addresses existing
in multiple subnets
Appropriate Quorum Model
A multisite failover cluster usually spans multiple geographic regions and contains storage components
at each site. Therefore, there are specific considerations for the quorum model in this environment. For
more information about these considerations, see “Windows Server Failover Cluster (WSFC) Quorum
Model” earlier in this paper. However, when you run Windows Server cluster validation with the
multisite failover cluster, a message appears suggesting Node and Disk Majority as the appropriate
quorum model, as shown in Figure 3.
Validate Quorum Configuration
Validate that the current quorum configuration is optimal for the cluster
Validating cluster quorum settings.
The current quorum configuration is Node and File Share Majority (\\r805-04\quorum).
This quorum model will be able to sustain failures of 2 node(s) if the witness file share remains available and 1
node(s) when the witness file share goes offline or fails
The recommended quorum configuration for the current number of nodes is Node and Disk Majority
This quorum configuration can be changed using the Configure Cluster Quorum wizard. This wizard can be
started from the Failover Cluster Manager console by selecting the cluster name in the left hand pane, then in
the right "actions" pane selecting "More Actions...." and then selecting "Configure Cluster Quorum Settings..."
Figure 3: Output of the cluster validation tool for the quorum configuration
The wizard in the cluster validation tool does not detect whether a particular cluster is a multisite
cluster. It is safe to ignore this recommendation and use a more appropriate quorum model, such as
Node and File Share Majority.
Network Registration and Client Connectivity after Multi-Subnet SQL Server
FCI Failover
In SQL Server 2012, the SQL Server failover cluster network name has the RegisterAllProvidersIP
property enabled for the Network Name resource (virtual network name). For a multi-subnet FCI, this
property causes all of the IP addresses that SQL Server is configured to use to be registered in DNS under
the SQL Server virtual network name. Because all IPs are registered in DNS, a cross-data-center failover
does not require any change to the IP addresses registered in DNS. Eliminating this DNS update allows
client connections to resolve to the SQL Server failover cluster (virtual network name) more quickly after
a failover.
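Because every provider IP is registered under the virtual network name, a DNS lookup of that name returns one address per subnet, and the client can then decide how to try them. The following Python sketch shows such a lookup; the virtual network name sqlfci-vnn and the example addresses are hypothetical.

    import socket

    # With RegisterAllProvidersIP enabled, DNS holds one A record per subnet
    # for the FCI's virtual network name, so this returns multiple addresses.
    for family, _, _, _, sockaddr in socket.getaddrinfo(
            "sqlfci-vnn", 1433, type=socket.SOCK_STREAM):
        print(sockaddr[0])   # e.g. 10.0.1.50 (Site A) and 10.0.2.50 (Site B)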
Newer SQL Server client drivers, including SQL Server Native Client, have added support for the
MultiSubnetFailover keyword. If the client enables the MultiSubnetFailover connection option, all of the
IP addresses that the SQL Server FCI can use are evaluated and resolved by the client on connection.
This enhancement also helps improve client connectivity after a failover.
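As an illustration, the keyword is enabled in the connection string. The Python/pyodbc sketch below assumes a driver version that supports the keyword (such as SQL Server Native Client 11.0); the server name, database, and authentication details are placeholders.

    import pyodbc

    # MultiSubnetFailover=Yes asks a supporting driver to attempt all of the
    # IP addresses registered for the virtual network name, speeding up
    # connections after a cross-subnet failover.
    conn = pyodbc.connect(
        "DRIVER={SQL Server Native Client 11.0};"
        "SERVER=tcp:sqlfci-vnn;"
        "DATABASE=testdb;"
        "Trusted_Connection=yes;"
        "MultiSubnetFailover=Yes;"
    )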
If the client is not using a driver with support for the MultiSubnetFailover keyword (or the keyword is
not enabled), there are a few things to consider:
• The client driver evaluates the IP addresses in a serial manner. This IP evaluation can extend the time it takes the client to connect. A recommendation is to increase the ConnectionTimeout by 21 seconds for every additional IP address that the SQL Server network name could resolve to. So if a second IP address at a new site is added, you can configure the new ConnectionTimeout as [previous ConnectionTimeout] + 21 seconds. The formula is (X + (N-1) * 21), where X = the current ConnectionTimeout and N = the number of sites with IP addresses.
• From our testing, instance name-to-port number resolution using the SQL Server Browser service was not always successful. This would cause problems with clients resolving named SQL Server instances. Therefore, for drivers that do not support the MultiSubnetFailover keyword and are connecting to a SQL Server named instance, we recommend that you use a static port configuration for the SQL Server instance. The client in this case can connect by specifying the SQL Server name and port number directly in the connection parameters. A sketch illustrating both recommendations follows this list.
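The following Python sketch makes both recommendations concrete: the timeout arithmetic from the first bullet and a static-port connection from the second. The server name, port, database, and the legacy driver named here are illustrative assumptions, not values from the lab.

    import pyodbc

    def multi_subnet_timeout(current_timeout_sec, site_count):
        """ConnectionTimeout formula from above: X + (N - 1) * 21."""
        return current_timeout_sec + (site_count - 1) * 21

    print(multi_subnet_timeout(15, 2))   # 36: a 15-second timeout plus one extra site

    # With a driver that lacks MultiSubnetFailover support, connect to a named
    # instance through a static port ("server,port") so the SQL Server Browser
    # service is not needed for name-to-port resolution.
    conn = pyodbc.connect(
        "DRIVER={SQL Server Native Client 10.0};"
        "SERVER=tcp:sqlfci-vnn,1433;"
        "DATABASE=testdb;"
        "Trusted_Connection=yes;",
        timeout=multi_subnet_timeout(15, 2),
    )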
Conclusion
SQL Server 2012 AlwaysOn provides customers with flexible design choices for high availability and
disaster recovery. Multisite failover clustering provides instance-level high availability and disaster
recovery as one of the architecture choices under SQL Server AlwaysOn. Significant improvements have
been delivered to the multisite failover clustering technology, making it a viable option for high
availability and disaster recovery in a number of environments. The intent of this document is to help
users become more familiar with the technology, help with successful deployments, and make readers
aware of the improvements made to multisite clustering in SQL Server 2012.
For more information, see the following links and articles referenced in this paper:
• SQL Server 2012 High Availability: http://www.microsoft.com/sqlserver/en/us/future-editions/mission-critical/SQL-Server-2012-high-availability.aspx
• SQL Server AlwaysOn blog: http://blogs.msdn.com/b/sqlalwayson/
• Cluster Resource Dependency Expressions blog: http://blogs.msdn.com/b/clustering/archive/2008/01/28/7293705.aspx
• The Microsoft Support Policy for Windows Server 2008 or Windows Server 2008 R2 Failover Clusters: http://support.microsoft.com/kb/943984
• What’s New in Failover Clusters for Windows Server 2008 R2: http://technet.microsoft.com/en-us/library/dd621586(WS.10).aspx
• Failover Cluster Step-by-Step Guide: Configuring the Quorum in a Failover Cluster: http://technet.microsoft.com/en-us/library/cc770620(WS.10).aspx
• Requirements and Recommendations for a Multi-site Failover Cluster: http://technet.microsoft.com/en-us/library/dd197575(WS.10).aspx
• SQL Server Customer Advisory Team site: http://sqlcat.com
• SQL Server Web site: http://www.microsoft.com/sqlserver/
• SQL Server TechCenter: http://technet.microsoft.com/en-us/sqlserver/
• SQL Server DevCenter: http://msdn.microsoft.com/en-us/sqlserver/
Did this paper help you? Please give us your feedback. Tell us, on a scale of 1 (poor) to 5 (excellent), how
you would rate this paper and why you have given it this rating. For example:
• Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
• Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release.
Send feedback.
Appendix
Lab Hardware and Software Environment
Special thanks to our partners for providing hardware and people resources to the lab in order to
complete this testing.
Servers
• 4 Dell R805 machines for SQL Server, each with:
  o 2-socket, quad-core AMD Opteron processors @ 2.2 GHz
  o 32 GB RAM
• 1 Dell R805 machine as a file share witness
• 1 Dell R805 machine as a client to run the application workload
SQL Server
• SQL Server 2012 prerelease software
Storage
• 2 EMC Symmetrix VMAX SANs – one at each site
EMC and SQL Server: Solutions for SQL Server and Business Intelligence: http://www.emc.com/solutions/application-environment/microsoft/solutions-for-sql-server-business-intelligence.htm
Microsoft SQL Server on EMC Symmetrix Storage Systems: http://www.emc.com/collateral/software/solution-overview/h2203-ms-sql-svr-symm-ldv.pdf
Storage Software
• EMC SRDF for SAN-level replication
• EMC SRDF/CE (Cluster Enabler)
EMC Symmetrix SRDF: http://www.emc.com/storage/symmetrix/srdf.htm