High Availability and Disaster Recovery in a Multi-Site Virtual Environment
Henk Den Baes
Technology Advisor
Microsoft BeLux
HA & DR with Multi-Site Clustering
Introduction
Networking
Storage
Quorum
Session Objectives
• Session objectives:
– Show that clustering is neither too expensive nor too complex
– Understand the need for and benefit of multi-site clusters
– Know what to consider as you plan, design, and deploy your first multi-site cluster
• Clustering your Hyper-V servers is a great solution not only for high availability, but also for disaster recovery
Hyper-V Virtualization Scenarios
Server Consolidation
Test and Dev
Business Continuity
Dynamic Datacenter
Keeping the Business Running
• Business Continuity: resumption of full operations combining people, processes, and platforms
• Disaster Recovery: site-level crisis; data and IT operations resumption; presumes that the rest of the environment is active
• High Availability: presumes the infrastructure is whole
• Backup and Restore: 97% is file/small-unit related
Business Continuity with Virtualization
• Virtualization reduces BC costs and minimizes business downtime by:
– increasing the availability of infrastructure (High Availability)
– extending protection to more applications (Disaster Recovery)
– simplifying backups, recovery, and DR testing (Backup and Recovery)
Disaster Recovery
[Diagram: primary site with clustered Hyper-V hosts, VHDs on shared storage (CSV), Quick/Live Migration, and backup/recovery; the storage array is replicated to a secondary-site storage array with its own backup/recovery]
Differences Between Single-Site & Multi-Site Clusters
• Single-site cluster (e.g. two nodes): VMs move between nodes on the same SAN and share common storage
• Multi-site cluster: VMs move between physical nodes on different SANs, without true shared storage between the sites; the primary storage array is kept in sync with the secondary storage array by SAN replication over WAN connectivity
Microsoft Stretch Clustering & Storage Continuity
• Geographically distributed clusters are extended to different physical locations
• Stretch clustering uses the same concept as local-site clustering
• Primary-site data is replicated to the secondary site; the storage array or third-party software provides SAN data replication
• Stretch clustering automatically fails VMs over to a geographically different site
• Multi-site stretch configurations can provide automatic failover
[Diagram: a storage array at the primary site replicates data from site A to a storage array at the secondary site]
Benefits of a Multi-Site Cluster
• Protects against loss of an entire datacenter
• Automates failover
– Reduced downtime
– Lower complexity disaster recovery plan
• Reduces administrative overhead
– Automatically synchronize application and cluster changes
– Easier to keep consistent than standalone servers
The primary reason DR solutions fail is
dependence on people
DR: NMBS VDI use case - NOC
• Windows 7 master
• NOC is installed on 1 site
– DRP is costly & has to be tested
yearly
– There is no automatic application sync
– Dedicated master, manual upgrades
– No persistent image; need for admin rights
• Solution:
– Remote Desktop Services
– Hyper-V R2
– Remote Desktop Connection
HA & DR with Multi-Site Clustering
Introduction
Networking
Storage
Quorum
Network Considerations
• Network deployment options:
1. Stretch VLANs across sites
2. Cluster nodes can reside in different subnets
[Diagram: Site A nodes 10.10.10.1 and 30.30.30.1, Site B nodes 20.20.20.1 and 40.40.40.1, connected by a public network and a redundant network]
Stretching the Network
• Longer distance traditionally means greater network latency
• Missed inter-node health checks can cause false failover
• Cluster inter-node heartbeating is fully configurable:
– SameSubnetDelay (default = 1 second): frequency at which heartbeats are sent
– SameSubnetThreshold (default = 5 heartbeats): missed heartbeats before an interface is considered down
– CrossSubnetDelay (default = 1 second): frequency at which heartbeats are sent to nodes on different subnets
– CrossSubnetThreshold (default = 5 heartbeats): missed heartbeats before an interface is considered down for nodes on different subnets
• Command line: Cluster.exe /prop
• PowerShell (R2): Get-Cluster | fl *
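The failure-detection time implied by these settings is simply delay × threshold; a minimal sketch of that arithmetic (Python used only for illustration, with the default values above):

```python
# Heartbeat failure detection: an interface is declared down after
# (delay between heartbeats) x (missed-heartbeat threshold) seconds of silence.
# Defaults mirror the slide above: 1 s delay, 5 missed heartbeats.

def detection_time(delay_s: float = 1.0, threshold: int = 5) -> float:
    """Seconds of missed heartbeats before an interface is considered down."""
    return delay_s * threshold

# With the defaults, failover can be triggered after 5 seconds of silence.
print(detection_time())
# A high-latency WAN link might warrant larger cross-subnet values, e.g.:
print(detection_time(delay_s=2.0, threshold=10))
```

Raising CrossSubnetDelay/CrossSubnetThreshold trades slower failover for fewer false failovers across a high-latency WAN.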
Updating VM’s IP on Subnet Failover
• On cross-subnet failover, if the guest uses:
– DHCP: IP is updated automatically
– Static IP: the admin needs to configure the new IP (can be scripted)
• Best to use DHCP in the guest OS for cross-subnet failover
Client Reconnect Considerations
• Nodes are in different subnets
• The VM obtains a new IP address
• Clients need that new IP address from DNS to reconnect
[Diagram: the VM fails over from Site A (10.10.10.111) to Site B (20.20.20.222); the record on DNS Server 1 is updated and replicated to DNS Server 2]
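A client keeps resolving the old address until its cached DNS record expires, so the reconnect delay is bounded by record TTL plus DNS replication. A toy model of that behavior (Python for illustration; the name, addresses, and TTL are hypothetical):

```python
# Toy client-side DNS cache: a record is served from cache until its TTL
# expires, so after failover clients still see the old IP for up to one TTL.

class CachedResolver:
    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.cache = {}  # name -> (ip, time_cached)

    def resolve(self, name: str, authoritative: dict, now: float) -> str:
        entry = self.cache.get(name)
        if entry and now - entry[1] < self.ttl_s:
            return entry[0]              # still fresh: stale answer possible
        ip = authoritative[name]         # cache miss/expired: query DNS
        self.cache[name] = (ip, now)
        return ip

dns = {"vm.contoso.test": "10.10.10.111"}      # hypothetical record
r = CachedResolver(ttl_s=300)
r.resolve("vm.contoso.test", dns, now=0)       # caches the Site A address
dns["vm.contoso.test"] = "20.20.20.222"        # cross-subnet failover updates DNS
print(r.resolve("vm.contoso.test", dns, now=100))  # within TTL: old IP
print(r.resolve("vm.contoso.test", dns, now=400))  # TTL expired: new IP
```

Lowering the TTL on records for clustered VMs shortens this reconnect window at the cost of more DNS queries.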
Solutions
• Solution #1: Prefer local failover
– Scale up for local failover for higher availability
• No change in IP addresses for HA
• Keeps traffic off the WAN; usually still preferred
– Reserve cross-site failover for disaster recovery
• Solution #2: Stretch VLANs
– Deploying a stretched VLAN minimizes client reconnection times
• The IP of the VM never changes
• Solution #3: Abstraction in a network device
– The network device uses a third IP
– The third IP is the one registered in DNS and used by clients
HA & DR with Multi-Site Clustering
Introduction
Networking
Storage
Quorum
Storage in Multi-Site Clusters
• Different from local clusters:
– Multiple storage arrays, independent per site
– Nodes commonly access their own site’s storage
– No ‘true’ shared disk visible to all nodes
[Diagram: Site A and Site B, each with its own SAN]
Storage Considerations
• Changes are made on Site A and replicated to Site B
• Requires a data replication mechanism between the sites
[Diagram: Site A SAN replicated to a replica SAN at Site B]
Synchronous Replication
• The host receives a “write complete” response from the storage only after the data is successfully written on both storage devices
• Sequence: write request to primary storage, replication to secondary storage, acknowledgement from secondary, then “write complete” to the host
Asynchronous Replication
• The host receives a “write complete” response after the data is successfully written to just the primary storage device; replication to the secondary storage follows afterwards
Synchronous vs. Asynchronous
• Synchronous:
– No data loss
– Requires a high-bandwidth/low-latency connection
– Stretches over shorter distances
– Write latencies impact application performance
• Asynchronous:
– Potential data loss on hard failures
– Needs enough bandwidth to keep up with data replication
– Stretches over longer distances
– No significant impact on application performance
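The data-loss trade-off above can be sketched as a toy write path (Python for illustration; this models the concept, not any vendor’s replication API): a synchronous replicator only acknowledges after both copies are written, so a crash loses no acknowledged write, while an asynchronous one acknowledges first and can lose its replication backlog.

```python
# Toy model of sync vs. async replication: which acknowledged writes survive
# a hard failure of the primary (only the secondary copy survives)?

class Replicator:
    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary, self.secondary, self.acked = [], [], []

    def write(self, block):
        self.primary.append(block)
        if self.synchronous:
            self.secondary.append(block)   # replicate BEFORE acknowledging
        self.acked.append(block)           # "write complete" to the host

    def drain(self):
        """Background replication catching up (async mode)."""
        self.secondary = list(self.primary)

sync_r, async_r = Replicator(True), Replicator(False)
for b in ("A", "B", "C"):
    sync_r.write(b)
    async_r.write(b)

# Hard failure of the primary before async replication catches up:
print(set(sync_r.acked) - set(sync_r.secondary))   # empty: nothing acknowledged is lost
print(set(async_r.acked) - set(async_r.secondary)) # the unreplicated backlog is lost
```

The async loss window shrinks as replication bandwidth catches up with the write rate, which is why the slide calls for “enough bandwidth to keep up with data replication.”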
Hardware Replication Partners
• Hardware storage-based replication
Software Replication Partners
• Software host-based replication:
– Double-Take Availability
– SteelEye DataKeeper Cluster Edition
– Symantec Storage Foundation for Windows
Storage Virtualization Abstraction
• Some replication solutions provide complete abstraction in the storage array
• Servers are unaware of the accessible disk’s location
• Fully compatible with Cluster Shared Volumes (CSV)
[Diagram: servers at Site A and Site B are abstracted from storage; the virtualized storage presents a logical LUN]
Focus on Double-Take for Hyper-V
• Product Features
– Host level filter driver replication
– Simplified management
• Auto discovery, guest level policies, & guest protection schema
• Not a file level protection product (block based)
• One click failover and failover management
– WAN support (bandwidth throttling, compression…)
– Integration with SCOM and SCVMM
• All managed via one familiar console
• Licensed per Hyper-V Host
– Unlimited number of VMs
Basic Double-Take Configuration
How Double-Take Replication Works
[Diagram: on each host, applications run on the operating system; the Double-Take filter driver sits between the file system and the hardware layer, sending an initial mirror of data and ongoing changes over any IP network]
• WAN optimized: three levels of data compression and scheduled bandwidth-limiting capabilities
Host-Level Protection for Hyper-V
[Diagram: VHDs on one Hyper-V host are replicated to a second Hyper-V host]
Double-Take GeoCluster
• Integrates with Microsoft Failover Clustering
• Uses Double-Take Patented Replication
• Extends Clusters Across Geographical Distances
• Eliminates Single Point of Disk Failure
How Double-Take GeoCluster Works
• GeoCluster nodes use separate disks, kept synchronized by real-time replication
• Only the active node accesses its disks
• Data is replicated to all passive nodes
• At failover, the new active node resumes with current, replicated data
GeoCluster for Hyper-V Workloads
• Product Features
– Provides redundancy of storage
– Allows cluster nodes to be geographically distributed
– Utilizes GeoCluster technology to extend Hyper-V clustering
across virtual hosts without the use of shared disk
– Replicates cluster data to a secondary node, eliminating
single point of failure
– Allows manual and automatic moves of cluster resources
between virtual hosts
CSV with Replicated Storage
• Traditional architectural assumptions may collide:
– Traditional replication solutions typically assume only one array is accessed at a time
– Cluster Shared Volumes assumes all nodes can concurrently access a LUN
• Talk to your storage vendor about their support story
[Diagram: a VM at Site B attempts to access the replica VHD; the Site A copy is read/write while the Site B replica is read-only]
HA & DR with Multi-Site Clustering
Introduction
Networking
Storage
Quorum
Quorum Overview
• Majority is greater than 50%
• Possible voters: nodes (1 vote each) + 1 witness (disk or file share)
• 4 quorum types:
– Node Majority
– Node and Disk Majority
– Node and File Share Majority
– Disk Only (not recommended)
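The majority rule above is just “strictly more than half of all configured votes”; a minimal sketch of that arithmetic (Python for illustration):

```python
# Quorum majority: a partition stays up only if it holds MORE than half of the
# configured votes. Votes = one per node, plus one for a disk or file-share
# witness if one is configured.

def votes_needed(node_count: int, has_witness: bool = False) -> int:
    """Minimum votes a partition needs to keep quorum."""
    total = node_count + (1 if has_witness else 0)
    return total // 2 + 1            # strictly more than 50%

print(votes_needed(5))                    # 5 nodes, Node Majority: needs 3
print(votes_needed(4, has_witness=True))  # 4 nodes + FSW = 5 votes: needs 3
print(votes_needed(4))                    # 4 nodes alone: needs 3 of 4, which is
                                          # why even node counts want a witness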
Replicated Disk Witness
• A witness is a tie-breaker when nodes lose network connectivity
– When the witness is not a single decision maker (each site sees its own replica of the witness disk), problems can occur
• Do not use a replicated disk witness in multi-site clusters unless directed by your vendor
Node Majority
• 5-node cluster: majority = 3
• Cross-site network connectivity broken!
• Each node asks: can I communicate with a majority of the nodes in the cluster?
– Site A nodes (majority in primary site): yes, so stay up
– Site B nodes: no, so drop out of cluster membership
Node Majority
• 5-node cluster: majority = 3
• Disaster at the primary site (Site A), where the majority resides: “We are down!”
• Surviving Site B nodes ask: can I communicate with a majority of the nodes in the cluster? No, so they drop out of cluster membership
• Need to force quorum manually
Forcing Quorum
• Forcing quorum is a way to manually override and start a node even though it has not achieved quorum
– Always understand why quorum was lost
– Used to bring the cluster online without quorum
– The cluster starts in a special “forced” state
– Once majority is achieved, it drops out of the “forced” state
• Command line: net start clussvc /fixquorum (or /fq)
• PowerShell (R2): Start-ClusterNode -FixQuorum (or -fq)
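The “forced” state described above can be sketched as a tiny state machine (Python for illustration; this models the documented behavior, not the actual cluster service):

```python
# Toy model of force-quorum: a node started with /fixquorum runs in a "forced"
# state, and leaves that state once a true majority of votes is online again.

class ForcedNode:
    def __init__(self, total_votes: int):
        self.total_votes = total_votes
        self.online_votes = 1          # just this node, started with /fq
        self.forced = True             # quorum was overridden manually

    def node_joined(self):
        self.online_votes += 1
        if self.online_votes > self.total_votes // 2:
            self.forced = False        # real majority achieved: drop forced state

n = ForcedNode(total_votes=5)          # 5-vote cluster, majority = 3
n.node_joined()                        # 2 votes online
print(n.forced)                        # still forced
n.node_joined()                        # 3 votes online: majority
print(n.forced)                        # back to normal quorum operation
```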
Multi-Site with File Share Witness
• SCENARIO: complete resiliency and automatic recovery from the loss of any one site
[Diagram: Site A and Site B connected by a WAN; file share witness \\Foo\Share at Site C (branch office)]
Multi-Site with File Share Witness
• SCENARIO: complete resiliency and automatic recovery from the loss of the connection between sites
• Site A nodes: can I communicate with a majority of the nodes (+FSW) in the cluster? Yes, so stay up
• Site B nodes: can I communicate with a majority of the nodes in the cluster? No (lock failed), so drop out of cluster membership
[Diagram: file share witness \\Foo\Share at Site C (branch office) arbitrates between Site A and Site B across the WAN]
File Share Witness (FSW)
Considerations
• Simple Windows File Server
• Single file server can serve as a witness
for multiple clusters
– Each cluster requires its own share
– Can be made highly available on a separate cluster
• Recommended to be at 3rd separate site to enable
automatic site failover
• FSW cannot be on a node in the same cluster
• FSW should not be in a VM running on the same
cluster
Quorum Model Recap
• Node and File Share Majority: even number of nodes; best availability solution, with the FSW in a 3rd site
• Node Majority: odd number of nodes; more nodes in the primary site
• Node and Disk Majority: use as directed by vendor
• No Majority (Disk Only): not recommended; use as directed by vendor
Datacenter Recovery Partners
• Citrix Essentials for Hyper-V augments Hyper-V DR by automating disaster recovery configuration:
– StorageLink Site Recovery manages storage automation
– Workflow orchestration for VM site failover
– Non-disruptive testing & staging of VMs prior to failover
– Single-click failback
– Recovery plans
– Integrates with SCVMM
– Plus more…
Microsoft Site Recovery Solution Stack
End-to-end disaster recovery:
• Server and application availability:
– Hyper-V
– Clustering
– Quick and Live Migration
– Physical and virtual
– Performance and resource optimization
• Storage and data availability:
– Storage partner data replication
– Synchronous & asynchronous replication
– Array state and application restart
• Management automation:
– Workflow automation
– DR run-book
– Simplified configuration & testing
Microsoft Private Cloud – Server Platform
• Design, configure & deploy
• Virtualize, deploy & manage
• Data protection & recovery
• Simplify with integrated physical, virtual, and cloud management
• Improve agility with a private cloud computing infrastructure
• Optimize service delivery across datacenter infrastructure and business-critical services
• Lower costs through automation
“We don’t have to manage our infrastructure with multiple tools…we have one central monitoring and management console from which we can care for every aspect of our environment” – Doug Miller, Practice Architect, Microsoft Practice Group, CDW
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.