High Availability and Disaster Recovery in a Multi-Site Virtual Environment Using Virtualization
Henk Den Baes
Technology Advisor, Microsoft BeLux

HA & DR with Multi-Site Clustering
• Introduction
• Networking
• Storage
• Quorum

Session Objectives
• Session Objective(s):
  – Clustering is not too expensive and not that complex
  – Understanding the need and benefit of multi-site clusters
  – What to consider as you plan, design, and deploy your first multi-site cluster
• Clustering your Hyper-V servers is a great solution for not only high availability, but also disaster recovery

Hyper-V Virtualization Scenarios
• Server Consolidation
• Test and Dev
• Business Continuity
• Dynamic Datacenter

Keeping the Business Running
• Business Continuity: resumption of full operations combining People, Processes and Platforms
• Disaster Recovery: site-level crisis, data and IT operations resumption
• Backup and Restore: presumes infrastructure is whole; 97% is file/small unit related
• High Availability: presumes that the rest of the environment is active

Business Continuity with Virtualization
Virtualization reduces BC costs and minimizes business downtime by:
• increasing the availability of infrastructure
• extending protection to more applications
• simplifying backups, recovery and DR testing
[Diagram labels: Business Continuity; High Availability; Disaster Recovery; Backup and Recovery]

Disaster Recovery
[Diagram: Primary Site and Secondary Site, each with a storage array and backup/recovery; clustering with shared storage (CSV), VHDs, and Quick/Live Migration]

Differences Between Single-site & Multi-site Clusters
• Single-site cluster (two nodes): VMs move between nodes on the same SAN and share common storage
• Multi-site cluster (Primary Site and Secondary Site, WAN connectivity): VMs move between physical nodes on different SANs and without true shared storage between the sites
[Diagram: single-site cluster on one storage array; multi-site cluster with SAN replication between the primary and secondary storage arrays]

Microsoft Stretch Clustering & Storage Continuity
• Geographically distributed clusters are extended to different physical locations
• Stretch Clustering automatically fails VMs over to a geographically different site
• Primary site data is replicated to the secondary site
• Multi-site stretch configurations can provide automatic failover
• Stretch clustering uses the same concept as local site clustering
• Storage array or third-party software provides SAN data replication
[Diagram: storage array at the Primary Site; storage array at the Secondary Site holding replicated data from site A]

Benefits of a Multi-Site Cluster
• Protects against loss of an entire datacenter
• Automates failover
  – Reduced downtime
  – Lower complexity disaster recovery plan
• Reduces administrative overhead
  – Automatically synchronize application and cluster changes
  – Easier to keep consistent than standalone servers
The primary reason DR solutions fail is dependence on people

DR: NMBS VDI use case - NOC
• Windows 7 master
• NOC is installed on 1 site
  – DRP is costly & has to be tested yearly
  – There is no automatic application sync
  – Dedicated master, manual upgrades
  – No persistent image, need for admin rights
• Solution:
  – Remote Desktop Services
  – Hyper-V R2
  – Remote Desktop Connection

HA & DR with Multi-Site Clustering
• Introduction
• Networking
• Storage
• Quorum

Network Considerations
• Network Deployment Options:
  1. Stretch VLANs across sites
  2.
     Cluster nodes can reside in different subnets
[Diagram: Site A and Site B connected by a public network and a redundant network; addresses shown: 10.10.10.1, 30.30.30.1, 20.20.20.1, 40.40.40.1]

Stretching the Network
• Longer distance traditionally means greater network latency
• Missed inter-node health checks can cause false failover
• Cluster inter-node heartbeating is fully configurable
• SameSubnetDelay (default = 1 second)
  – Interval between heartbeats
• SameSubnetThreshold (default = 5 heartbeats)
  – Missed heartbeats before an interface is considered down
• CrossSubnetDelay (default = 1 second)
  – Interval between heartbeats sent to nodes on dissimilar subnets
• CrossSubnetThreshold (default = 5 heartbeats)
  – Missed heartbeats before an interface is considered down to nodes on dissimilar subnets
• Command Line: Cluster.exe /prop
• PowerShell (R2): Get-Cluster | fl *

Updating VM’s IP on Subnet Failover
• On cross-subnet failover, if guest is…
  – DHCP: IP updated automatically
  – Static IP: admin needs to configure new IP; can be scripted
• Best to use DHCP in guest OS for cross-subnet failover

Client Reconnect Considerations
• Nodes in dissimilar subnets
• VM obtains new IP address
• Clients need that new IP address from DNS to reconnect
[Diagram: DNS replication. DNS Server 1 (Site A): record created, 10.10.10.111. DNS Server 2 (Site B): record updated, 20.20.20.222. Record updated: VM = 20.20.20.222]

Solutions
• Solution #1: Prefer Local Failover
  – Scale up for local failover for higher availability
      • No change in IP addresses for HA
      • Means not going over the WAN and is still usually preferred
  – Cross-site failover for disaster recovery
• Solution #2: Stretch VLANs
  – Deploying a VLAN minimizes client reconnection times
      • IP of the VM never changes
• Solution #3: Abstraction in Network Device
  – Network device uses 3rd IP
  – 3rd IP is the one registered in DNS & used by client

HA & DR with Multi-Site Clustering
• Introduction
• Networking
• Storage
• Quorum

Storage in Multi-Site Clusters
• Different from local clusters:
  – Multiple storage arrays –
    independent per site
  – Nodes commonly access own site storage
  – No ‘true’ shared disk visible to all nodes
[Diagram: Site A and Site B, each with its own SAN]

Storage Considerations
• Changes are made on Site A and replicated to Site B
• Requires data replication mechanism between sites
[Diagram: SAN at Site A; replica at Site B]

Synchronous Replication
• Host receives “write complete” response from the storage after the data is successfully written on both storage devices
[Diagram: write request to primary storage; replication to secondary storage; acknowledgement; write complete]

Asynchronous Replication
• Host receives “write complete” response from the storage after the data is successfully written to just the primary storage device; replication to the secondary storage follows
[Diagram: write request to primary storage; write complete; replication to secondary storage]

Synchronous vs. Asynchronous
Synchronous
• No data loss
• Requires high bandwidth/low latency connection
• Stretches over shorter distances
• Write latencies impact application performance
Asynchronous
• Potential data loss on hard failures
• Enough bandwidth to keep up with data replication
• Stretches over longer distances
• No significant impact on application performance

Hardware Replication Partners
• Hardware storage-based replication

Software Replication Partners
• Software host-based replication
  – Double-Take Availability
  – SteelEye DataKeeper Cluster Edition
  – Symantec Storage Foundation for Windows

Storage Virtualization Abstraction
• Some replication solutions provide complete abstraction in storage array
• Servers are unaware of accessible disk location
• Fully compatible with Cluster Shared Volumes (CSV)
[Diagram: Site A and Site B; servers abstracted from storage; virtualized storage presents logical LUN]

Focus on Double-Take for Hyper-V
• Product Features
  – Host level filter driver replication
  – Simplified management
      • Auto discovery, guest level policies, & guest protection schema
      • Not a file level protection product (block based)
      • One click failover and failover management
  – WAN support (bandwidth throttling,
    compression…)
  – Integration with SCOM and SCVMM
      • All managed via one familiar console
• Licensed per Hyper-V Host
  – Unlimited number of VMs

Basic Double-Take Configuration

How Double-Take Replication Works
[Diagram: two servers, each with Applications, Operating System, File System and Hardware Layer; a Double-Take filter; replication over any IP network. Callouts: initial mirror of data; WAN optimized, with three levels of data compression and scheduled bandwidth limiting capabilities]

Host-Level Protection for Hyper-V
[Diagram: two Hyper-V hosts, each with several VHDs]

Double-Take GeoCluster
• Integrates with Microsoft Failover Clustering
• Uses Double-Take Patented Replication
• Extends Clusters Across Geographical Distances
• Eliminates Single Point of Disk Failure

How Double-Take GeoCluster Works
• GeoCluster nodes use separate disks, kept synchronized by real-time replication
• Only the active node accesses its disks
• Data is replicated to all passive nodes
• At failover, the new active node resumes with current, replicated data

GeoCluster for Hyper-V Workloads
• Product Features
  – Provides redundancy of storage
  – Allows cluster nodes to be geographically distributed
  – Utilizes GeoCluster technology to extend Hyper-V clustering across virtual hosts without the use of shared disk
  – Replicates cluster data to a secondary node, eliminating single point of failure
  – Allows manual and automatic moves of cluster resources between virtual hosts

CSV with Replicated Storage
• Traditional architectural assumptions may collide…
  – Traditional replication solutions typically assume only 1 array accessed at a time
  – Cluster Shared Volumes assumes all nodes can concurrently access a LUN
• Talk to your storage vendor for their support story
[Diagram: Site A and Site B; one copy Read/Write, the replica Read/Only; a VM attempts to access the replica VHD]

HA & DR with Multi-Site Clustering
• Introduction
• Networking
• Storage
• Quorum

Quorum Overview
• Majority is greater than 50%
• Possible Voters: Nodes (1 each) + 1 Witness (Disk or File Share)
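As a rough illustration of the majority rule above, the following Python sketch counts votes the way the Quorum Overview describes: one vote per node plus at most one witness vote, with quorum requiring strictly more than 50% of all votes. The function names votes_needed and has_quorum are made up for this example and are not part of any clustering API.

```python
def votes_needed(total_voters: int) -> int:
    """Smallest vote count that is strictly greater than 50% of all voters."""
    return total_voters // 2 + 1


def has_quorum(reachable_votes: int, node_count: int, witness: bool = False) -> bool:
    """Return True if the reachable votes form a majority.

    Each node contributes one vote; an optional witness (disk or
    file share) contributes one additional vote.
    """
    total_voters = node_count + (1 if witness else 0)
    return reachable_votes >= votes_needed(total_voters)


if __name__ == "__main__":
    # Five voters: three votes are needed.
    print(votes_needed(5))
    # Four nodes split two and two, no witness: neither half has a majority.
    print(has_quorum(2, node_count=4))
    # Same split, but one half can also reach the witness: 3 of 5 votes.
    print(has_quorum(3, node_count=4, witness=True))
```

The even split is the case a witness is meant to resolve: with four nodes and no witness, two reachable votes are not a majority, while adding a witness gives one side a third vote out of five.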
Quorum Types
• Disk only (not recommended)
• Node and Disk majority
• Node majority
• Node and File Share majority
[Diagram: one vote per node, plus one vote for the witness where a witness is used]

Replicated Disk Witness
• A witness is a tie breaker when nodes lose network connectivity
  – When a witness is not a single decision maker, problems occur
• Do not use in multi-site clusters unless directed by vendor
[Diagram: votes from nodes and a “?” over the replicated storage]

Node Majority
5 Node Cluster: Majority = 3, majority in primary site
Cross-site network connectivity broken!
• Nodes in the majority (primary) site: “Can I communicate with majority of the nodes in the cluster?” Yes, then stay up
• Nodes in the other site: “Can I communicate with majority of the nodes in the cluster?” No, drop out of cluster membership
[Diagram labels: Site A; Site B]

Node Majority
5 Node Cluster: Majority = 3, majority in primary site
Disaster at Site 1
• Primary site: “We are down!”
• Remaining nodes: “Can I communicate with majority of the nodes in the cluster?” No, drop out of cluster membership
• Need to force quorum manually
[Diagram labels: Site A; Site B]

Forcing Quorum
• Forcing quorum is a way to manually override and start a node even though it has not achieved quorum
  – Always understand why quorum was lost
  – Used to bring cluster online without quorum
  – Cluster starts in a special “forced” state
  – Once majority achieved, drops out of “forced” state
• Command Line:
  – net start clussvc /fixquorum (or /fq)
• PowerShell (R2):
  – Start-ClusterNode -FixQuorum (or -fq)

Multi-Site with File Share Witness
SCENARIO: Complete resiliency and automatic recovery from the loss of any 1 site
[Diagram: Site A and Site B connected over the WAN; File Share Witness \\Foo\Share at Site C (branch office)]

Multi-Site with File Share Witness
SCENARIO: Complete resiliency and automatic recovery from the loss of connection between sites
[Diagram: Site A; Site C (branch office) hosting the File Share Witness]
• “Can I communicate with majority of the nodes (+FSW) in the cluster?” Yes, then stay up
• “Can I communicate with majority of the nodes in the cluster?”
  No (lock failed), drop out of Cluster Membership
[Diagram labels, continued: File Share Witness \\Foo\Share; WAN; Site B]

File Share Witness (FSW) Considerations
• Simple Windows File Server
• Single file server can serve as a witness for multiple clusters
  – Each cluster requires its own share
  – Can be made highly available on a separate cluster
• Recommended to be at 3rd separate site to enable automatic site failover
• FSW cannot be on a node in the same cluster
• FSW should not be in a VM running on the same cluster

Quorum Model Recap
Node and File Share Majority
• Even number of nodes
• Best availability solution – FSW in 3rd site
Node Majority
• Odd number of nodes
• More nodes in primary site
Node and Disk Majority
• Use as directed by vendor
No Majority: Disk Only
• Not Recommended
• Use as directed by vendor

Datacenter Recovery Partners
• Citrix Essentials for Hyper-V augments Hyper-V DR by automating disaster recovery configuration
  – StorageLink Site Recovery manages storage automation
  – Workflow orchestration for VM site failover
  – Non-disruptive testing & staging of VM prior to failover
  – Single click failback
  – Recovery plans
  – Integrates with SCVMM
  – Plus more…

Microsoft Site Recovery Solution Stack
End to end Disaster Recovery
Management Automation
• Workflow automation
• DR Run-book
• Simplified configuration & testing
Server and Application Availability
• Hyper-V Clustering
• Quick and Live Migration
• Physical and Virtual
• Performance and Resource Optimization
Storage and Data Availability
• Storage Partner Data Replication
• Synchronous & Asynchronous Replication
• Array state and application restart

Microsoft Private Cloud – Server Platform
• Simplify with integrated physical, virtual and cloud management
• Improve agility with private cloud computing infrastructure
• Optimize service delivery across datacenter infrastructure and business critical services
• Lower costs through automation
[Diagram labels: Data Protection & Recovery; Design, Configure & Deploy; Virtualize, Deploy & Manage]

“We don’t have to manage our
infrastructure with multiple tools…we have one central monitoring and management console from which we can care for every aspect of our environment”
- Doug Miller, Practice Architect, Microsoft Practice Group, CDW

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.