Scott Schnoll
Exchange Server 2013 Site Resilience

Agenda
• The Preferred Architecture
• Namespace Planning and Principles
• Datacenter Switchovers and Failovers
• Dynamic Quorum and DAGs

The Preferred Architecture

Site Resilience Changes in Exchange 2013
• DNS resolves to multiple IP addresses
• HTTP clients have built-in IP failover capabilities; clients skip past IPs that produce hard TCP failures
• Single or multiple namespace options
• Admins can switch over by removing a VIP from DNS or disabling it
• No dealing with DNS latency

Preferred Architecture
• For a site-resilient datacenter pair, a single namespace per protocol is deployed across both datacenters:
  - HTTP: mail.contoso.com and autodiscover.contoso.com
  - IMAP: imap.contoso.com
  - SMTP: smtp.contoso.com
• Load balancers are configured without session affinity, one VIP per datacenter
• Round-robin DNS, geo-DNS, or other solutions are used to distribute traffic equally across both datacenters
[Diagram: one mail VIP in each of the two datacenters]

Preferred Architecture
• Each datacenter should be its own Active Directory site
• Deploy the unbound DAG model, spanning each DAG across the two datacenters
• Distribute active copies across all servers in the DAG
• Deploy 4 copies, 2 copies in each datacenter
• One copy will be a lagged copy (7 days) with automatic play down enabled
• Native Data Protection is used
• A single network is used for MAPI and replication traffic
• A third datacenter is used for the witness server, if possible
• Increase DAG size/density before creating new DAGs
[Diagram: a DAG spanning the mail VIPs in both datacenters, with the witness server in a third site]

Preferred Architecture
[Diagram: Selina (somewhere in NA) resolves na.contoso.com to the NA VIPs and their DAG; Batman (somewhere in Europe) resolves eur.contoso.com to the EUR VIPs and their DAG]

Namespace Planning & Principles

Namespace Planning
• The namespaces required by Exchange 2010 are no longer needed
• Can still deploy regional namespaces to control traffic
• Can still have specific namespaces for protocols
• Two namespace models: the Bound Model and the Unbound Model
• Leverage split-DNS to minimize namespaces and
control connectivity
• Deploy separate namespaces for internal and external Outlook Anywhere host names

Bound Model
[Diagram: Sue (somewhere in NA) resolves mail.contoso.com to the mail VIP; Jane (somewhere in NA) resolves mail2.contoso.com to the mail2 VIP; DAG1 and DAG2 each hold their active copies in one datacenter and their passive copies in the other]

Unbound Model
[Diagram: Sue (somewhere in NA) resolves mail.contoso.com round-robin between the number of VIPs (VIP #1 and VIP #2); a single DAG spans both]

Load Balancing
• Exchange 2013 no longer requires session affinity to be maintained on the load balancer
• For each protocol session, CAS now maintains a 1:1 relationship with the Mailbox server hosting the user's data
• Load balancer configuration and health probes will factor into namespace design
• Remember to configure health probes to monitor healthcheck.htm; otherwise the load balancer and Managed Availability will be out of sync

Single Namespace / Layer 4
[Diagram: the user reaches mail.contoso.com and autodiscover.contoso.com through a Layer 4 load balancer; a single CAS health check covers OWA, ECP, EWS, EAS, OAB, RPC, MAPI, and Autodiscover]

Single Namespace / Layer 7
[Diagram: the same namespace through a Layer 7 load balancer; the CAS health check is evaluated per protocol (OWA, ECP, EWS, EAS, OAB, RPC, MAPI, Autodiscover)]

Multiple Namespaces / Layer 4
[Diagram: per-protocol namespaces (mail, ecp, ews, oab, eas, oa, mapi, and autodiscover .contoso.com) pass through a Layer 4 load balancer to the corresponding CAS protocol]

Datacenter Switchovers and Failovers

Witness Server Placement
• Choose based on business needs and available options
• Automatic recovery on datacenter loss requires a third location whose network infrastructure has failure modes independent of both datacenters

Deployment scenario → Recommendation
• DAG(s) deployed in a single datacenter → locate the witness server in the same datacenter as the DAG members; one witness server can be shared across DAGs
• DAG(s) deployed across two datacenters, no additional locations available → locate the witness server in the primary datacenter; one witness server can be shared across DAGs
• DAG(s) deployed across two or more datacenters → locate the witness server in a third location; one witness server can be shared across DAGs

Site Resilience - CAS
• With
multiple VIP endpoints sharing the same namespace, if one VIP fails, clients fail over to an alternate VIP automatically!
• Removing the failing IP from DNS puts you in control of the time the VIP is in service
[Diagram: mail.contoso.com resolves to 192.168.1.50 and 10.0.1.50; after the failing IP is removed from DNS, it resolves to 10.0.1.50 only; cas1 and cas2 are in Redmond, cas3 and cas4 in Portland]

Site Resilience - Mailbox
• Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur
[Diagram: mbx1 and mbx2 in Redmond; mbx3, mbx4, and the witness server in Portland]

Site Resilience - Mailbox
1. Mark the failed servers/site as down:
   Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members:
   Stop-Clussvc
3. Activate the DAG members in the second datacenter:
   Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland
[Diagram: mbx1 and mbx2 in the failed Redmond site; mbx3 and mbx4 in Portland]

Activation Block Comparison
• Suspend-MailboxDatabaseCopy -ActivationOnly (parameter value: N/A; per database copy)
  - Keep the active copy off a working but questionable drive
• Set-MailboxServer -DatabaseCopyAutoActivationPolicy (parameter value: "Blocked" or "Unrestricted"; per server)
  - Used to control active/passive site resilience configurations and maintenance
  - Can force an admin move
• Set-MailboxServer -DatabaseCopyActivationDisabledAndMoveNow (parameter value: $true or $false; per server)
  - Used to do faster site failovers and maintain database availability
  - Databases are not blocked from failing back
  - Continuous move-off operation
  - As a last resort the active is not moved: databases stay mounted and mail delivery continues!

Best Practices
• Think of it as rack/site maintenance
• Flip the bit!
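The activation controls compared above map onto cmdlets like the following. This is a minimal Exchange Management Shell sketch; the server name MBX2 and database DB1 are hypothetical placeholders, not names from the deck.

```powershell
# Per database copy: keep this copy from becoming active (e.g. a working but
# questionable drive) while log shipping and replay continue
Suspend-MailboxDatabaseCopy -Identity "DB1\MBX2" -ActivationOnly

# Per server: block automatic activation of any copy on this server
Set-MailboxServer MBX2 -DatabaseCopyAutoActivationPolicy Blocked

# Per server: "flip the bit" - move actives off now and keep them off
# (faster site switchover; databases are not blocked from failing back later)
Set-MailboxServer MBX2 -DatabaseCopyActivationDisabledAndMoveNow $true

# Undo each control when maintenance or the switchover is complete
Resume-MailboxDatabaseCopy -Identity "DB1\MBX2"
Set-MailboxServer MBX2 -DatabaseCopyAutoActivationPolicy Unrestricted
Set-MailboxServer MBX2 -DatabaseCopyActivationDisabledAndMoveNow $false
```

Note the difference in scope: the suspend is per database copy, while both Set-MailboxServer switches affect every copy on the server at once.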
• Don’t ask about repair times: “if outage, go…”
• Humans are the biggest threat to recovery times

Dynamic Quorum and DAGs

Dynamic Quorum
• When a node shuts down or crashes, the node loses its quorum vote
• When a node rejoins the cluster, it regains its quorum vote
• This is referred to as a “Last Man Standing” scenario
• To continue running, the cluster must always maintain quorum after a node shutdown or failure
[Animation: a cluster loses nodes one at a time; as each node fails, its vote is removed and the required majority recalculates (majority of 7, then 4, then 3, then 2) until only the last nodes remain standing]

Dynamic Quorum
• 0 = does not have a quorum vote; 1 = has a quorum vote

  Get-ClusterNode <Name> | ft name, *weight, state

  Name  DynamicWeight  NodeWeight  State
  ----  -------------  ----------  -----
  EX1   1              1           Up

• Third-party replication DAGs not tested

Dynamic Quorum
• Generally increases the availability of the cluster
• Enabled by default; strongly recommended to leave it enabled
• Allows the cluster to continue running in failure scenarios that are not possible when this option is disabled
• Leave it enabled for the majority of DAG members
• In some cases where a Windows Server 2008 R2 DAG would have lost quorum, a Windows Server 2012 DAG can maintain quorum
• Don’t factor it into availability plans

Dynamic Witness
• Windows Server 2012 R2 and later
• Witness offline: the witness vote is removed by the cluster
• Witness failure: the witness vote is removed by the cluster
• Witness online: if necessary, the witness vote is added back by the cluster

Questions?
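The vote weights shown in the Dynamic Quorum slides can be inspected from any DAG member with the Failover Clustering module. A minimal sketch, assuming a DAG whose underlying cluster is named DAG1 (a hypothetical name):

```powershell
Import-Module FailoverClusters

# Confirm Dynamic Quorum is enabled on the cluster
# (1 = enabled; the default on Windows Server 2012 and later)
(Get-Cluster DAG1).DynamicQuorum

# Per-node votes: NodeWeight is the configured vote,
# DynamicWeight is the vote the cluster is currently counting
Get-ClusterNode -Cluster DAG1 |
    Format-Table Name, DynamicWeight, NodeWeight, State

# Windows Server 2012 R2 and later: current dynamic witness vote
(Get-Cluster DAG1).WitnessDynamicWeight
```

A healthy running node should show DynamicWeight 1; after a node fails and the cluster recalculates quorum, its DynamicWeight drops to 0, matching the 0/1 progression in the slides.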