Exchange Server 2013 Site Resilience

Scott Schnoll
Agenda
• The Preferred Architecture
• Namespace Planning and Principles
• Datacenter Switchovers and Failovers
• Dynamic Quorum and DAGs
The Preferred Architecture
Site Resilience changes in Exchange 2013
DNS resolves to multiple IP addresses
HTTP clients have built-in IP failover capabilities
Clients skip past IPs that produce hard TCP failures
Single or multiple namespace options
Admins can switchover by removing VIP from DNS or disabling
No dealing with DNS latency
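
As a rough illustration of the client behavior described above (a namespace that resolves to several IPs, with the client skipping any IP that produces a hard TCP failure), here is a minimal Python sketch. The function names `first_reachable` and `tcp_connect` are hypothetical and are not part of Exchange or of any HTTP client; real clients implement this logic internally.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_connect(ip: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat as a hard TCP failure.
        return False


def first_reachable(ips: Iterable[str],
                    try_connect: Callable[[str], bool] = tcp_connect) -> Optional[str]:
    """Walk the resolved IPs in order and return the first one that connects."""
    for ip in ips:
        if try_connect(ip):
            return ip
    return None
```

Passing a stub for `try_connect` makes the behavior easy to try without a network, for example `first_reachable(["192.168.1.50", "10.0.1.50"], lambda ip: ip != "192.168.1.50")`.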
Preferred Architecture
• For a site-resilient datacenter pair, a single namespace per protocol is deployed across both datacenters
  • autodiscover.contoso.com
  • HTTP: mail.contoso.com
  • IMAP: imap.contoso.com
  • SMTP: smtp.contoso.com
• Load balancers are configured without session affinity, one VIP per datacenter
• Round-robin, geo-DNS, or other solutions are used to distribute traffic equally across both datacenters
[Diagram: one mail VIP in each datacenter]
Preferred Architecture
• Each datacenter should be its own Active Directory site
• Deploy unbound DAG model spanning each DAG across two datacenters
• Distribute active copies across all servers in the DAG
• Deploy 4 copies, 2 copies in each datacenter
• One copy will be a lagged copy (7 days) with automatic play down enabled
• Native Data Protection is used
• Single network is used for MAPI and replication traffic
• Third datacenter used for Witness server, if possible
• Increase DAG size density before creating new DAGs
[Diagram labels: mail VIP, DAG, Witness Server, mail VIP]
Preferred Architecture
[Diagram: Selina (somewhere in NA) resolves na.contoso.com via DNS to the na VIPs in front of a DAG; Batman (somewhere in Europe) resolves eur.contoso.com via DNS to the eur VIPs in front of a DAG]
Namespace Planning & Principles
Namespace Planning
• No need for namespaces required by Exchange 2010
• Can still deploy regional namespaces to control traffic
• Can still have specific namespaces for protocols
• Two namespace models
  • Bound Model
  • Unbound Model
• Leverage split-DNS to minimize namespaces and control connectivity
  • Deploy separate namespaces for internal and external Outlook Anywhere host names
Bound Model
[Diagram: Sue (somewhere in NA) resolves mail.contoso.com via DNS to the mail VIP; Jane (somewhere in NA) resolves mail2.contoso.com via DNS to the mail2 VIP. DAG1 and DAG2 each have an Active side and a Passive side, with the Active sides in opposite positions.]
Unbound Model
[Diagram: Sue (somewhere in NA) resolves mail.contoso.com via DNS, with round-robin between the number of VIPs (VIP #1 and VIP #2) in front of a single DAG]
Load Balancing
• Exchange 2013 no longer requires session affinity to be maintained on the load balancer
  • For each protocol session, CAS now maintains a 1:1 relationship with the Mailbox server hosting the user’s data
• Load balancer configuration and health probes will factor into namespace design
  • Remember to configure health probes to monitor healthcheck.htm; otherwise the load balancer (LB) and Managed Availability (MA) will be out of sync
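
To make the health probe point concrete, the following Python sketch shows the kind of check a load balancer would perform. It assumes the probe page is published per protocol as `https://<host>/<virtual directory>/healthcheck.htm` and that HTTP 200 means healthy; that URL layout, the virtual directory names in the usage note, and the helper names `probe_url`, `fetch_status`, and `healthy_protocols` are assumptions for illustration, not something the slides specify.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def probe_url(host: str, vdir: str) -> str:
    """Build the assumed per-protocol health probe URL."""
    return f"https://{host}/{vdir}/healthcheck.htm"


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        # URLError is a subclass of OSError; covers DNS, TCP, and TLS failures.
        return 0


def healthy_protocols(host: str, vdirs: Iterable[str],
                      status_of: Callable[[str], int] = fetch_status) -> Dict[str, bool]:
    """Map each virtual directory to True when its probe returns HTTP 200."""
    return {vdir: status_of(probe_url(host, vdir)) == 200 for vdir in vdirs}
```

For example, `healthy_protocols("mail.contoso.com", ["owa", "ews"])` would probe two protocols separately, which is the per-protocol visibility a Layer 7 configuration can act on.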
Single Namespace / Layer 4
[Diagram: User connects to mail.contoso.com and autodiscover.contoso.com through a Layer 4 LB to CAS, with a health check. CAS protocols shown: OWA, ECP, EWS, EAS, OAB, RPC, MAPI, AutoD]
Single Namespace / Layer 7
[Diagram: User connects to mail.contoso.com and autodiscover.contoso.com through a Layer 7 LB to CAS, with a health check. CAS protocols shown: OWA, ECP, EWS, EAS, OAB, RPC, MAPI, AutoD]
Multiple Namespaces / Layer 4
[Diagram: User connects through a Layer 4 LB to CAS, with one namespace per protocol]
• mail.contoso.com: OWA
• ecp.contoso.com: ECP
• ews.contoso.com: EWS
• eas.contoso.com: EAS
• oab.contoso.com: OAB
• oa.contoso.com: RPC
• mapi.contoso.com: MAPI
• autodiscover.contoso.com: AutoD
Datacenter Switchovers and Failovers
Witness Server Placement
Choose based on business needs and available options
A witness server in a third location allows automatic recovery on datacenter loss; the third location's network infrastructure must have independent failure modes

Deployment scenario | Recommendations
DAG(s) deployed in a single datacenter | Locate witness server in the same datacenter as DAG members; can share one server across DAGs
DAG(s) deployed across two datacenters; no additional locations available | Locate witness server in primary datacenter; can share one server across DAGs
DAG(s) deployed across two+ datacenters | Locate witness server in third location; can share one server across DAGs
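
The placement recommendations above amount to a small decision rule. The Python sketch below encodes them as read from the table; the function name `witness_location` and its parameters are hypothetical, and the mapping of "no third location" to the primary datacenter for any multi-datacenter DAG is an assumption that extends the two-datacenter row.

```python
def witness_location(dag_datacenters: int, third_location_available: bool) -> str:
    """Return the recommended witness server location for a DAG.

    dag_datacenters: number of datacenters the DAG members are deployed in.
    third_location_available: whether a separate location with independent
    network failure modes exists to host the witness.
    """
    if dag_datacenters < 1:
        raise ValueError("a DAG must be deployed in at least one datacenter")
    if dag_datacenters == 1:
        return "same datacenter as DAG members"
    if third_location_available:
        return "third location"
    # Assumption: with no independent third location, fall back to the primary datacenter.
    return "primary datacenter"
```

In every case the table notes that one witness server can be shared across DAGs, so the rule does not depend on how many DAGs exist.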
Site Resilience - CAS
With multiple VIP endpoints sharing the same namespace, if one VIP fails, clients automatically fail over to the alternate VIP!
Removing the failing IP from DNS puts you in control of the in-service time of the VIP
[Diagram: mail.contoso.com initially resolves to 192.168.1.50, 10.0.1.50; after the failing IP is removed, mail.contoso.com resolves to 10.0.1.50. cas1 and cas2 are in Redmond; cas3 and cas4 are in Portland.]
Site Resilience - Mailbox
Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur
[Diagram: mbx1 and mbx2 in Redmond; mbx3 and mbx4 in Portland; a witness server]
Site Resilience - Mailbox
1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Service clussvc
3. Activate DAG members in the second datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland
[Diagram: mbx1 and mbx2 in Redmond; mbx3 and mbx4 in Portland]
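
The ordering of the three switchover steps matters, and step 2 has to run on every surviving DAG member. The Python sketch below only assembles that ordered command list for review; it does not run anything. The helper name `switchover_plan` is hypothetical, and step 2 is written as `Stop-Service clussvc`, which is the usual way to stop the Cluster service from PowerShell.

```python
from typing import List, Tuple


def switchover_plan(dag: str, failed_site: str, surviving_site: str,
                    surviving_members: List[str]) -> List[Tuple[str, str]]:
    """Return (where to run, command) pairs in the order the steps should run."""
    plan = [
        ("Exchange Management Shell",
         f"Stop-DatabaseAvailabilityGroup {dag} -ActiveDirectorySite:{failed_site}"),
    ]
    # Step 2 is per server: stop the Cluster service on each surviving member.
    for server in surviving_members:
        plan.append((server, "Stop-Service clussvc"))
    plan.append(
        ("Exchange Management Shell",
         f"Restore-DatabaseAvailabilityGroup {dag} -ActiveDirectorySite:{surviving_site}"),
    )
    return plan
```

For the Redmond and Portland example, `switchover_plan("DAG1", "Redmond", "Portland", ["mbx3", "mbx4"])` yields four entries: one stop, two Cluster service stops, and one restore.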
Activation Block Comparison

Tool | Parameter | Value | Instance | Usage
Suspend-MailboxDatabaseCopy | ActivationOnly | N/A | Per database copy | Keep active off a working but questionable drive
Set-MailboxServer | DatabaseCopyAutoActivationPolicy | “Blocked” or “Unrestricted” | Per server | Used to control active/passive SR configurations and maintenance; can force admin move
Set-MailboxServer | DatabaseCopyActivationDisabledAndMoveNow | $true or $false | Per server | Used to do faster site failovers and maintain database availability; databases are not blocked from failing back; continuous move-off operation
DatabaseDisabledAndMoveNow
Last resort to not move an active!
Databases mounted and mail delivery!
Best Practices
Think of it as rack/site maintenance
Flip the bit! Don’t ask repair times, “if outage go…”
Humans are the biggest threat to recovery times
Dynamic Quorum and DAGs
Dynamic Quorum
When a node shuts down or crashes, the node loses its quorum vote
When a node rejoins the cluster, it regains its quorum vote
Dynamic Quorum
This is referred to as a “Last Man Standing” scenario
Dynamic Quorum
To continue running, the cluster must always maintain quorum after a node shutdown or failure
Dynamic Quorum
Majority of 7: 4 required
[Diagram sequence: failed nodes are marked X. As more nodes fail, the slides show "Majority of 3 required" and then "Majority of 2 required". In the final frames two nodes are labeled 0 and 1, and the labels swap between frames.]
Dynamic Quorum
0 = does not have quorum vote
1 = has quorum vote
Get-ClusterNode <Name> | ft name, *weight, state
Name DynamicWeight NodeWeight State
---- ------------- ---------- -----
EX1  1             1          Up
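
The majority arithmetic behind these slides can be sketched in a few lines of Python. This is a simplified model, not how the cluster service is implemented: it assumes each voter has one vote, that nodes fail one at a time, and that dynamic quorum removes a failed node's vote before the next failure. It ignores the witness and the two-node case where the cluster sets one node's dynamic weight to 0. The function names are hypothetical.

```python
from typing import List, Tuple


def votes_required(voters: int) -> int:
    """Votes needed for a majority among the given number of voters."""
    if voters < 1:
        raise ValueError("at least one voter is required")
    return voters // 2 + 1


def has_static_quorum(total_voters: int, voters_up: int) -> bool:
    """Without dynamic quorum, the majority is always computed over all voters."""
    return voters_up >= votes_required(total_voters)


def dynamic_quorum_trace(initial_voters: int, failures: int) -> List[Tuple[int, int]]:
    """Return (remaining voters, votes required) after each one-at-a-time failure."""
    if failures >= initial_voters:
        raise ValueError("at least one voter must survive")
    trace = []
    voters = initial_voters
    for _ in range(failures):
        voters -= 1  # the failed node loses its quorum vote
        trace.append((voters, votes_required(voters)))
    return trace
```

With 7 voters, `votes_required(7)` is 4, so a static cluster stops once only 3 voters are up, while `dynamic_quorum_trace(7, 5)` shows the requirement shrinking as votes are removed one failure at a time.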
Dynamic Quorum
Third-party replication DAGs not tested
Dynamic Quorum
Generally increases the availability of the cluster
Enabled by default, strongly recommended to leave enabled
Allows the cluster to continue running in failure scenarios that are not possible when this option is disabled
Leave it enabled for majority of DAG members
In some cases where a Windows Server 2008 R2 DAG would have lost quorum, a Windows Server 2012 DAG can maintain quorum
Don’t factor it into availability plans
Dynamic Witness
Windows Server 2012 R2 and later
Witness Offline
Witness vote gets removed by the cluster
Witness Failure
Witness vote gets removed by the cluster
Witness Online
If necessary, Witness vote is added back by the cluster
Questions?