MVP Technical Series
Surviving Disaster: Building Site Resiliency for Exchange 2013
Manu Philip (Exchange Server MVP)
New York: 631-345-5292 • Limerick: +353-61-260-101 • Hannover: +49-511-367393-0 • Singapore: +65-62222429

MVP Technical Series schedule
• Today: Surviving Disaster: Building Site Resiliency for Exchange 2013, Manu Philip
• 14th Mar: Lync 2010 Design Essentials, David Lim
• 4th Apr: Ready for Your Cloud: Windows Server 2012’s Way Designed for You, Erdal Ozkaya

Manu Philip
Microsoft Most Valuable Professional (MVP): Exchange Server
• Microsoft Most Valuable Professional (MVP) for Exchange (for the past 3 years)
• Exchange Solution Expert, contracting with global enterprise clients to assist in their Exchange deployments
• Founder of www.exchangeonline.in and www.windowsadmin.in
• Moderator of most of the Microsoft TechNet Community Exchange Server forums
• MCC (Microsoft Community Contributor) of the Microsoft TechNet Community Exchange Server forums

Planning for Site Resiliency in Exchange Environment
The fundamental success of any business depends on a highly resilient messaging system.
• The design involves identifying a highly resilient solution that meets the SLA targets.
• Recovery Time Objective (RTO): how long it takes to restore the service to the users.
• Recovery Point Objective (RPO): how current the data is after the recovery operation has completed.

SLA factors determine the site resilience questions:
• If the primary datacenter fails, what level of service is required?
• After the failure, can users manage with messaging services only for a period of time?
• How many users should be covered by the resiliency plan?
• How will the data be made available to the users?
• How soon should the standby datacenter be activated?
• How will service be moved back to the primary datacenter?

General Considerations:
• Servers in the secondary datacenter must be capable of hosting the failed-over mailboxes.
• The necessary network configuration must be in place to support the datacenter switchover.
• A testing method should also be defined in the SLA and validated periodically.

Site Resiliency overview: Exchange 2013
Site resilience is much better in Exchange 2013 because it has been simplified.
• Exchange 2013 has undergone significant architectural changes from the ground up in order to enhance its site resilience capabilities.
• Site resiliency features can be classified under:
  – Storage/Database Architecture
  – Client Access Architecture
  – Transport/Routing Architecture
• With the namespace simplification, Exchange 2013 provides new site resilience options, such as the ability to use a single global namespace.
• In addition, Exchange 2013 can be configured for automatic failover in response to failures that required manual intervention in Exchange 2010.

What’s in Exchange 2013 to support advanced Site Resiliency?
• Multi-site DAG configuration
• Datacenter Activation Coordinator (DAC)
• Lagged Replication Copies
• Single Global namespace (DNS)
• Safety Net
• Multi-site SSL

Multi-site DAG
• New DAG version number: 2.0.
• A Database Availability Group (DAG) provides both High Availability and Site Resilience features.
• DAG improvements:
  – DAG configuration is now easier, and its stability has been improved as well.
  – The transaction log creation code has been completely rewritten.
  – Many new and enhanced PowerShell cmdlets are available to perform DAG operations in various situations.
  – The DAG relies on Windows Server failover clustering and uses a quorum witness to act as a tie-breaker.

Quorum for Exchange Server 2013 Database Availability Groups
Quorum can be treated as a voting process in which a majority of voting members must be present to make a decision. For a DAG, the decision is basically whether the DAG should be online or offline.
Because a majority of votes is required for quorum, one of two quorum models is used depending on how many DAG members you have.
• For a DAG with an odd number of members, the Node Majority quorum mode is used.
• For a DAG with an even number of members, the Node and File Share Majority quorum mode is used. This mode involves an additional server referred to as the File Share Witness. It is typically another Exchange server (not itself a DAG member) located in the same site as the DAG members.

Multi-site DAG
• A DAG deployed for site resilience spans multiple datacenters.
• Its objective is usually to keep mailbox services available after the complete failure of the primary datacenter. In other words, a true disaster.

New DAG Cmdlets (cmdlets with new parameters)
• Set-DatabaseAvailabilityGroup: AutoDagDatabaseCopiesPerVolume, AutoDagDatabasesRootFolderPath, AutoDagVolumesRootFolderPath, ManualDagNetworkConfiguration, ReplayLagManagerEnabled
• Restore-DatabaseAvailabilityGroup: UsePrimaryWitnessServer
• Update-MailboxDatabaseCopy: CancelSeed
• Set-MailboxDatabase: AutoDagExcludeFromMonitoring
• Add-DatabaseAvailabilityGroupServer: SkipDAGValidation
• Get-MailboxDatabaseCopyStatus: UseServerCache

Datacenter Activation Coordinator (DAC)
• Split-brain syndrome: when a DAG spans multiple datacenters and an outage such as a network failure leaves DAG members unable to receive heartbeat signals from each other, DAC prevents the DAG from automatically mounting databases.
• Example: the first datacenter contains two DAG members and the witness server, and the second datacenter contains two other DAG members with an alternate witness server. If the first datacenter is restored without network connectivity to the second datacenter and DAC mode is not enabled, the active databases within the DAG may enter a split-brain condition.
• DAC mode controls the database mount behavior of a DAG at startup after a catastrophic failure.
• DAC mode is disabled by default.
• DAC uses a protocol called Datacenter Activation Coordination Protocol (DACP).

How does DAC work?
• After a catastrophic failure, DACP is used to determine the current state of the DAG and whether Active Manager should attempt to mount the databases.
• Active Manager stores a bit in memory (either a 0 or a 1) that tells the DAG whether it's allowed to mount local databases that are assigned as active on the server.
• When a DAG is running in DAC mode, the bit is set to 0 each time Active Manager starts up, meaning it isn't allowed to mount databases.
• In DAC mode, the server must try to communicate with all other DAG members that it knows of until another member tells it whether it can mount the local databases that are assigned as active to it.
• If another server responds that its bit is set to 1, servers are allowed to mount databases, so the starting server sets its own bit to 1 and mounts its databases.
• DAC mode also enables the built-in site resilience cmdlets used to perform datacenter switchovers:
  – Stop-DatabaseAvailabilityGroup
  – Restore-DatabaseAvailabilityGroup
  – Start-DatabaseAvailabilityGroup

Enabling DAC mode
• DAC mode can be enabled only from the Exchange Management Shell, by using the Set-DatabaseAvailabilityGroup cmdlet. For example:
  Set-DatabaseAvailabilityGroup -Identity DAG2 -DatacenterActivationMode DagOnly

Lagged Database Copies
• A lagged database copy is one that is not updated by replaying transactions as they become available. Instead, the transaction logs are kept for a set period and are then replayed.
• A lagged copy remains a preset time period behind the live database (up to 14 days) and provides a recovery option if the active mailbox database copy encounters corruption.
• An organization can enhance the resiliency of its database solution by employing a combination of non-lagged and lagged database copies in a DAG.
• The best thing about a DAG is that you can achieve resilience against failure by creating multiple copies of databases that Exchange keeps up to date through log shipping.
• However, some vendor guidance recommends that the second passive copy be lagged.

Lagged copy enhancements in Exchange 2013
• Lagged copy enhancements include integration with Safety Net (a feature similar to the transport dumpster in Exchange 2010).
• Activating a lagged database copy becomes significantly easier because it uses Safety Net.
• For example, consider a lagged copy that has a 2-day replay lag. In that case, you would configure Safety Net for a period of 2 days. If you encounter a situation in which you need to use your lagged copy:
  – Mount the lagged copy.
  – This triggers an automatic request to Safety Net to redeliver the last two days of mail.
  – You get the last two days of mail, minus the data ordinarily lost on a lossy failover.
• Lagged copies can now care for themselves by invoking automatic log replay to play down the log files in certain scenarios:
  – When a low disk space threshold is reached
  – When the lagged copy has physical corruption and needs to be page patched
  – When there are fewer than three available healthy copies (active or passive) for more than 24 hours
• Lagged copy play down behavior is disabled by default, and can be enabled by running the following command:
  Set-DatabaseAvailabilityGroup <DAGName> -ReplayLagManagerEnabled $true
• After being enabled, play down occurs when there are fewer than 3 available healthy copies. You can change the default value of 3 by modifying the following registry value:
  HKLM\Software\Microsoft\ExchangeServer\v15\Replay\Parameters\ReplayLagManagerNumAvailableCopies
• To enable play down for low disk space thresholds, configure the following registry value:
  HKLM\Software\Microsoft\ExchangeServer\v15\Replay\Parameters\ReplayLagPlayDownPercentDiskFreeSpace

SafetyNet
• The transport dumpster in Exchange 2010 has been improved in Exchange 2013 and is now called Safety Net.
• Safety Net itself is redundant, and is no longer a single point of failure.
• Safety Net is a queue that stores copies of messages that were successfully processed by the server, in case a processed message is corrupted in transit or fails to reach its destination.
• A Shadow Safety Net is a redundant copy of the Primary Safety Net. It is stored on another Mailbox server in the same site to provide further redundancy, and it takes over when the Primary Safety Net is unavailable.
• You can specify how long Safety Net stores copies of the successfully processed messages before they expire and are automatically deleted. The default is 2 days.
• For Mailbox servers that don't belong to a DAG, Safety Net stores copies of the delivered messages on other Mailbox servers in the local Active Directory site.
• You can't specify a maximum size limit for Safety Net. You can only specify how long Safety Net stores messages before they're automatically deleted.

SafetyNet Parameters
Parameter | Default value | Description
SafetyNetHoldTime on Set-TransportConfig | 2 days | The length of time successfully processed primary messages are stored in Primary Safety Net, and acknowledged shadow messages are stored in Shadow Safety Net.
ReplayLagTime on Set-MailboxDatabaseCopy | Not configured | The amount of time that the Microsoft Exchange Replication service should wait before replaying log files that have been copied to the passive database copy.
MessageExpirationTimeout on Set-TransportService | 2 days | How long a message can remain in a queue before it expires.
ShadowRedundancyEnabled on Set-TransportConfig | $true | $true enables shadow redundancy on all transport servers in the organization. $false disables shadow redundancy on all transport servers in the organization.

SafetyNet Architecture
1. Mailbox01 receives a message from an SMTP server.
2. Mailbox01 initiates a new SMTP session to Mailbox03, and Mailbox03 makes a shadow copy of the message.
3. The Mailbox Transport service delivers the message to the local mailbox database.
4. Mailbox01 queues a discard status for Mailbox03 that indicates the primary message was successfully processed, and Mailbox01 moves a copy of the primary message into Primary Safety Net.
5. Mailbox03 periodically polls Mailbox01 for the discard status of the primary message.
6. When Mailbox03 learns that Mailbox01 successfully processed the primary message, Mailbox03 moves the shadow message into the Shadow Safety Net.
7. The message is retained in Primary Safety Net and Shadow Safety Net until the configured hold time expires.

Single Global namespace (Multiple Virtual IP (VIP)-to-Name mappings)
• Datacenter switchovers in Exchange 2010 are operationally complex because recovery of mailbox data (DAG) and client access (namespace) are tied together.
• In Exchange 2010, if you lost all or a significant portion of your CAS, the VIP for the array, or a significant portion of your DAG, you needed to do a datacenter switchover.
• In Exchange 2013, a client can receive multiple IP addresses from DNS for a given FQDN. Since almost all client access in Exchange 2013 now relies on HTTP (Outlook, Outlook Anywhere, EAS, EWS, OWA, and EAC), if the first IP address fails, the HTTP client tries the next one, and so on.
• If a virtual IP of a CAS array fails, the client can automatically connect to the other IPs to access the same service in a matter of seconds, instead of waiting minutes for DNS to fail over.
• For example, if a client tries one address and it fails, it waits about 20 seconds and then tries the next one in the list. Thus, if you lose the VIP for the Client Access server array, recovery for the clients happens automatically, in about 21 seconds.
• If one VIP fails, clients automatically fail over to the alternate VIP and just work!
• Removing the failing IP from DNS puts you in control of the in-service time of the VIP.
[Figure: mail.domain.com resolves to 192.168.1.50 and 10.0.1.50. VIP 192.168.1.50 fronts CAS1 and CAS2; VIP 10.0.1.50 fronts CAS3 and CAS4 in the secondary datacenter.]

• If you lose your CAS array, you don't need to perform a datacenter switchover. Clients are automatically redirected to a second datacenter that has operating Client Access servers, and those servers proxy the communication back to the user's Mailbox server, which remains unaffected by the outage (because you don't do a switchover).
• Instead of working to recover service, the service recovers itself and you can focus on fixing the core issue.
• If you lose the load balancer in your primary site, you simply turn it off (or turn off the VIP) and repair or replace it.
• Clients that aren't already using the VIP in the secondary datacenter automatically fail over to the secondary VIP without any change of namespace, and without any change in DNS.
• Not only does that mean you no longer have to perform a switchover, it also means that the time normally associated with a datacenter switchover recovery isn't spent.
• In Exchange 2013, you get fast failover (about 20 seconds) of the namespace between VIPs (datacenters), so you don't need to wait for DNS changes.

Example DAG Design:
• Because you can fail over the namespace between datacenters, all that's needed to achieve a datacenter failover is a mechanism for failover of the Mailbox server role across datacenters.
• To get automatic failover for the DAG, you architect a solution where the DAG is evenly split between two datacenters, and then place the witness server in a third location so that it can be arbitrated by DAG members in either datacenter, regardless of the state of the network between the datacenters that contain the DAG members.
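The reasoning behind the third-location witness can be illustrated with a small vote count. The following Python sketch is only a simplified model of majority quorum, not Exchange or Windows clustering code: the server names (MBX1 to MBX4, WITNESS), the site names (DC1 to DC3) and the has_quorum helper are hypothetical, and features such as dynamic quorum are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voter:
    """One quorum vote: a DAG member or the witness server (hypothetical names)."""
    name: str
    site: str


def has_quorum(voters, reachable_sites):
    """True when the voters in the reachable sites form a strict majority."""
    reachable = sum(1 for v in voters if v.site in reachable_sites)
    return 2 * reachable > len(voters)


# Assumed layout: DAG split evenly across DC1 and DC2.
members = [
    Voter("MBX1", "DC1"), Voter("MBX2", "DC1"),
    Voter("MBX3", "DC2"), Voter("MBX4", "DC2"),
]
third_site_witness = members + [Voter("WITNESS", "DC3")]
primary_site_witness = members + [Voter("WITNESS", "DC1")]

# DC1 fails completely; DC2 can still reach the witness in DC3 (3 of 5 votes).
survives_with_third_site = has_quorum(third_site_witness, {"DC2", "DC3"})
# Same failure, but the witness lived in DC1 and went down with it (2 of 5 votes).
survives_with_local_witness = has_quorum(primary_site_witness, {"DC2"})
# Network split: DC1 is cut off from both DC2 and the witness (2 of 5 votes).
isolated_dc1_keeps_quorum = has_quorum(third_site_witness, {"DC1"})

print(survives_with_third_site, survives_with_local_witness, isolated_dc1_keeps_quorum)
```

In this model, only the layout with the witness in a third location keeps a majority when either datacenter is lost, which is why that design allows the DAG to fail over without administrator intervention.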
Deal with intermittent failures:
• An intermittent failure requires some extra administrative action, because it might be the result of a replacement device being put into service.
• In this scenario, the administrator can perform a namespace switchover by simply removing the VIP for the device being replaced from DNS. During that service period, no clients will try to connect to it.
• After the replacement process has completed, the administrator can add the VIP back to DNS, and clients will eventually start using it.

Multi-site SSL Certificate Considerations
• Minimize the number of certificates used for the Exchange servers and reverse proxies; where possible, use a single certificate for all the services. This minimizes the cost and complexity of the solution.
• Use a single SAN certificate for each datacenter (include multiple host names in the certificate).
• Use the same certificate principal name on each certificate to ensure Outlook Anywhere connectivity after a failover has occurred.
• Configure the Outlook provider configuration object in AD with the same principal name in Microsoft-standard form (msstd). For example:
  Set-OutlookProvider EXPR -CertPrincipalName "msstd:mail.domain.com"

Conclusions
• The DAG continues to play a major role in supporting the High Availability and Site Resiliency features of Exchange Server.
• Integrating lagged copies with Safety Net (the successor to the transport dumpster) is a good move by the Exchange Server product development team.
• Single global namespace support is the biggest and key feature of Exchange Server 2013's site-wide resiliency capability.
• While the Exchange 2010 design guidelines can apply to an Exchange 2013 organization, the enhanced Exchange 2013 design guidelines cannot be applied to Exchange 2010. All of the new behaviours and design options apply to Exchange 2013 only.
About KEMP Technologies

Who We Are
• Established in 2000
• US-HQ | NY, EMEA-HQ | Ireland, APAC-HQ | Singapore
• 470% Growth in 5 years
• Market Segment Leader

What We Do
• DNS & GEO-IP Load Balancing
• Server Load Balancing
• SSL Offload & Acceleration
• Application Health Checks
• High Availability
• Configuration Templates for Exchange
• Pre-Authentication & Forms-based Authentication (COMING SOON!)

Get a LoadMaster Today!
• Download a Virtual LoadMaster GEO: http://www.kemptechnologies.com/ap/server-load-balancingappliances/vlmgeo-loadmaster/vlm-download.html
• Download a Virtual LoadMaster: http://www.kemptechnologies.com/ap/server-load-balancingappliances/virtual-loadbalancer/vlm-download.html
• Sign up for our Edge Security Pack BETA: http://www.kemptechnologies.com/ap/tmg-edge-securityauthentication.html

Q&A