
Exchange 2013 Managed Availability

Introduction to Managed Availability:
Part I - An Exchange Administrator‘s task?
Microsoft introduced a new built-in monitoring system called Managed Availability in Exchange 2013, which automatically
takes recovery actions for unhealthy services within the Exchange organization.
Microsoft has been operating a cloud version of Exchange since 2007 and has put that operational knowledge into Managed Availability. Managed Availability is a cloud-trained system built around the end user's experience and recovery-oriented computing.
Managed Availability doesn't mean you no longer have to monitor your on-premises or hybrid Exchange environment; in fact, it's just the opposite. The long and complex PowerShell cmdlets used to monitor Exchange (which we will look at in more detail later) are not the best and most effective method to do so.
Exchange 2013, or more precisely the Exchange Diagnostics Service (EDS), collects a lot of performance data by default. Over 3,000 performance counters are compiled over seven days. Data is collected in the folder %ExchangeInstallPath%\Logging\Diagnostics\PerformanceLogsToBeProcessed and merged into the daily performance log on a regular basis by the Microsoft Exchange Diagnostics service. The merged logs are stored under %ExchangeInstallPath%\Logging\Diagnostics\DailyPerformanceLogs as .blg files that can be opened in PerfMon. Managed Availability uses these files, among others, to track the health of system components. The performance
counters are saved for 7 days or until 5 GB of data is reached by default. You can change these settings in the file called
Microsoft.Exchange.Diagnostics.Service.exe.config located in the bin directory of your Exchange
installation path:
<add Name="DailyPerformanceLogs" LogDataLoss="True" MaxSize="5120" MaxSizeDatacenter="2048" MaxAge="7.00:00:00" CheckInterval="08:00:00" />
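To make this concrete, here is a minimal sketch of how you could peek into those .blg files with the built-in Import-Counter cmdlet. The folder path is an assumption for a default installation; adjust it to your own Exchange install path.

# List the counter sets contained in the newest daily performance log (path assumes a default install)
$logFolder = "C:\Program Files\Microsoft\Exchange Server\V15\Logging\Diagnostics\DailyPerformanceLogs"
$latestLog = Get-ChildItem -Path $logFolder -Filter *.blg | Sort-Object LastWriteTime | Select-Object -Last 1
Import-Counter -Path $latestLog.FullName -ListSet * | Select-Object -ExpandProperty CounterSetName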
Managed Availability has multiple HealthSet models that are responsible for different services, such as:
Client Protocols: OWA/ECP, ActiveSync, IMAP/POP, UM, Outlook, Compliance
Storage: DataProtection, Clustering, PublicFolders, SiteMailbox, Store
Mail Flow: FrontEndTransport, HubTransport, MailboxTransport, Deployment
Migration: MigrationMonitor, MRS
Fabric: Diskspace, MailboxSpace, ActiveDirectory, UserThrottling
The main constituents of Managed Availability are Probes, Monitors, and Responders.

Probes run every few minutes against different services, check their health, and collect data from the server. These results flow into the Monitoring component of Managed Availability. An Exchange 2013 multi-role server is defined by hundreds of probes, and in most cases these Probes are not directly discoverable. This means that most of the Probes are defined within the Exchange program code and cannot be changed. For example, customers reported that the AutoDiscoverSelfTestProbe failed when the ExternalUrl for the EWS virtual directory wasn't set, and there was no way to change the probe settings. Microsoft resolved this issue in Cumulative Update 6. The Probes write an informational event to the Microsoft.Exchange.ActiveMonitoring\ProbeResult crimson channel with the following result types (a quick way to tally them is sketched after the list):
1 = Timeout
2 = Poisoned
3 = Succeeded
4 = Failed
5 = Quarantined
6 = Rejected
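As a quick, hedged sketch (the -MaxEvents value is arbitrary because this channel is very busy), you can tally recent probe results by ResultType like this:

# Group the most recent ProbeResult events by ResultType (3 = Succeeded, 4 = Failed, ...)
$probeResults = Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -MaxEvents 500
($probeResults | ForEach-Object { [xml]$_.ToXml() }).event.userData.eventXml | Group-Object ResultType | Sort-Object Name | Format-Table Name, Count -AutoSize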
Probes are divided into three categories:
Recurring Probes: system-performed tests of the end-to-end user experience, such as OWA connectivity.
Notifications: components that perform their own monitoring outside the health manager framework by writing probe results directly. For example, the MSExchangeDAGMgmt service logs probe results without Managed Availability.
Checks: collect data from performance counters and log events if the defined thresholds are exceeded or not met.
Monitors are the central part of Managed Availability. All collected server data is examined to determine if action needs
to be taken based on a predefined rule set within the Monitors. Nearly all Monitors collect three types of data:
Direct notifications: Monitors become Unhealthy if a direct notification, for example from a service, changes the Monitor state
Probe results: Monitors become Unhealthy if a Probe fails
Performance counters: Monitors become Unhealthy if a performance counter is higher or lower than the defined threshold
Depending on the issue, a Monitor can either initiate a Responder or escalate the issue via an event log entry. Monitors can be in the following states:
Healthy: all collected Probe data is within the normal range
Unhealthy: an issue was detected; a recovery process has started or the issue has been escalated
Degraded: the Monitor has been in an unhealthy state for less than 60 seconds
Disabled: the Monitor was manually disabled by an administrator
Unavailable: the Microsoft Exchange Health Manager service doesn't get a query response from the Monitor
Repairing: set to inform Managed Availability (or a monitoring software) that corrective actions are in progress
Many Monitors have high thresholds of multiple Probe failures before becoming Unhealthy, to avoid incorrect recovery actions being taken by Managed Availability and the Responders. For problems that require manual intervention, take a look at the Microsoft.Exchange.ManagedAvailability\Monitoring crimson channel.
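A minimal sketch of that check, assuming you run it in the Exchange Management Shell on the server in question:

# Show the latest entries Managed Availability wrote to the Monitoring crimson channel
Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/Monitoring -MaxEvents 20 | Select-Object TimeCreated, Id, LevelDisplayName, Message | Format-List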
Responders act on the alerts generated by an Unhealthy Monitor and perform recovery actions, such as resetting an IIS application pool, initiating a database failover, or restarting a server. Managed Availability uses the following Responder types:
Restart Responder: terminates and restarts a service
Reset AppPool Responder: recycles an IIS AppPool
Failover Responder: performs a mailbox database or server failover
Bugcheck Responder: initiates a bugcheck of the server (forcing a reboot)
Offline Responder: takes a protocol on a server, such as MAPI/HTTP, out of service, so that it rejects client requests
Online Responder: takes a protocol on a server, such as MAPI/HTTP, back into production, so that it accepts client requests again
Escalate Responder: writes an event log entry to inform an administrator
Specialized Component Responders: specialized Responders that are unique to their component
If you would like to take a look at all recovery actions through the Managed Availability Responders, view the
Microsoft.Exchange.ManagedAvailability\RecoveryActionResults crimson channel.
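As a quick preview of what Part II covers in more depth, a hedged one-off check of the most recent recovery actions could look like this:

# List the most recent recovery actions carried out by Responders on this server
Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults -MaxEvents 10 | ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } | Format-Table Id, RequestorName, State, Result, EndTime -AutoSize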
This concludes part one of this article. In the second part, we will take a more practical approach to Managed Availability.
By using PowerShell we will show you how you can retrieve useful information from the massive amounts of data that
Managed Availability collects about your environment.
Part II - How to Check, Recover, and Maintain your Exchange Organization
Now that you've finished Part I of my three-part Managed Availability blog series, I will go a bit deeper and provide
some examples about the functionality and operability of Managed Availability. My virtual test lab contains a two-member
DAG based on Windows Server 2012 and Exchange 2013 CU6.
Identify Unhealthy Health Sets and their error description
To get the server state, run the following cmdlet within the Exchange Management Shell:
Get-HealthReport -Server <ServerName> | where {$_.AlertValue -ne "Healthy" -and $_.AlertValue -ne "Disabled"}
This cmdlet shows multiple HealthSets, which are Unhealthy. In this example, let’s take a look at the
HealthSet Clustering, which has 5 Monitors.
Note: the property “NotApplicable” shows whether Monitors have been disabled by Set-ServerComponentState for their
component. Most Monitors are not dependent on this, and thereby report “NotApplicable.”
Because the Clustering HealthSet has 5 Monitors, we check which Monitors are in an Unhealthy state:
Get-ServerHealth -Identity <ServerName> -HealthSet Clustering
The Monitor ClusterGroupMonitor is in an Unhealthy state. To get all the information, especially the appropriate Probe, take a look at the Event Viewer data in a readable output with the following cmdlet:
(Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "*ClusterGroupMonitor*"}
This output has two important values that identify the "real problem" of the Clustering HealthSet:
SampleMask: defines the substring that results of the Probe ClusterGroupProbe for the Monitor ClusterGroupMonitor carry in their name, in this case "ClusterGroupProbe\MSExchangeRepl"
ScenarioDescription: shows more information about the issue
From the output above, we learn that the scenario being validated is "Validate HA health is not impacted by cluster related issues," and that is what we want to fix.
You can re-run some Probe checks with the cmdlet Invoke-MonitoringProbe <HealthSet>\<ProbeName> -Server <ServerName> | fl
Note: For reference, you can take a look at the Exchange 2013 Management Pack Health Sets: http://technet.microsoft.com/en-us/library/dn195892(v=exchg.150).aspx
Important: this cmdlet is only available if your Exchange servers are configured for time zones UTC and UTC-. The cmdlet doesn't work with time zones UTC+ (hopefully Microsoft will fix this issue in the near future).
Let’s take a further look at the Probe configuration for the ClusterGroupProbe:
(Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "*ClusterGroupProbe*"}
The next step is to identify the complete error message so that every administrator knows what he or she has to do:
$Errors = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='ClusterGroupProbe/MSExchangeRepl'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
$Errors | select -Property *time,result*,error*,*context
Result: the quorum resource "Cluster Group" is not online on server "xsrvmail2." Database Availability Group "E2K13-TAP" may not be reachable or may have lost redundancy.
Why couldn't Managed Availability resolve this issue on its own?
Managed Availability is a "self-healing" component of Exchange 2013. As described in the steps above, Responders are responsible for trying to repair the Exchange organization on its own, without administrator intervention. Let's take a look at which Responders are relevant for the Unhealthy Clustering HealthSet:
To display all Probes, Monitors, and Responders of the HealthSet Clustering, run the following cmdlet in the Exchange
Management Shell:
Get-MonitoringItemIdentity -Identity Clustering -Server <ServerName> | ft name,itemtype,targetresource -AutoSize
You can see 3 Escalate Responders, based on the “Name” attribute:
ClusterEndpointEscalate
ClusterServiceCrashEscalate
ClusterHangEscalate
To identify the correct Responder for our Monitor ClusterGroupMonitor, run the following cmdlet in the Exchange
Management Shell:
$DefinedResponders = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml
$DefinedResponders | ? {$_.AlertMask -like "*ClusterGroupMonitor*"} | fl Name,AlertMask,EscalationSubject,EscalationMessage,UpdateTime
As you can see in the screenshot above, the appropriate Responder is called ClusterGroupEscalate, shown with the parameters Name, AlertMask, EscalationSubject, EscalationMessage, and UpdateTime.
Remember: Escalate Responders write an entry to the Event Viewer to inform an administrator. This means that issues with the HealthSet Clustering cannot be recovered automatically through Managed Availability.
For completeness, let’s make an example with the HealthSet OWA.Protocol:
As you can see in the screenshot above, there are many more Responder types here than for the HealthSet Clustering.
To identify the correct Responder for our Monitor OwaSelfTestMonitor with all necessary information, run the following
cmdlet:
$DefinedResponders = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml
$DefinedResponders | ? {$_.AlertMask -like "*OwaSelfTestMonitor*"} | fl Name,AlertMask,EscalationSubject,EscalationMessage,UpdateTime
You can see two Responders:
OwaSelfTestEscalate: ping request failed and an administrative intervention is needed (Escalate)
OwaSelfTestRestart: this Responder carried out a recovery action (but what exactly?)
To find out the recovery action from Responder OwaSelfTestRestart let’s grab all information about the Responder
configuration:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "*OwaSelfTestRestart*"}
As you can see in the ThrottlePolicyXml parameter, which is customizable, there are several Responder settings:
RecycleApplicationPool "MSExchangeOWAAppPool": self-explanatory
ThrottleConfig Enabled: whether the ThrottlePolicyXml is enabled (True) or disabled (False)
LocalMinimumMinutesBetweenAttempts: the minimum number of minutes between attempts on this server
LocalMaximumAllowedAttemptsInOneHour: how many actions can be taken on this server within one hour
LocalMaximumAllowedAttemptsInADay: how many actions can be taken on this server within one day
GroupMinimumMinutesBetweenAttempts: the minimum number of minutes between attempts within the DAG or array
GroupMaximumAllowedAttemptsInADay: how many actions can be taken in the DAG or array within one day
Next, you should take a look at the Microsoft-Exchange-ManagedAvailability/RecoveryActionResults crimson channel for entries. Event 500 indicates that a recovery action has begun, and event 501 indicates that the action has completed.
Note: you have a better overview if you go directly into the Event Viewer in the log name Microsoft-Exchange-ManagedAvailability/RecoveryActionResults. For specific troubleshooting, I prefer the Exchange Management Shell.
$RecoveryActionResults = Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults
$XML = ($RecoveryActionResults | Foreach-Object -Process {[XML]$_.toXml()}).event.userData.eventXml
$XML | Where-Object {$_.State -eq "Finished" -and $_.RequestorName -eq "OwaSelfTestRestart"}
Note: you can filter the log to the current day if there are too many items logged. It's easy to filter on the EndTime property, as in the following cmdlet, where RequestorName is your appropriate Responder, such as "OwaSelfTestRestart":
$XML | Where-Object {$_.State -eq "Finished" -and $_.EndTime -like "2014-08-28T18*" -and $_.RequestorName -eq "OwaSelfTestRestart"}
The screenshot above demonstrates two different recovery actions:
- The first recycled the MSExchangeOWAAppPool (dating from last year; yes, Exchange works very well).
- The second additionally created a Watson dump in parallel because the MSExchangeOWAAppPool application pool crashed.
For a general overview of all recovery actions, take a look at the Microsoft-Exchange-ManagedAvailability/RecoveryActionLogs crimson channel:
I prefer to use the Event Viewer because it is clearer. But for those who like to search for specific recovery actions or at an
individual time, feel free to use the Exchange Management Shell and create your own additional filters if you need it:
$RecoveryActionLogs = Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionLogs
$XML = ($RecoveryActionLogs | Foreach-Object -Process {[XML]$_.toXml()}).event.userData.eventXml
$XML | Where-Object {$_.State -eq "Finished" -and $_.RequestorName -eq "OwaSelfTestRestart"}
– Happy Monitoring! =)
Part III - Local Monitoring Files and Overrides
Now that you’ve finished Part I & Part II of my three part Managed Availability blog series, I will now provide some
information about local .xml monitoring files and overrides of Managed Availability.
Local Managed Availability .xml monitoring files
Some HealthSets, such as the FEP HealthSet, are defined in local .xml files. Because FEP is the Forefront Endpoint Protection service, some of you may want to disable this HealthSet on servers where there is no use for it.
Browse to %ExchangeInstallationPath%\Microsoft\Exchange\V15\Bin\Monitoring\Config, search for FEPActiveMonitoringContext.xml and open the file with an editor, such as Notepad.
Change line 12 by replacing Enabled = True with Enabled = False.
Restart the Microsoft Exchange Health Management service on the server where you modified the .xml file.
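For those who prefer to script the change, here is a hedged sketch. It assumes the attribute on line 12 is written as Enabled = "True" and that $env:ExchangeInstallPath points at your Exchange installation folder; check the file in an editor before and after running it.

# Flip the Enabled attribute in the FEP monitoring definition and restart the health manager
$configFile = Join-Path $env:ExchangeInstallPath "Bin\Monitoring\Config\FEPActiveMonitoringContext.xml"
(Get-Content $configFile) -replace 'Enabled\s*=\s*"True"', 'Enabled = "False"' | Set-Content $configFile
Restart-Service MSExchangeHM   # Microsoft Exchange Health Manager service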
Overrides
With overrides, you can change the Managed Availability monitoring thresholds and define your own settings for when Managed Availability should take action in case of errors.
There are two kinds of overrides:
Local overrides: used to customize a component on a specific server, or components that aren't available globally. For example, if you are running multiple datacenters and would like to change server components only in a specific location for individual monitoring. Local overrides are managed with the *-ServerMonitoringOverride set of cmdlets. They are stored in the registry under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides\ and are automatically updated every 10 minutes; the Microsoft Exchange Health Management service reads the changes in that registry path.
Global overrides: used to customize a component for the whole Exchange organization. They are managed with the *-GlobalMonitoringOverride set of cmdlets. Global overrides are stored in Active Directory:
CN=Overrides,CN=Monitoring Settings,CN=FM,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=Xiopia,DC=local
You can set overrides for specific Exchange versions, such as CU6 with version "15.0.995.29". Such an override, set with the ApplyVersion parameter, remains effective until the Exchange version changes.
The other method is to set overrides for a specific timeframe. With Exchange 2013 CU6 you can set overrides for a
maximum of 365 days with the Duration parameter.
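A hedged example of the duration-based variant, using an illustrative identity (substitute the HealthSet\ItemName you actually need to override) and the hypothetical server name EX01:

# Disable a single Responder on one server for 30 days
Add-ServerMonitoringOverride -Server EX01 -Identity "Clustering\ClusterGroupEscalate" -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 30.00:00:00
# Review the overrides on that server, and remove the override once the issue is resolved
Get-ServerMonitoringOverride -Server EX01
Remove-ServerMonitoringOverride -Server EX01 -Identity "Clustering\ClusterGroupEscalate" -ItemType Responder -PropertyName Enabled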
Managed Availability and server reboots
Responders only execute in the event that a monitor is marked in an Unhealthy state and will try to recover that
component. Managed Availability provides multi-stage recovery actions:
1. Restart the application pool
2. Restart the service
3. Restart the server
4. Take the server offline so that it no longer accepts traffic
There are several types of responders available:
Restart Responder, Reset AppPool Responder, Failover Responder, Bugcheck Responder, Offline Responder, Escalate Responder, and Specialized Component Responders. In this article, I will be primarily discussing Restart Responders.
Restart responders are subject to throttling policies. This means, the responder definition contains a section
“ThrottlePolicyXML” which can be overridden if desired. For example, we use the “StoreServiceKillServer” responder. To
view the definitions, use the following cmdlet via EMS:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "StoreServiceKillServer"}
There are many parameters, such as ServiceName, CreatedTime, Enabled, MaxRetryAttempts, AlertMask, and so on. The
following section from the restart responder definition is important:
ThrottlePolicyXml:
<ForceReboot ResourceName="">
<ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="720" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="1" GroupMinimumMinutesBetweenAttempts="600" GroupMaximumAllowedAttemptsInADay="4" />
</ForceReboot>
The thresholds are self-explanatory. The only difference is "Local" versus "Group": Local means a single Exchange server, Group means more than one Exchange server in your organization. You have to check and configure the settings based on your needs.
To prevent a reboot, create a local or global override:
I was looking for a “*ForceReboot*” by Managed Availability and found the following Requester:
(Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/* | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.ActionID -like "*ForceReboot*"} | ft RequesterName
ServiceHealthMSExchangeReplForceReboot
Add-GlobalMonitoringOverride -Identity Exchange\ServiceHealthMSExchangeReplForceReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 90:00:00:00
To check the configuration changes, use the following cmdlet:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "ServiceHealthMSExchangeReplForceReboot"} | ft name,enabled
This prevents the server from a force reboot in case of errors with the “ServiceHealthMSExchangeRepl” health set. Enabled
must be “0” (instead of 1).
Inform Managed Availability about the repairing process
To inform Managed Availability (and your monitoring software too) that you are in a repairing process, use the following
cmdlet, specifying the appropriate Monitor with the Name parameter:
Set-ServerMonitor -Server <ServerName> -Name Maintenance -Repairing $true
After repairing:
Set-ServerMonitor -Server <ServerName> -Name Maintenance -Repairing $false
To avoid automatic recovery actions, you should disable the managed service using Set-ServerComponentState:
Set-ServerComponentState -Component RecoveryActionsEnabled -Identity <ServerName> -State Inactive -Requester Functional
After finishing recovery you have to enable the RecoveryActionsEnabled component with the following cmdlet:
Set-ServerComponentState -Component RecoveryActionsEnabled -Identity <ServerName> -State Active -Requester Functional
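A quick, hedged sanity check after re-enabling (EX01 is a placeholder server name); the State should read Active again:

Get-ServerComponentState -Identity EX01 -Component RecoveryActionsEnabled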
Managed Availability – an administrative overview
1. Configuration of Managed Availability
After installing Exchange 2013 in production, there might be some HealthSets in an Unhealthy state.
2. AlertValue “Unhealthy”
The first step is to get a HealthReport for the entire server:
Get-HealthReport -Server <ServerName> | where {$_.AlertValue -ne "Healthy"}
Note: The property “NotApplicable” shows whether Monitors have been disabled by Set-ServerComponentState
for their component. Most Monitors are not dependent on this, and report “NotApplicable”.
Let’s take a look at HealthSet “Autodiscover.Protocol” and why it’s in an Unhealthy state.
To get all information about the Autodiscover.Protocol HealthSet, we have to analyze the Monitoring Item
Identity:
Get-MonitoringItemIdentity -Identity Autodiscover.Protocol -Server <ServerName> | ft name,itemtype,targetresource -AutoSize
We can see all appropriate Probes, Monitors, and Responders regarding the Autodiscover.Protocol HealthSet.
Because the Autodiscover.Protocol HealthSet has 9 Monitors, we check which Monitors are in an Unhealthy
state:
Get-ServerHealth -Identity <ServerName> -HealthSet Autodiscover.Protocol
The Monitor AutodiscoverSelfTestMonitor is in an Unhealthy state.
To get all the information about the appropriate Monitor (AutodiscoverSelfTestMonitor), we collect the
information directly from the Eventlog in a readable output:
(Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "AutodiscoverSelfTestMonitor"}
This output has two important values for us:
The Monitor transitions to Unhealthy within 600 seconds. The ScenarioDescription parameter tells the administrator more details about the issue: "Validate EWS health is not impacted by any issues".
In most cases you can re-run the test with the following cmdlet:
Invoke-MonitoringProbe Autodiscover.Protocol\AutodiscoverSelfTestProbe -Server <ServerName> | fl
Note that not all Probes can currently be invoked manually with the Invoke-MonitoringProbe cmdlet. We hope Microsoft will fix this in the future.
We know that EWS has some issues and go further to take a look at the Probe definition for the
AutodiscoverSelfTestMonitor:
(Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "AutodiscoverSelfTestProbe"}
You can see how the Probe is configured to test the Autodiscover.Protocol HealthSet.
The next step is to collect the events in the ProbeResult event log crimson channel and filter them. In this
example, you look for failure results related to AutodiscoverSelfTestProbe:
$Errors = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='AutodiscoverSelfTestProbe/'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
$Errors | select -Property *time,result*,error*,*context
ResultTypes:
1 = Timeout
2 = Poisoned
3 = Succeeded
4 = Failed
5 = Quarantined
6 = Rejected
You can see the complete error message at “ExecutionContext”.
Note: to identify the correct ResultName parameter, you can take a look at the event log directly:
Expand Applications and Services Logs - Microsoft - Exchange - ActiveMonitoring - ProbeResult, search for "AutodiscoverSelfTestProbe", and filter to only the Error level events. On the relevant error event, click Details and take a look at the Error line:
System.Exception: Autodiscover Service failed to return the ExternalEWSUrl at Microsoft.Exchange.Monitoring.ActiveMonitoringProbe
Result: The ExternalEWSUrl value is empty. You have to set the value (it does not have to be reachable from the Internet) to avoid Managed Availability error messages.
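A hedged sketch of the fix: set the external URL on the EWS virtual directory and confirm it. The server name and host name are placeholders for your own namespace.

# Set and verify the EWS external URL so the probe no longer finds an empty ExternalEWSUrl
Set-WebServicesVirtualDirectory -Identity "EX01\EWS (Default Web Site)" -ExternalUrl "https://mail.contoso.com/EWS/Exchange.asmx"
Get-WebServicesVirtualDirectory -Server EX01 | Format-List Identity, InternalUrl, ExternalUrl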
It's not recommended to disable the AutodiscoverSelfTestProbe and its associated Monitors and Responders, because they cover many more important tests. So don't set a global or local override against the Probes, Monitors, and Responders of the Autodiscover.Protocol HealthSet.
Update 08/26/2014: Microsoft fixed this "issue" with Exchange Server 2013 Cumulative Update 6 (CU6); it no longer matters whether the ExternalEWSUrl is set or not.
3. Local Managed Availability .xml monitoring files
Some HealthSets, such as the FEP HealthSet, are defined in local .xml files. FEP is the Forefront Endpoint Protection service, and some of you may want to disable this HealthSet on your servers.
Browse to %ExchangeInstallationPath%\Microsoft\Exchange\V15\Bin\Monitoring\Config, search for FEPActiveMonitoringContext.xml and open the file with an editor, such as Notepad.
Change line 12 by replacing Enabled = True with Enabled = False.
Restart the Microsoft Exchange Health Management service on the server where you modified the .xml file.
4. Inform Managed Availability about the repairing process
To inform Managed Availability (and, for example, SCOM) that you are in a repairing process, use the following cmdlet and specify the appropriate Monitor with the -Name parameter:
Set-ServerMonitor -Server <ServerName> -Name Maintenance -Repairing $true
After repairing:
Set-ServerMonitor -Server <ServerName> -Name Maintenance -Repairing $false
To avoid automatic recovery actions, you should disable the managed service using Set-ServerComponentState:
Set-ServerComponentState -Component RecoveryActionsEnabled -Identity <ServerName> -State Inactive -Requester Functional
After finishing recovery you have to enable the RecoveryActionsEnabled component with the following cmdlet:
Set-ServerComponentState -Component RecoveryActionsEnabled -Identity <ServerName> -State Active -Requester Functional
5. Overrides
You can set two kinds of overrides: server overrides, which apply to a single server, and global overrides, which apply to all Exchange servers within your organization. You apply overrides using the Add-ServerMonitoringOverride or Add-GlobalMonitoringOverride cmdlet.
Important: you have to disable all of the appropriate Probes, Monitors, and Responders!
Example to deactivate the Public Folder monitoring from the Microsoft KB database (Exchange 2013 CU3):
Add-GlobalMonitoringOverride -Identity "Publicfolders\PublicFolderLocalEWSLogonEscalate" -ItemType "Responder" -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.775.38"
Add-GlobalMonitoringOverride -Identity "Publicfolders\PublicFolderLocalEWSLogonMonitor" -ItemType "Monitor" -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.775.38"
Add-GlobalMonitoringOverride -Identity "Publicfolders\PublicFolderLocalEWSLogonProbe" -ItemType "Probe" -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.775.38"
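As a hedged follow-up, you can list the global overrides that are currently in place and remove one again once the underlying problem is fixed:

Get-GlobalMonitoringOverride
Remove-GlobalMonitoringOverride -Identity "Publicfolders\PublicFolderLocalEWSLogonProbe" -ItemType Probe -PropertyName Enabled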
You can also change the ExtensionAttributes of any Probe, for example for the OnPremisesInboundProxy probe:
ExtensionAttributes: 127.0.0.1 25 InboundProxyProbe <Data AddAttributions="false"> X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250 Subject:Inbound proxy probe None
Local override:
Add-ServerMonitoringOverride -Server ServerName -Identity FrontEndTransport\OnPremisesInboundProxy -ItemType Probe -PropertyName ExtensionAttributes -PropertyValue 'Exch1.contoso.com 25 InboundProxyProbe X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250 Subject:Inbound proxy probe None' -Duration 45.00:00:00
Global override:
Add-GlobalMonitoringOverride -Identity “EWS.Proxy\EWSProxyTestProbe” -ItemType Probe -PropertyName
TimeoutSeconds -PropertyValue 25 –ApplyVersion “15.0.712.24”
6. Managed Availability and server reboots
Responders only execute in the event that a monitor is marked in an Unhealthy state and will try to recover that
component. Managed Availability provides multi-stage recovery actions:
1. Restart the application pool
2. Restart the service
3. Restart the server
4. Take the server offline so that it no longer accepts traffic
There are several types of responders available:
Restart Responder, Reset AppPool Responder, Failover Responder, Bugcheck Responder, Offline Responder, Escalate Responder, and Specialized Component Responders.
Today we will talk primarily about the Restart Responders. You can also read the entire Managed Availability documentation, which is available at http://blogs.technet.com/b/exchange/archive/2012/09/21/lessons-from-the-datacenter-managed-availability.aspx.
Restart responders are subject to throttling policies. This means, the responder definition contains a section
“ThrottlePolicyXML” and you can override them, if you like. For example, we use the “StoreServiceKillServer”
responder. To view the definitions, use the following cmdlet via EMS:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "StoreServiceKillServer"}
There are many parameters, such as ServiceName, CreatedTime, Enabled, MaxRetryAttempts, AlertMask, and so on. Important for us is the following section from the restart responder definition:
ThrottlePolicyXml:
<ForceReboot ResourceName="">
<ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="720" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="1" GroupMinimumMinutesBetweenAttempts="600" GroupMaximumAllowedAttemptsInADay="4" />
</ForceReboot>
The thresholds are self-explanatory. The only difference is “Local” and “Group”. Local means one Exchange
server, group means more than one Exchange server in your organization. You have to check and configure the
setting for your needs.
To prevent a reboot, create a local or global override:
Example:
I was looking for a “*ForceReboot*” by Managed Availability and found the following Requester:
(Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/* | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.ActionID -like "*ForceReboot*"} | ft RequesterName
ServiceHealthMSExchangeReplForceReboot
Add-GlobalMonitoringOverride -Identity Exchange\ServiceHealthMSExchangeReplForceReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 60:00:00:00
To check the configuration changes, use the following cmdlet:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "ServiceHealthMSExchangeReplForceReboot"} | ft name,enabled
This prevents the server from a force reboot. Enabled must be “0” (instead of 1).
Lessons from the Datacenter: Managed Availability
Monitoring is a key component for any successful deployment of Exchange. In previous releases, we
invested heavily in developing a correlation engine and worked closely with the System Center
Operations Manager (SCOM) product group to provide a comprehensive alerting solution for Exchange
environments.
Traditionally, monitoring involved collecting data and if necessary performing an action based on the
collected data. For example, in the context of SCOM, different mechanisms were used to collect data via
the Exchange Management Pack:
Type of Monitoring | Exchange 2010
Service not running | Health manifest rule
Performance counter | Health manifest counter rule
Exchange events | Health manifest event rule
Non-Exchange events | Health manifest event rule
Scripts, cmdlets | Health manifest script rule
Test cmdlets | Test cmdlets
Exchange Server 2013 Monitoring Goals
When we started the development of Exchange 2013, a key focus area was improving end-to-end monitoring for all Exchange deployments, from the smallest on-premises deployment to the largest deployment in the world, Office 365. We had three major goals in mind:
1. Bring our knowledge and experience from the Office 365 service to on-premises customers. For nearly 6 years, the Exchange product group has been operating Exchange Online. The operations model
we use is known as the developer operations model (DevOps). In this model, issues are escalated directly
to the developer of a feature if their feature is performing badly within the service or when a customer
reports an unknown problem that is escalated. Regardless of the problem origin, escalation directly to the
developer brings accountability to the development of the software by addressing the software defects.
As a result of using this model, we have learned a lot. For example, in the Exchange 2010
Management Pack, there are almost 1100 alerts. But over the years, we found that only around
150 of those alerts are useful in Office 365 (and we disable the rest). In addition, we found that
when a developer receives an escalation, they are more likely to be accountable and work to
resolving the issue (either through a code fix, through new recovery workflows, by fine tuning the
alert, etc.) because the developer does not want to be continuously interrupted or woken up to
troubleshoot the same problem over and over again. Within the DevOps model, we have a process where every week the on-calls hold a hand-off meeting to review the past week's incidents; the result is that in-house recovery workflows are developed, such as resetting IIS
application pools, etc. Before Exchange 2013, we had our own in-house engine that handled these
recovery workflows. This knowledge never made it into the on-premises products. With Exchange
2013, we have componentized the recovery workflow engine so that the learnings can be shared
with our on-premises customers.
2. Monitor based on the end user's experience. We also learned that a lot of the methodologies we used for monitoring really didn't help us to operate the environment. As a result, we are shifting to a customer-focused vision for how we approach monitoring.
In past releases, each component team would develop a health model by articulating all the
various components within their system. For example, transport is made up of SMTP-IN, SMTP-OUT, transport agents, categorizer, routing engine, store driver, etc. The component team would
then build alerts about each of these components. Then in the Management Pack, there are alerts
that let you know that store driver has failed. But what is missing is the fact that the alert doesn’t
tell you about the end-to-end user experience, or what might be broken in that experience. So in
Exchange 2013, we are attempting to turn the model upside down. We aren’t getting rid of our
system level monitoring because that is important. But what is really important, if you want to
manage a service, is what are your users seeing? So we flipped the model, and are endeavoring
to try and monitor the user experience.
3. Protect the user's experience through recovery oriented computing. With the previous releases of
Exchange, monitoring was focused on the system and components, and not how to recover and restore
the end user experience automatically.
Monitoring Exchange Server 2013 - Managed Availability
Managed Availability is a monitoring and recovery infrastructure that is integrated with Exchange’s high
availability solution. Managed Availability detects and recovers from problems as they occur and as they
are discovered.
Managed Availability is user-focused. We want to measure three key aspects – the availability, the
experience (which for most client protocols is measured as latency), and whether errors are occurring.
To understand these three aspects, let’s look at an example, a user using Outlook Web App (OWA).
The availability aspect is simply whether or not the user can access the OWA forms-based
authentication web page. If they can’t, then the experience is broken and that will generate a help desk
escalation. In terms of experience, if they can log into OWA, what is the experience like – does the
interface load, can they access their mail? The last aspect is latency–the user is able to log in and access
the interface, but how fast is the mail rendered in the browser? These are the areas that make up the
end user experience.
One key difference between previous releases and Exchange 2013, is that in Exchange 2013 our
monitoring solution is not attempting to deliver root cause (this doesn’t mean the data isn’t recorded in
the logs, or dumps aren’t generated, or that root cause cannot be discovered). It is important to
understand that with past releases, we were never effective at communicating root cause - sometimes
we were right, often times we were wrong.
Components of Managed Availability
Managed Availability is built into both server roles in Exchange 2013. It includes three main
asynchronous components. The first component is the probe engine. The probe engine’s responsibility
is to take measurements on the server. This flows into the second component, which is the monitor. The
monitor contains the business logic that encodes what we consider to be healthy. You can think of it as
a pattern recognition engine; it is looking for different patterns within the different measurements
we have, and then it can make a decision on whether something is considered healthy or not. Finally,
there is the responder engine, which I have labeled as Recover in the diagram below. When something is unhealthy, its first action is to attempt to recover that component. Managed Availability provides multi-stage recovery actions: the first attempt might be to restart the application pool, the second attempt might be to restart the service, the third attempt might be to restart the server, and the final attempt may be to take the server offline so that it no longer accepts traffic. If these attempts fail, Managed Availability then escalates the issue to a human through event log notification.
You may also notice we have decentralized a few things. In the past we had the SCOM agent on each
server and we’d stream all measurements to a central SCOM server. The SCOM server would have to
evaluate all the measurements to determine if something is healthy or not. In a high scale environment,
complex correlation runs hot. What we noticed is that alerts would take longer to fire, etc. Pushing
everything to one central place didn’t scale. So instead we have each individual server act as an island –
each server executes its own probes, monitors itself, and takes action to self-recover, and of course,
escalate if needed.
Figure 1: Components of Managed Availability
Probes
The probe infrastructure is made up of three distinct frameworks:
1. Probes are synthetic transactions that are built by each component team. They are similar to the test
cmdlets in past releases. Probes measure the perception of the service by executing synthetic end-to-end
user transactions.
2. Checks are a passive monitoring mechanism. Checks measure actual customer traffic.
3. The notify framework allows us to take immediate action rather than wait for a probe to execute. That way, if we detect a failure, we can act on it right away. The notify framework is based on notifications. For example, when a certificate expires, a notification event is triggered, alerting operations that the certificate needs to be renewed.
Monitors
Data collected by probes are fed into monitors. There isn’t necessarily a one-to-one correlation between
probes and monitors; many probes can feed data into a single monitor. Monitors look at the results of
probes and come to a conclusion. The conclusion is binary – either the monitor is healthy or it is
unhealthy.
As mentioned previously, Exchange 2013 monitoring focuses on the end user experience. To accomplish
that we have to monitor at different layers within the environment:
Figure 2: Monitoring at different layers to check user experience
As you can see from the above diagram, we have four different checks. The first check is the Mailbox
self-test; this probe validates that the local protocol or interface can access the database. The second
check is known as a protocol self-test and it validates that the local protocol on the Mailbox server is
functioning. The third check is the proxy self-test and it executes on the Client Access server, validating
that the proxy functionality for the protocol is functioning. The fourth and last check is the all-inclusive
check that validates the end-to-end experience (protocol proxy to the store functions). Each check
performs detection at different intervals.
We monitor at different layers to deal with dependencies. Because there is no correlation engine in
Exchange 2013, we try to differentiate our dependencies with unique error codes that correspond to
different probes and with probes that don’t include touching dependencies. For example, if you see a
Mailbox Self-Test and a Protocol Self-test probe both fail at the same time, what does it tell you? Does
it tell you store is down? Not necessarily; what it does tell you is that the local protocol instance on the
Mailbox server isn’t working. If you see a working Protocol Self-Test but a failed Mailbox Self-Test, what
does that tell you? That scenario tells you that there is a problem in the “storage” layer which may
actually be that the store or database is offline.
What this means from a monitoring perspective, is that we now have finer control over what alerts are
issued. For example, if we are evaluating the health of OWA, we are more likely to delay firing an alert
in the scenario where we have a failed Mailbox Self-Test, but a working Protocol Self-Test; however an
alert would be fired if both the Mailbox Self-Test and Protocol Self-Test monitors are unhealthy.
Responders
Responders execute responses based on alerts generated by a monitor. Responders never execute
unless a monitor is unhealthy.
There are several types of responders available:
Restart Responder: terminates and restarts a service
Reset AppPool Responder: cycles an IIS application pool
Failover Responder: takes an Exchange 2013 Mailbox server out of service
Bugcheck Responder: initiates a bugcheck of the server
Offline Responder: takes a protocol on a machine out of service
Escalate Responder: escalates an issue
Specialized Component Responders
The offline responder is used to remove a protocol from use on the Client Access servers. This responder
has been designed to be load balancer-agnostic. When this responder is invoked, the protocol will not
acknowledge the load balancer health check, thereby enabling the load balancer to remove the server
or protocol from the load balancing pool. Likewise, there is a corresponding online responder that is
automatically initiated once the corresponding monitor becomes healthy again (assuming there are no
other associated monitors in an unhealthy state) – the online responder simply allows the protocol to
respond to the load balancer health check, which enables the load balancer to add the server or protocol
back into the load balancer pool. The offline responder can also be invoked manually via the Set-ServerComponentState cmdlet. This enables administrators to manually put Client Access servers into maintenance mode.
When the escalate responder is invoked, it generates a Windows event that the Exchange 2013
Management Pack recognizes. It isn’t a normal Exchange event. It’s not an event that says OWA is broken
or we've had a hard IO. It's an Exchange event that says a health set is unhealthy or healthy. We use
single instance events like that to manipulate the monitors inside SCOM. And we’re doing this based on
an event generated in the escalate responder as opposed to events spread throughout the product.
Another way to think about it is a level of indirection. Managed Availability decides when we flip a
monitor inside SCOM. Managed Availability makes the decision as to when an escalation should occur,
or in other words, when a human should get engaged.
Responders can also be throttled to ensure that the entire service isn’t compromised. Throttling differs
depending on the responder:
Some responders take into account the minimum number of servers within the DAG or load balanced CAS pool
Some responders take into account the amount of time between executions.
Some responders take into account the number of occurrences that the responder has been initiated.
Some may use any combination of the above.
Depending on the responder, when throttling occurs, the responder’s action may be delayed or simply
skipped.
Recovery Sequences
It is important to understand that the monitors define the types of responders that are executed and the
timeline in which they are executed; this is what we refer to as a recovery sequence for a monitor. For
example, let’s say the probe data for the OWA protocol (the Protocol Self-Test) triggers the monitor to
be unhealthy. At this point the current time is saved (we’ll refer to this as T). The monitor starts a recovery
pipeline that is based on current time. The monitor can define recovery actions at named time intervals
within the recovery pipeline. In the case of the OWA protocol monitor on the Mailbox server, the recovery
sequence is:
1. At T=0, the Reset IIS Application Pool responder is executed.
2. If at T=5 minutes the monitor hasn’t reverted to a healthy state, the Failover responder is initiated and
databases are moved off the server.
3. If at T=8 minutes the monitor hasn’t reverted to a healthy state, the Bugcheck responder is initiated and
the server is forcibly rebooted.
4. If at T=15 minutes the monitor still hasn’t reverted to a healthy state, the Escalate responder is triggered.
The recovery sequence pipeline will stop when the monitor becomes healthy. Note that the last named
time action doesn’t have to complete before the next named time action starts. In addition, a monitor
can have any number of named time intervals.
System Center Operations Manager (SCOM)
System Center Operations Manager (SCOM) is used as a portal to see health information related to the
Exchange environment. Unhealthy states within the SCOM portal are triggered by events written to the
Application log via the Escalate Responder. The SCOM dashboard has been refined and now has three
key areas:
Active Alerts
Organization Health
Server Health
The Exchange Server 2013 SCOM Management Pack will be supported with SCOM 2007 R2 and SCOM
2012.
Overrides
With any environment, defaults may not always be the optimum condition, or conditions may exist that
require an emergency action. Managed Availability includes the ability to enable overrides for the entire
environment or on an individual server. Each override can be set for a specified duration or to apply to
a specific version of the server. The *-ServerMonitoringOverride and *-GlobalMonitoringOverride
cmdlets enable administrators to set, remove, or view overrides.
Health Determination
Monitors that are similar or are tied to a particular component’s architecture are grouped together to
form health sets. The health of a health set is always determined by the “worst of” evaluation of the
monitors within the health set – this means that if you have 9 monitors within a health set and 1 monitor
is unhealthy, then the health set is considered unhealthy. You can determine the collection of monitors
(and associated probes and responders) in a given health set by using the Get-MonitoringItemIdentity
cmdlet.
To view health, you use the Get-ServerHealth and Get-HealthReport cmdlets. Get-ServerHealth is used
to retrieve the raw health data, while Get-HealthReport operates on the raw health data and provides a
current snapshot of the health. These cmdlets can operate at several layers (a short sketch follows this list):
They can show the health for a given server, breaking it down by health set.
They can be used to dive into a particular health set and see the status of each monitor.
They can be used to summarize the health of a given set of servers (DAG members, or a load-balanced array of CAS).
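A short sketch of those three layers, following the Get-HealthReport -Server / Get-ServerHealth -Identity convention used elsewhere in this document (newer builds may expect -Identity on Get-HealthReport as well); EX01, OWA.Protocol, and DAG01 are placeholders.

# Server level: health per health set
Get-HealthReport -Server EX01
# Health-set level: status of each monitor inside one health set
Get-ServerHealth -Identity EX01 -HealthSet OWA.Protocol
# Group level: summarize health across the members of a DAG
Get-DatabaseAvailabilityGroup DAG01 | Select-Object -ExpandProperty Servers | ForEach-Object { Get-HealthReport -Server $_.Name }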
Health sets are further grouped into functional units called Health Groups. There are four Health Groups
and they are used for reporting within the SCOM Management Portal:
1. Customer Touch Points – components with direct, real-time customer interactions (e.g., OWA).
2. Service Components – components without direct, real-time customer interaction (e.g., OAB generation).
3. Server Components – physical resources of a server (e.g., disk, memory).
4. Dependency Availability – the server's ability to call out to dependencies (e.g., Active Directory).
Conclusion
Managed Availability performs a variety of health assessments within each server. These regular, periodic
tests determine the viability of various components on the server, which establish the health of the server
(or set of servers) before and during user load. When issues are detected, multi-step corrective actions
are taken to bring the server back into a functioning state; in the event that the server is not returned to
a healthy state, Managed Availability can alert operators that attention is needed.
The end result is that Managed Availability focuses on the user experience and ensures that while issues may occur, the user experience is minimally impacted, if at all.
Ross Smith IV, Principal Program Manager, Exchange Customer Experience
Greg "The All-father of Exchange HA" Thiel, Principal Program Manager Architect, Exchange Server
Exchange 2013 Health and Server Reports
Microsoft Exchange Server 2013 introduced new monitoring of the Exchange subsystem, which was also improved by
the release of CU1. This feature is known as Managed Availability. A good description of this feature can be found on the Exchange Team blog in this post: http://blogs.technet.com/b/exchange/archive/2012/09/21/lessons-from-the-datacenter-managed-availability.aspx. What my blog post concentrates on is how to access this information and make it usable to the Exchange Admin/Engineer.
In the RTM version of Exchange 2013 the commands get-serverhealth and get-healthreport were connected more closely
with one piping into the other. CU1 for Exchange 2013 now allows these PowerShell commands to be run independently
of each other.
Get-HealthReport provides the relative health of various components in Exchange and what state it is in – online, partially
online, offline, sidelined, functional, or unavailable. The healthsets available are:
Autodiscover.Protocol, ActiveSync, ActiveSync.Protocol, Autodiscover, Autodiscover.Proxy, ActiveSync.Proxy, ECP.Proxy, EWS.Proxy, OAB.Proxy, OWA.Proxy, RPS.Proxy, RWS.Proxy, Outlook.Proxy, Antimalware, AD, ECP, Ediscovery.Protocol, EDS, EventAssistants, EWS.Protocol, DataProtection, EWS, FIPS, FrontendTransport, HubTransport, HDPhoto, Monitoring, Clustering, DiskController, AntiSpam, FfoQuarantine, MailboxTransport, MSExchangeCertificateDeployment, OWA.Protocol.Dep, MailboxSpace, MailboxMigration, MRS, Network, Search, OWA.Protocol, OWA, PublicFolders, Transport, RPS, SiteMailbox, Outlook, Outlook.Protocol, Store, UM.CallRouter, UM.Protocol, UserThrottling, DAL, Security, IMAP.Protocol, Datamining, Provisioning, POP.Protocol, ProcessIsolation, TransportSync, MessageTracing, CentralAdmin, OAB, Calendaring, PushNotifications.Protocol and RemoteMonitoring.
Now let’s take a look at the output of the command:
What does this report tell us?
The report gives the admin or engineer a quick view of the Exchange 2013 server's relative health. In the sample report that was attached you can see that there were several Health Sets that show as Unhealthy. However, we don't know exactly what that means. Let's dig a bit deeper into each health set that does not show as healthy. First, let's get just that subset of Health Sets. Just run this command:
get-healthreport -server ds-l1-e2k13-01 | where {$_.alertvalue -ne "healthy"} | ft -auto
Let’s go a step further and get more information on each of these Health Sets and which tests triggered the non-Healthy
state:
$test = get-healthreport -server ds-l1-e2k13-01 | where {$_.alertvalue -ne "healthy"}
foreach ($line in $test) {$line.entries | where {$_.alertvalue -ne "healthy"} | ft -auto}
Now we have more information to work with, but still we don’t know why these tests failed. How do we get this
information?
foreach ($line in $test) {$line.entries | where {$_.alertvalue -ne "healthy"} | ft server, name, healthgroupname, alertvalue, lastexecutionresult -auto}
Now that we have all the information we can squeeze out of the Get-HealthReport command, the next step would be to
see why any of these fail. I would begin typical Exchange troubleshooting by looking at event logs as well as turning up
diagnostic logging.
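A hedged example of the "turning up diagnostic logging" step, using an illustrative component name; raise the level, reproduce the issue, review the Application log, then set it back:

Set-EventLogLevel -Identity "MSExchangeTransport\SmtpReceive" -Level High
# ...reproduce the problem and review the Application event log...
Set-EventLogLevel -Identity "MSExchangeTransport\SmtpReceive" -Level Lowest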
Related Information
Get-HealthReport
Get-ServerHealth
Technet article 1
Technet article 2
From <https://justaucguy.wordpress.com/2013/06/06/exchange-2013-health-and-server-reports-ps-part-1/>
Exchange 2013 Health and Server Reports (PS) – Part 2
In the second part of the series we’ll explore the get-serverhealth PowerShell command and what information can be
delved in Exchange 2013 with it.
So what is the get-serverhealth command good for?
For starters, let’s look at the command and its output:
get-serverhealth -identity ds-l1-e2k13-01 |ft
server,currenthealthsetstate,name,healthsetname,alertvalue,healthgroupname -auto
This produces the following output – Server Health XLSX file.
What you will notice is that there are several Health Groups listed: Customer Touch Points, Service Components, Server
Resources and Key Dependencies. Within each Health Group is a grouping of Health Sets.
Customer Touch Points
Service Components
Server Resources
Key Dependencies
Now what can we learn from these Health Sets?
From the above screenshot you can see that there are two issues with the server, having to do with AutoDiscover, and both
are set to Unhealthy. Microsoft has a set of troubleshooting steps for errors that come up in these health checks:
AutodiscoverSelfTestMonitor is unhealthy
AutoDiscover is unhealthy
If we were to examine one of these, the AutoDiscoverCTPMonitor, we can see that the 'AutoDiscover is unhealthy' article directly references our exact issue.
If we were to run the "Invoke-MonitoringProbe Autodiscover\AutodiscoverCtpProbe -Server server1.contoso.com | Format-List" command, we would hope to generate the error that this probe has detected.
From the error it seems that I have not configured my external EWS URL. This can be confirmed in PowerShell:
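A short sketch of that confirmation (the server name is taken from the examples above):

Get-WebServicesVirtualDirectory -Server ds-l1-e2k13-01 | Format-List Identity, InternalUrl, ExternalUrl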
Indeed, the error seems to be generated because I had not configured an external URL for EWS. I then set the external URL on my EWS virtual directory.
After this change was made, I’ve generated a new error:
As this server was meant for testing, multiple URLs are not configured and thus would lead me down a rabbit hole so to
speak to remove all the issues. So as you can see, these monitors are very useful for determining what is wrong with your
server.
Related Resources
Health Sets
All Health Set Troubleshooters
What Did Managed Availability Just Do To This Service?
We in the Exchange product group get this question from time to time. The first thing we ask in
response is always, “What was the customer impact?” In some cases, there is customer impact;
these may indicate bugs that we are motivated to fix. However, in most cases there was no customer
impact: a service restarted, but no one noticed. We have learned while operating the world’s largest
Exchange deployment that it is fantastic when something is fixed before customers even notice.
This is so desirable that we are willing to have a few extra service restarts as long as no customers
are impacted.
You can see this same philosophy at work in our approach to database failovers since Exchange
2007. The mantra we have come to repeat is, “Stuff breaks, but the user experience doesn’t!” User
experience is our number one priority at all times. Individual service uptime on a server is a less
important goal, as long as the user experience remains satisfactory.
However, there are cases where Managed Availability cannot fix the problem. In cases like these,
Exchange provides a huge amount of information about what the problem might be. Hundreds of
things are checked and tested every minute.
Usually, Get-HealthReport and Get-ServerHealth will be sufficient to find the problem, but this
blog post will walk you through getting the full details from an automatic recovery action to the
results of all the probes by:
1. Finding the Managed Availability Recovery Actions that have been executed for a given
service.
2. Determining the Monitor that triggered the Responder.
3. Retrieving the Probes that the Monitor uses.
4. Viewing any error messages from the Probes.
Finding Recovery Actions
Every time Managed Availability takes a recovery action, such as restarting a service or failing over a database, it logs an event in the Microsoft.Exchange.ManagedAvailability/RecoveryActionResults crimson channel. Event 500
indicates that a recovery action has begun. Event 501 indicates that the action that was taken has
completed. These can be collected via the MMC Event Viewer, but we usually find it more useful to
use PowerShell. All of these Managed Availability recovery actions can be collected in PowerShell
with a simple command:
$RecoveryActionResultsEvents = Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults
We can use the events in this format, but it is easier to work with the event properties if we use
PowerShell’s native XML format:
$RecoveryActionResultsXML = ($RecoveryActionResultsEvents | Foreach-object -Process
{[XML]$_.toXml()}).event.userData.eventXml
Some of the useful properties for this Recovery Action event are:
 Id: The action that was taken. Common values are RestartService, RecycleAppPool, ComponentOffline, or ServerFailover.
 State: Whether the action has started (event 500) or finished (event 501).
 ResourceName: The object that was affected by the action. This will be the name of a service for RestartService actions, or the name of a server for server-level actions.
 EndTime: The time the action completed.
 Result: Whether the action succeeded or not.
 RequestorName: The name of the Responder that took the action.
So for example, if you wanted to know why MSExchangeRepl was restarted on your server around 9:30PM, you could run a command like this:
$RecoveryActionResultsXML | Where-Object {$_.State -eq "Finished" -and $_.ResourceName -eq "MSExchangeRepl" -and $_.EndTime -like "2013-06-12T21*"} | ft -AutoSize StartTime,RequestorName
This results in the following output:
StartTime                    RequestorName
---------                    -------------
2013-05-12T21:49:18.2113618Z ServiceHealthMSExchangeReplEndpointRestart
The RequestorName property indicates the name of the Responder that took the action. In this
case, it was ServiceHealthMSExchangeReplEndpointRestart. Often, the responder name will
give you an indication of the problem. Other times, you will want more details.
Finding the Monitor that Triggers a Responder
Monitors are the central part of Managed Availability. They are the primary means, through Get-ServerHealth and Get-HealthReport, by which an administrator can learn the health of a server.
Recall that a Health Set is a grouping of related Monitors. This is why much of our
troubleshooting documentation is focused on these objects. It will often be useful to know what
Monitors and Health Sets are repeatedly unhealthy in your environment.
Every time the Health Manager service starts, it logs events to the Microsoft.Exchange.ActiveMonitoring/ResponderDefinition crimson channel, which we can use to get the properties of the Responders we found in the last step by the RequestorName property. First, we need to collect the Responders that are defined:
$DefinedResponders = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml
One of these Responder Definitions will match the Recovery Action’s RequestorName. The Monitor
that controls the Responder we are interested in is defined by the AlertMask property of that
Definition. Here are some of the useful Responder Definition properties:
 TypeName: The full code name of the recovery action that will be taken when this Responder executes.
 Name: The name of the Responder.
 TargetResource: The object this Responder will act on.
 AlertMask: The Monitor for this Responder.
 WaitIntervalSeconds: The minimum amount of time to wait before this Responder can be executed again. There are other forms of throttling that will also affect this Responder.
To get the Monitor for the ServiceHealthMSExchangeReplEndpointRestart Responder, you run:
$DefinedResponders | ? {$_.Name -eq "ServiceHealthMSExchangeReplEndpointRestart"} | ft -a Name,AlertMask
This results in the following output:
Name                                       AlertMask
----                                       ---------
ServiceHealthMSExchangeReplEndpointRestart ServiceHealthMSExchangeReplEndpointMonitor
Many Monitor names will give you an idea of what to look for. In this case,
the ServiceHealthMSExchangeReplEndpointMonitor Monitor does not tell you much more than
the Responder name did. The Technet article on Troubleshooting DataProtection Health Set lists
this Monitor and suggests running Test-ReplicationHealth. However, you can also get the exact
error messages of the Probes for this Monitor with a couple more commands.
Finding the Probes for a Monitor
Remember that Monitors have their definitions written to the Microsoft.Exchange.ActiveMonitoring/MonitorDefinition crimson channel. Thus, you can get these in a similar way as the Responder definitions in the last step. You can run:
$DefinedMonitors = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[xml]$_.toXml()}).event.userData.eventXml
Some useful properties of a Monitor definition are:
 Name: The name of this Monitor. This is the same name reported by Get-ServerHealth.
 ServiceName: The name of the Health Set for this Monitor.
 SampleMask: The substring that all Probes for this Monitor will have in their names.
 IsHaImpacting: Whether this Monitor should be included when HaImpactingOnly is specified by Get-ServerHealth or Get-HealthReport.
To get the SampleMask for the identified Monitor, you can run:
($DefinedMonitors | ? {$_.Name -eq 'ServiceHealthMSExchangeReplEndpointMonitor'}).SampleMask
This results in the following output:
ServiceHealthMSExchangeReplEndpointProbe
Now that we know what Probes to look for, we can search the Probes' definition channel. Useful properties for Probe Definitions are:
 Name: The name of the Probe. This will begin with the SampleMask of the Probe's Monitor.
 ServiceName: The Health Set for this Probe.
 TargetResource: The object this Probe is validating. This is appended to the Name of the Probe when it is executed to become a Probe Result ResultName.
 RecurrenceIntervalSeconds: How often this Probe executes.
 TimeoutSeconds: How long this Probe should wait before failing.
To get definitions of this Monitor's Probes, you can run:
(Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "ServiceHealthMSExchangeReplEndpointProbe*"} | ft -a Name,TargetResource
This results in the following output:
Remember, not all Monitors use synthetic transactions via Probes. See this blog post for the other
ways Monitors collect their information.
This Monitor has three Probes that can cause it to become Unhealthy. You'll see that each is named with the Monitor's SampleMask but then differentiated. When getting the Probe Results in the next step, the Probes will also have the TargetResource in their ResultName.
Now we know all the Probes that could have failed, but we don't yet know which did or why.
Getting Probe Error Messages
There are many Probes and they execute often, so the channel where they are logged
(Microsoft.Exchange.ActiveMonitoring/ProbeResult) generates a lot of data. There will often
only be a few hours of data, but the Probes we are interested in will probably have a few hundred
Result entries. Here are some of the Probe Result properties you may be interested in for
troubleshooting:
 ServiceName: The Health Set of this Probe.
 ResultName: The Name of this Probe, including the Monitor's SampleMask, an identifier of the code this Probe executes, and the resource it verifies. The target resource is appended to the Probe's name we found in the previous step. In this example, we append /MSExchangeRepl to get ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl.
 Error: The error returned by this Probe, if it failed.
 Exception: The callstack of the error, if it failed.
 ResultType: An integer that indicates one of these values:
   1: Timeout
   2: Poisoned
   3: Succeeded
   4: Failed
   5: Quarantined
   6: Rejected
 ExecutionStartTime: When the Probe started.
 ExecutionEndTime: When the Probe completed.
 ExecutionContext: Additional information about the Probe's execution.
 FailureContext: Additional information about the Probe's failure.
Some Probes may use some of the other available fields to provide additional data about failures.
We can use XPath to filter the large number of events to just the ones we are interested in: those with the ResultName we identified in the last step and with a ResultType of 4, indicating that they failed:
$replEndpointProbeResults = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
To get a nice graphical view of the Probe’s errors, you can run:
$replEndpointProbeResults | select -Property *Time,Result*,Error*,*Context,State* |
Out-GridView
In this case, the full error message for both Probe Results suggests making sure the
MSExchangeRepl service is running. This actually is the problem, as for this scenario I restarted the
service manually.
Summary
This article is a detailed look at how you have access to an incredible amount of information about
the health of Exchange Servers. Hopefully, you will not often need it! In most cases, the alerts will
be enough notification and the included cmdlets will be sufficient for investigation.
Managed Availability is built and hardened at scale, and we continuously analyze these same events
collected in this article so that we can either fix root causes or write Responders to fix more
problems before users are impacted. In those cases where you do need to investigate a problem in
detail, we hope this post is a good starting point.
From <https://blogs.technet.microsoft.com/exchange/2013/06/13/what-did-managed-availability-just-do-to-this-service/>
Customizing Managed Availability
Exchange Server 2013 introduces a new feature called Managed Availability, which is a built-in monitoring system with self-recovery capabilities.
If you're not familiar with Managed Availability, it's a good idea to read these posts:
 Lessons from the Datacenter: Managed Availability
 What Did Managed Availability Just Do To This Service?
As described in the above posts, Managed Availability performs continuous probing to detect possible problems with
Exchange components or their dependencies, and it performs recovery actions to make sure the end user experience is not
impacted due to a problem with any of these components.
However, there may be scenarios where the out-of-box settings may not be suitable for your environment. This blog post
guides you on how to examine the default settings and modify them to suit your environment.
Managed Availability Components
Let’s start by finding out which health sets are on an Exchange server:
Get-HealthReport -Identity Exch2
This produces output similar to the following:
Next, use Get-MonitoringItemIdentity to list out the probes, monitors, and responders related to a health set. For
example, the following command lists the probes, monitors, and responders included in the FrontendTransport health set:
Get-MonitoringItemIdentity -Identity FrontendTransport -Server exch1 | ft name,itemtype -AutoSize
This produces output similar to the following:
You might notice multiple probes with the same name for some components. That's because Managed Availability creates a probe for each resource. In the following example, you can see that OutlookRpcSelfTestProbe is created multiple times (one for each mailbox database present on the server).
Use Get-MonitoringItemIdentity to list the monitoring Item Identities along with the resource for which they are created:
Get-MonitoringItemIdentity -Identity Outlook.Protocol -Server exch1 | ft name,itemtype,targetresource -AutoSize
Customize Managed Availability
Managed Availability components (probes, monitors and responders) can be customized by creating an override.
There are two types of override: local override and global override. As their names imply, a local override is available only
on the server where it is created, and a global override is used to deploy an override across multiple servers.
Either override can be created for a specific duration or for a specific version of servers.
Local Overrides
Local overrides are managed with the *-ServerMonitoringOverride set of cmdlets. Local overrides are stored under
following registry path:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides\
The Microsoft Exchange Health Management service reads this registry path every 10 minutes and loads configuration
changes. Alternatively, you can restart the service to make the change effective immediately.
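For example, to restart the service from the shell (a minimal sketch, assuming the default service name MSExchangeHM for the Microsoft Exchange Health Manager service):
Restart-Service MSExchangeHM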
You would usually create a local override to:
 Customize a managed availability component that is server-specific and not available globally; or
 Customize a managed availability component on a specific server.
Global Overrides
Global overrides are managed with the *-GlobalMonitoringOverride set of cmdlets. Global overrides are stored in the
following container in Active Directory:
CN=Overrides,CN=Monitoring Settings,CN=FM,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=Contoso,DC=com
Get Configuration Details
The configuration details of most probes, monitors, and responders are stored in the respective crimson channel event log for each monitoring item identity; you should examine these first before deciding what to change.
In this example, we will explore the properties of a probe named "OnPremisesInboundProxy", which is part of the FrontendTransport health set. The following command lists the details of the OnPremisesInboundProxy probe:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | %
{[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "OnPremisesInboundProxy"}
You can also use Event Viewer to get the details of a probe definition. The configuration details of most probes are stored in the ProbeDefinition channel:
1. Open Event Viewer, and then expand Applications and Services Logs\Microsoft\Exchange\ActiveMonitoring\ProbeDefinition.
2. Click on Find, and then enter OnPremisesInboundProxy.
3. The General tab does not show much detail, so click on the Details tab; it has the configuration details specific to this probe. Alternatively, you can copy the event details as text and paste them into Notepad or your favorite editor to see the details.
Override Scenarios
Let's look at a couple of real-life scenarios and apply what we have learned so far to customize Managed Availability to our liking, starting with local overrides.
Creating a Local Override
In this example, an administrator has customized one of the Inbound Receive connectors by removing the binding of
loopback IP address. Later, they discover that the FrontEndTransport health set is unhealthy. On further digging, they
determine that the OnPremisesInboundProxy probe is failing.
To figure out why the probe is failing, you can first list the configuration details of OnPremisesInboundProxy probe.
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | %
{[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "OnPremisesInboundProxy"}
Name : OnPremisesInboundProxy
WorkItemVersion : [null]
ServiceName : FrontendTransport
DeploymentId : 0
ExecutionLocation : [null]
CreatedTime : 2013-08-06T12:54:29.7571195Z
Enabled : 1
TargetPartition : [null]
TargetGroup : [null]
TargetResource : [null]
TargetExtension : [null]
TargetVersion : [null]
RecurrenceIntervalSeconds : 300
TimeoutSeconds : 60
StartTime : 2013-08-06T12:54:36.7571195Z
UpdateTime : 2013-08-06T12:48:27.1418660Z
MaxRetryAttempts : 1
ExtensionAttributes :
<ExtensionAttributes><WorkContext><SmtpServer>127.0.0.1</SmtpServer><Port>25</Port><HeloDomain>InboundProxyProbe</HeloDomain><MailFrom Username="inboundproxy@contoso.com"/><MailTo Select="All" Username="HealthMailboxdd618748368a4935b278e884fb41fd8a@FM.com"/><Data AddAttributions="false">X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250
Subject:Inbound proxy probe</Data><ExpectedConnectionLostPoint>None</ExpectedConnectionLostPoint></WorkContext></ExtensionAttributes>
The ExtensionAttributes property above shows that the probe is using 127.0.0.1 to connect to port 25. As that is the loopback address, the administrator needs to change the SMTP server in the ExtensionAttributes property to enable the probe to succeed.
Use the following command to create a local override and change the SMTP server to the host name instead of the loopback IP address.
Add-ServerMonitoringOverride -Server ServerName -Identity FrontEndTransport\OnPremisesInboundProxy -ItemType Probe -PropertyName ExtensionAttributes -PropertyValue '<ExtensionAttributes><WorkContext><SmtpServer>Exch1.contoso.com</SmtpServer><Port>25</Port><HeloDomain>InboundProxyProbe</HeloDomain><MailFrom Username="inboundproxy@contoso.com" /><MailTo Select="All" Username="HealthMailboxdd618748368a4935b278e884fb41fd8a@FM.com" /><Data AddAttributions="false">X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250
Subject:Inbound proxy probe</Data><ExpectedConnectionLostPoint>None</ExpectedConnectionLostPoint></WorkContext></ExtensionAttributes>' -Duration 45.00:00:00
The override will be created on the specified server under the following registry path:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides\Probe
You can use the following command to verify that the override has taken effect:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | %
{[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "OnPremisesInboundProxy"}
Creating a Global Override
In this example, the organization has an EWS application that is keeping the EWS app pools busy with complex queries. The administrator discovers that the EWS app pool is recycled during long-running queries, and that the EWSProxyTestProbe probe is failing.
To find out the details of EWSProxyTestProbe, run the following:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | %
{[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "EWSProxyTestProbe"}
Next, change the timeout interval for EWSProxyTestProbe to 25 seconds on all servers running Exchange Server 2013 RTM CU2.
Use the following command to get version information for the Exchange 2013 RTM CU2 servers:
Get-ExchangeServer | ft name,admindisplayversion
Then use the following command to create a new global override:
Add-GlobalMonitoringOverride -Identity "EWS.Proxy\EWSProxyTestProbe" -ItemType Probe -PropertyName TimeoutSeconds -PropertyValue 25 -ApplyVersion "15.0.712.24"
Override Durations
Either of the above-mentioned overrides can be created for a specific duration or for a specific version of Exchange servers.
An override created with the Duration parameter is effective only for the period specified, and the maximum duration that can be specified is 60 days. For example, an override created with Duration 45.00:00:00 will be effective for 45 days from the time of creation.
A version-specific override is effective as long as the Exchange server version matches the value specified. For example, an override created for Exchange 2013 CU1 with version "15.0.620.29" will be effective until the Exchange server version changes. The override becomes ineffective if the Exchange server is upgraded to a different Cumulative Update or Service Pack.
Hence, if you need an override to remain in effect for a longer period, create it using the ApplyVersion parameter.
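Before adding new overrides, it can also be helpful to review the overrides that already exist. A minimal sketch using the corresponding Get cmdlets (the server name is a placeholder):
Get-GlobalMonitoringOverride
Get-ServerMonitoringOverride -Server ServerName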
Removing an Override
Finally, this last example shows how to remove the local override that was created for the OnPremisesInboundProxy probe.
Remove-ServerMonitoringOverride -Server ServerName -Identity
FrontEndTransport\OnPremisesInboundProxy -ItemType Probe -PropertyName
ExtensionAttributes
Conclusion
Managed Availability performs gradual recovery actions to automatically recover from failure scenarios. Overrides help you customize the configuration of the Managed Availability components to suit your environment. The steps described in this document can be used to customize Monitors and Probes as required.
Special thanks to Abram Jackson, Scott Schnoll, Ben Winzenz, and Nino Bilic for reviewing this post.
Bhalchandra Atre
From <http://blogs.technet.com/b/exchange/archive/2013/08/13/customizing-managed-availability.aspx>
Managed availability in Exchange 2013
The release of Exchange 2013 added another gem to the set of Exchange functionalities. Managed Availability is also known as Active Monitoring or Local Active Monitoring (LAM). Briefly speaking, it is a built-in Exchange monitoring system which automatically analyzes mail server components and, if it detects errors or corruption, attempts to fix them (e.g. by switching a mailbox database to another server).
Monitoring Exchange server health and performance using
managed availability
The structure of Managed Availability consists of three components:
1. Probing
2. Monitoring
3. Responder
Probing carries out multiple tests on particular mail server services (e.g. client protocols, storage, services responsible for mail flow, data migration or storage). The tests are carried out on the basis of:
 performance tests – they verify the response value of a particular service against predefined performance thresholds (e.g. what is the response time of the Exchange ActiveSync service),
 health tests – they check the status of a service (active or not responding),
 exception tests – they verify whether there are any exceptional events in running services.
The probe configuration cannot be changed. Probing may itself generate an analysis report (e.g. in the form of an entry in the event logs) or forward the results to the monitoring component. The results of probing can be found in Event Viewer:
Event Viewer -> Applications and Services Logs -> Microsoft -> Exchange -> ActiveMonitoring -> ProbeResult
Monitoring is the core component and holds a decisive role in the Managed Availability structure. It is responsible for analyzing the data gathered by the probe component and determines the action to be taken on a monitored service or Exchange component, which results in creating notifications in the event logs, or in sending information to the Responder component to execute a command (e.g. a service restart). A monitor surveys the state of a particular Exchange component and may indicate the following states:
 healthy state – indicated when the gathered data shows no anomalies for a monitored Exchange component,
 unhealthy state – indicated when there is a problem concerning a monitored component,
 degraded state – shown when the monitor indicates inappropriate behavior of a service within a time limit of 60 seconds,
 disabled state – when a monitor is disabled as a result of administrative actions,
 unavailable – a monitor is unable to analyze a component or service,
 repairing – happens when Managed Availability is attempting to repair a component.
A responder is responsible for taking actions on components analyzed by the monitoring component. Such actions include: a service or server restart, entries in event logs, an IIS reset, switching a mailbox to a different database or databases to a different server, and turning a service offline or online, which may result in rejection or acceptance of client requests by a service.
The physical structure
In a strictly technical sense Managed Availability is based on two processes:
1. msexchangehmworker.exe – this process monitors the state of Exchange 2013 components
2. msexchangehmhost.exe (Exchange Health Manager Service) – it manages worker processes
The second process (msexchangehmhost.exe) is more important; if it goes down, the whole Managed Availability component will also go down. The screenshot below presents both processes in Task Manager:
Microsoft doesn't recommend turning off any of the Managed Availability components, as it may limit the availability of some elements or affect the whole Exchange 2013 server system. However, there may be situations in which we would like to turn off one of Managed Availability's functionalities (e.g. if we suspect that it somehow affects the performance and stability of our server). We shouldn't do this by terminating the Exchange Health Manager Service, but by using the Set-ServerComponentState cmdlet. For example, to turn off the monitoring feature in Managed Availability, we need to execute the command below:
Set-ServerComponentState -Identity <server_name> -Component Monitoring -Requester Functional -State Inactive
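To confirm the change, you can query the component state afterwards (a sketch using the same placeholder server name):
Get-ServerComponentState -Identity <server_name> -Component Monitoring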
Overrides
As mentioned before, the Monitoring component analyzes the data gathered during probing. The analysis compares the gathered results with predefined thresholds (for certain service checks), which mark the line between correct and incorrect service behavior. If a component is recognized as working improperly on the basis of this analysis, an appropriate entry is recorded in the event logs, or a specific action is forwarded to the Responder, which attempts to restore the malfunctioning service to a healthy state. However, it is possible to change the predefined thresholds and the actions that are sent to the Responder; we can set values that fit our Exchange 2013 environment. These changed values are called Overrides. For example, installing a Cumulative Update on an Exchange server may cause some services to report their current state incorrectly during probing. Usually, the simplest way to restore correct probing is to restart all monitored services; in this case, setting non-standard override values will restart services when the monitoring component receives information about their improper behavior.
Override values can be set globally for the entire Exchange organization, or locally for a single server. The local override configuration is held in the local server registry:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides\
The global override configuration can be found in Active Directory:
CN=Overrides,CN=Monitoring Settings,CN=FM,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=Example,DC=com
Whether we want to configure overrides at the global or local level, we use the two following cmdlets:
Add-GlobalMonitoringOverride
Add-ServerMonitoringOverride
Log entries
Apart from its basic function of repairing Exchange services, Managed Availability also keeps logs of probing, monitoring, and responder actions. These logs can be found here:
Event Viewer -> Applications and Services Logs -> Microsoft -> Exchange -> Active Monitoring
Event Viewer -> Applications and Services Logs -> Microsoft -> Exchange -> Managed Availability
Under ActiveMonitoring we can find all the information on probe, monitor, and responder configuration, as well as the results of their activity. Under Managed Availability we can find information about all repair attempts undertaken by this component.
Health Mailboxes
In Exchange 2013, Managed Availability uses so-called Health Mailboxes to simulate user actions such as sending and receiving messages. Each of these mailboxes is associated with an Active Directory account. The Health Mailbox implementation has evolved together with Exchange 2013: before Cumulative Update 6 there was only one Health Mailbox per database and per Client Access Server (CAS); since Cumulative Update 6 there is one Health Mailbox per database on a Mailbox server, but 10 Health Mailboxes per Client Access Server. These mailboxes are kept in the Active Directory container:
In order to display Health Mailboxes in PowerShell, we type in the following cmdlet:
Get-mailbox -monitoring | ft name,database
A Health Mailbox is a simple mailbox which is associated with an Active Directory user. The display name of a user whose Health Mailbox belongs to a CAS server follows this pattern:
The screenshot below illustrates an example:
The display name for an Active Directory user with a Health Mailbox associated with a database:
HealthMailbox-server_name_MBX-data_base_name
An example:
In everyday work with Health Mailboxes there are two scenarios that may require an administrator's intervention. The first is a corrupted Health Mailbox. It may appear when the database associated with the mailbox is deleted by an administrator; the user account that refers to such a mailbox becomes "orphaned" because it is no longer connected to any object. The best solution is to delete the orphaned account in Active Directory and restart the Health Manager service.
The second scenario requiring an administrator's attention is lockouts of the Active Directory user accounts associated with Health Mailboxes. Whenever such an account is locked out, Managed Availability is unable to perform any tests that simulate Exchange user actions. A lockout results from the organization's Password and Account Lockout Policies being applied to the accounts associated with Health Mailboxes in the Monitoring Mailboxes container. Passwords for these accounts are changed by the Health Mailbox worker and are 128 characters long, which in some cases may not fulfill the password policy and will result in lockouts of these accounts (in accordance with the Account Lockout Policies). That is why Microsoft recommends not including the accounts in the Monitoring Mailboxes container in password policies. What's more, it is better not to:
 move users from the Monitoring Mailboxes container to other containers or organizational units,
 change account properties in the Monitoring Mailboxes container,
 disable accounts (in Monitoring Mailboxes) through the organization's Password and Account Lockout Policies,
 change inheritance on AD objects,
 move Health Mailboxes between databases,
 put quotas on Health Mailboxes,
 in the case of retention policies, delete data in Health Mailboxes before at least 30 days have passed.
The usage of Managed Availability
Type the following command in the Exchange Management Shell (EMS) to verify the status of particular components in the Exchange organization:
Get-HealthReport -Identity Exchange_server_name
As we can observe in the screenshot below, Get-HealthReport displays the status of some of the HealthSets. A single health set is a list of probes, monitors and responders, organized into a logical set which addresses a particular service or component of the Exchange server.
In order to show all health sets with the Unhealthy status, execute the following command:
Get-HealthReport -Identity server_name | Where-Object {$_.AlertValue -eq 'Unhealthy'}
The displayed HealthSet called MailboxTransport is shown as Unhealthy. We want to check which one of the monitors
reports this status using the command below:
Get-ServerHealth -Identity server_name -HealthSet healthSet_name
The monitor called Mapi.Submit.Monitor is the one responsible for the status of the health set which refers
to MailboxTransport.
To verify the configuration of Mapi.Submit.Monitor, we should display the records in the event log called ActiveMonitoring/MonitorDefinition. We can look for this data through the Event Viewer graphical interface or simply use the following command (recommended):
(Get-WinEvent -ComputerName server_name -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -eq "monitor_name"}
Now we want to check which probe feeds data into this monitor. To do that, we should look at the value of the property called SampleMask; in this case it is Mapi.Submit.Probe. Next, using the event logs, we extract all error entries concerning this particular probe (Mapi.Submit.Probe). To achieve this we will use this command:
$errRecords = (Get-WinEvent -ComputerName domainA-mail -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='Name/ResourceType'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
We will need the Name/ResourceType value for the above command. Let's use:
Get-MonitoringItemIdentity -Identity HealthSet_name -server server_name | select
HealthSetName,Name,TargetResource,ItemType
The Get-MonitoringItemIdentity cmdlet displays the probes, monitors and responders associated with a particular health set.
The screenshot above shows that the Name/ResourceType value becomes simply Mapi.Submit.Probe, as this probe is not associated with any ResourceType. Therefore, the command that gathers all error entries from the event logs connected with Mapi.Submit.Probe will look like this:
$errRecords = (Get-WinEvent -ComputerName domainA-mail -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='Mapi.Submit.Probe'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
In order to display the error that caused the MailboxTransport health set to become Unhealthy, we should filter $errRecords with the following command:
$errRecords | select -Property *time,result*,error*,*context
The above screenshot indicates that the issues are caused by delays between the Store and Submission components during the test sending of a message.
Let's check what repair method Managed Availability undertakes. It is important to check which responder is connected with Mapi.Submit.Monitor. In this case let's use the command:
(Get-WinEvent -ComputerName server_name -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml | ? {$_.AlertMask -like "*monitor_name*"} | fl Name,AlertMask,EscalationSubject,EscalationMessage,UpdateTime
The responder we were looking for is Mapi.Submit.EscalateResponder, as suggested by the screenshot above. This type of responder (Escalate) doesn't make Managed Availability undertake any automatic repairs, but is responsible for logging notifications in the event logs.
The bottom line
Managed Availability is a powerful component that provides automatic monitoring, appropriate log entries, and repair of improperly working components and services in Exchange 2013. After Exchange server is installed, Managed Availability doesn't require any configuration to work. However, there are situations which require administrators to change the default settings in order to neutralize incorrect reporting and automatic attempts to repair services and components that are actually working properly. Managed Availability processes a huge amount of data, which makes it hard for an administrator to extract specific information. As we have shown, it can be done (though not easily). Such analysis helps in better understanding the monitoring processes, what they consist of, and most importantly what to do when Exchange 2013 starts to work improperly.
Suggested reading:
 Read about repairing mailboxes in Exchange with New-MailboxRepairRequest
 How to manage Role Based Access Control in Exchange 2013?
 User mailbox and shared mailbox auditing in Exchange 2013
 Check out CodeTwo software for Exchange and Office 365
Managed availability in Exchange 2013 by Adam the 32-bit Aardvark
From <https://www.codetwo.com/admins-blog/managed-availability-in-exchange-2013/>
Managed Availability HealthSet Troubleshooting
Knowing how to deal with annoying HealthSets was scary at the beginning of my experience with Exchange 2013.
The newly introduced feature called Managed Availability is a built-in monitoring system that can take recovery actions, and those actions can cause serious issues while trying to solve a small one.
There are 3 types of components that can be related to HealthSets:
 Probe: used to determine whether Exchange components are active.
 Monitor: when probes signal a different state than the one stored in the patterns of the monitoring engine, the monitoring engine determines whether a component or feature is unhealthy.
 Responder: takes action when a monitor alerts the responder about an unhealthy state. Responders take different actions depending on the type of component or feature; actions can start with just recycling an application pool and can go as far as restarting the server or, even worse, putting the server offline so it won't accept any connections.
In this blog post I will talk about troubleshooting different HealthSets in Exchange 2013.
Troubleshooting Exchange HealthSet MailboxSpace
We should start by getting a health report for the Exchange 2013 server using the Get-HealthReport cmdlet:
Get-HealthReport -Identity EXCH2K13
Image 1
If you want to list only those HealthSets that are Unhealthy, Degraded, or Disabled, you can use this command:
Get-HealthReport -Server EXCH2K13 | where { $_.alertvalue -ne "Healthy" }
Let's list a couple of these components for the MailboxSpace HealthSet
Get-MonitoringItemIdentity -Identity MailboxSpace -Server EXCH2K13 | ft
Identity,ItemType,TargetResource -autosize
Image 2
As you can see, the HealthSet has multiple Probes, Monitors, and Responders.
What if you have a HealthSet with a status of Unhealthy or Repairing, like MailboxSpace for a test DB?
We need to investigate further to check which monitors are causing the HealthSet to go into an Unhealthy state.
Get-ServerHealth -Identity EXCH2K13 -HealthSet "MailboxSpace"
Image 3
As you can see above, a lot of Monitors are Unhealthy.
Assume that in your production environment you have a test DB located on the C: drive that you don't want to move or delete, but because of the limited space available on C: you are getting these Unhealthy monitors.
We can use Add-ServerMonitoringOverride to disable these monitors.
Add-ServerMonitoringOverride
http://technet.microsoft.com/en-us/library/jj218628(v=exchg.150).aspx
The limitation to be aware of is the 60-day maximum duration for a server override:
Add-ServerMonitoringOverride -Duration 60.00:00:00 -Identity ProbeMonitorResponderName -ItemType Monitor -PropertyName Enabled -PropertyValue 0
Using the result we got in Image 2 with Get-MonitoringItemIdentity and combining that with Get-ServerHealth, we can identify the Monitors that need to be overridden.
We have the following Monitors in an Unhealthy or Repairing state:
MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationNotification\DB01
MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationNotification\DB02
MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationProcessingMonitor
MailboxSpace\DatabaseSizeMonitor\DB01
MailboxSpace\DatabaseSizeMonitor\DB02
MailboxSpace\StorageLogicalDriveSpaceMonitor\C:
Add-ServerMonitoringOverride -ItemType Monitor -Identity
"MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationNotification\DB01" -PropertyValue
0 -PropertyName Enabled -Duration "60.00:00:00" -Server EXCH2K13
Add a server override for each of the Monitors above (a sketch of a loop is shown below); please make sure the ItemType is correct, whether it's a Probe, Monitor, or Responder.
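A loop like the following saves typing each override by hand (a sketch; the identities are copied from the list above, 60 days is the maximum duration, and the server name matches the earlier examples):
$monitorsToDisable = @(
    "MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationNotification\DB01",
    "MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationNotification\DB02",
    "MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationProcessingMonitor",
    "MailboxSpace\DatabaseSizeMonitor\DB01",
    "MailboxSpace\DatabaseSizeMonitor\DB02",
    "MailboxSpace\StorageLogicalDriveSpaceMonitor\C:"
)
# Adjust -ItemType per entry (Probe, Monitor or Responder) as noted above
foreach ($monitor in $monitorsToDisable) {
    Add-ServerMonitoringOverride -Server EXCH2K13 -Identity $monitor -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration "60.00:00:00"
}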
At the end you can verify your Server Overrides with Get-ServerMonitoringOverride
Image 4
Now we should check ServerHealth to see if the Monitors have been disabled
Get-ServerHealth -Identity EXCH2K13 -HealthSet "MailboxSpace" | ft -Autosize
Image 5
MailboxSpace HealthSet is Healthy now.
Image 6
Troubleshooting FEP HealthSet
Some of you don't have Forefront installed, so you may want to disable this HealthSet on the server.
We will achieve this simply by changing the XML file that corresponds to the FEP Health Set.
Browse to C:\Program Files\Microsoft\Exchange\V15\Bin\Monitoring\
Search for FEPActiveMonitoringContext and open the file with Notepad.
Change line 12: Enabled = "True"
Replace True with False to disable FEP monitoring.
The file should look something like this :
<?xml version="1.0" encoding="iso-8859-1"?>
<Definition xsi:noNamespaceSchemaLocation="..\..\WorkItemDefinition.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<!--FEPService Maintenance definition section-->
<MaintenanceDefinition
AssemblyPath="Microsoft.Exchange.Monitoring.ActiveMonitoring.Local.Components.dll"
TypeName="Microsoft.Exchange.Monitoring.ActiveMonitoring.FEP.FEPDiscovery"
Name="FEP.Maintenance.Workitem"
ServiceName="FEP"
RecurrenceIntervalSeconds="0"
TimeoutSeconds="30"
MaxRetryAttempts="0"
Enabled = "false">
After you modify the above line, you should restart the Microsoft Exchange Health Manager service on the server where you modified the XML file.
Troubleshooting CAS Proxy HealthSets
What if you have TMG in your organization and you need to set OWA/ECP to Basic Authentication?
You will probably disable Forms Authentication on OWA and ECP.
Soon after you have disabled Forms Authentication, you will start seeing that some server components go into an Inactive state, such as OWA.Proxy, ECP.Proxy, and RWS.Proxy.
You can check with : Get-ServerComponentState -Identity EXCH2K13
Image 7
We can set the component back to Active manually by running this cmdlet :
Set-ServerComponentState -Identity EXCH2K13 -Component EcpProxy -State Active -Requester
HealthAPI
After 1 hour the components will return to an Inactive state.
If we continue with the troubleshooting and check the crimson logs on the server, you will find events related to the ECP.Proxy probe.
More information about Crimson channel event logging can be found here
http://technet.microsoft.com/en-us/library/dd351258(v=exchg.150).aspx#Crimson
Event Viewer > Application and Services Logs > Microsoft > Exchange > ActiveMonitoring > ProbeResult
Find the event related to the Probe Result (Name=ECPProxyTestProbe/MSExchangeECPAppPool), select Details, and at StateAttribute3 you will see:
"FailurePoint=FrontEnd,HttpStatusCode=401,Error=Unauthorized,Details=,HttpProxySubErrorCode=,WebExceptionStatus=,LiveIdAuthResult="
The ECP.Proxy probe is failing with a 401 Unauthorized error; the credential used can be seen at StateAttribute2.
Verify the HealthSets for ECP and OWA:
Get-HealthReport -Server EXCH2K13
You will see that the ECP, OWA, ECP.Proxy, OWA.Proxy, and RWS.Proxy HealthSets are Unhealthy.
To remove this behavior we can disable the monitoring probes for OWA, ECP, and RWS.
Open Windows Explorer and browse to :
C:\Program Files\Microsoft\Exchange Server\V15\Bin\Monitoring\Config\
Open ClientAccessProxyTest.xml with Notepad
Change the "true" value of the following Monitoring Probes
ECPProbeEnabled = "false"
OWAProbeEnabled = "false"
ReportingProbeEnabled = "false"
Save the ClientAccessProxyTest.xml and close it
Restart the Microsoft Exchange Health Manager service on the server where you modified the XML file
Disabling the Monitoring Probes has no impact on the Exchange Servers Proxy functionality.
If you want to modify any other settings in the XML files located in Bin\Monitoring\Config\, please consult a Microsoft Exchange support engineer before making any modifications to those files.
To conclude, the problem is with the authentication method used on the ECP and OWA IIS sites: the monitoring probes can only use Forms-Based Authentication and Windows Authentication to test ECP, OWA, and RWS functionality.
I hope the information provided was helpful for you.
If you have any questions please feel free to send an email to a-crtimo@microsoft.com.
Tags ECP.Proxy Exchange 2013 HealthSet Managed Availability OWA.Proxy
From <https://blogs.technet.microsoft.com/ehlro/2014/02/20/exchange-2013-managed-availability-healthset-troubleshooting/>
Managed Availability and Server Health
Every second on every Exchange 2013 server, Managed Availability polls and analyzes hundreds of health
metrics. If something is found to be wrong, most of the time it will be fixed automatically. But of course
there will always be issues that Managed Availability won’t be able to fix on its own. In those cases,
Managed Availability will escalate the issue to an administrator by means of event logging, and perhaps
alerting if System Center Operations Manager is used in tandem with Exchange 2013. When an
administrator needs to get involved and investigate the issue, they can begin by using the Get-HealthReport and Get-ServerHealth cmdlets.
Server Health Summary
Start with Get-HealthReport to find out the status of every Health Set on the server:
Get-HealthReport -Identity <ServerName>
This will result in the following output (truncated for brevity):
Server  State         HealthSet       AlertValue LastTransitionTime MonitorCount
------  -----         ---------       ---------- ------------------ ------------
Server1 NotApplicable AD              Healthy    5/21/2013 12:23    14
Server1 NotApplicable ECP             Unhealthy  5/26/2013 15:40    2
Server1 NotApplicable EventAssistants Healthy    5/29/2013 17:51    40
Server1 NotApplicable Monitoring      Healthy    5/29/2013 17:21    9
...
In the above example, you can see that the ECP (Exchange Control Panel) Health Set is Unhealthy. And based on the value for MonitorCount, you can also see that the ECP Health Set relies on two Monitors. Let's find out if both of those Monitors are Unhealthy.
Monitor Health
The next step would be to use Get-ServerHealth to determine which of the ECP Health Set Monitors are
in an unhealthy state.
Get-ServerHealth -Identity <ServerName> -HealthSet ECP
This results in the following output:
Server  State         Name               TargetResource HealthSetName AlertValue ServerComponent
------  -----         ----               -------------- ------------- ---------- ---------------
Server1 NotApplicable EacSelfTestMonitor                ECP           Unhealthy  None
Server1 NotApplicable EacDeepTestMonitor                ECP           Unhealthy  None
As you can see above, both Monitors are Unhealthy. As an aside, if you pipe the above command to
Format-List, you can get even more information about these Monitors.
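For example (using the same placeholder server name):
Get-ServerHealth -Identity <ServerName> -HealthSet ECP | Format-List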
Troubleshooting Monitors
Most Monitors are one of these four types:
The EacSelfTestMonitor Probes along the "1" path, while the EacDeepTestMonitor Probes along the "4"
path. Since both are unhealthy, it indicates that the problem lies on the Mailbox server in either the
protocol stack or the store. It could also be a problem with a dependency, such as Active Directory, which
is common when multiple Health Sets are unhealthy. In this case, the Troubleshooting ECP Health
Set topic would be the best resource to help diagnose and resolve this issue.
From <https://blogs.technet.microsoft.com/exchange/2013/06/26/managed-availability-and-server-health/>
Managed Availability Probes
Probes are one of the three critical parts of the Managed Availability framework (monitors and
responders are the other two). As I wrote previously, monitors are the central components, and you can
query monitors to find an up-to-the-minute view of your users’ experience. Probes are how monitors
obtain accurate information about that experience.
There are three major categories of probes: recurrent probes, notifications, and checks.
Recurrent Probes
The most common probes are recurrent probes. Each probe runs every few minutes and checks some
aspect of service health. They may transmit an e-mail to a monitoring mailbox using Exchange
ActiveSync, connect to an RPC endpoint, or establish CAS-to-Mailbox server connectivity. All of these
probes are defined in the Microsoft.Exchange.ActiveMonitoring\ProbeDefinition event log channel each
time the Exchange Health Manager service is started. The most interesting properties for these events
are:
 Name: The name of the Probe. This will begin with the SampleMask of the Probe's Monitor.
 TypeName: The code object type of the probe that contains the probe's logic.
 ServiceName: The name of the Health Set for this Probe.
 TargetResource: The object this Probe is validating. This is appended to the Name of the Probe when it is executed to become a Probe Result ResultName.
 RecurrenceIntervalSeconds: How often this Probe executes.
 TimeoutSeconds: How long this Probe should wait before failing.
On a typical Exchange 2013 multi-role server, there are hundreds of these probes defined. Many probes
are per-database, so this number will increase quickly as you add databases. In most cases, the logic in
these probes is defined in code, and not directly discoverable. However, there are two probe types that
are common enough to describe in detail, based on the TypeName of the probe:
 Microsoft.Exchange.Monitoring.ActiveMonitoring.ServiceStatus.Probes.GenericServiceProbe: Determines whether the service specified by TargetResource is running.
 Microsoft.Exchange.Monitoring.ActiveMonitoring.ServiceStatus.Probes.EventLogProbe: Logs an error result if the event specified by ExtensionAttributes.RedEventIds has occurred in the ExtensionAttributes.LogName. Success results are logged if the ExtensionAttributes.GreenEventIds is logged. These probes will not work if you override them to watch for a different event.
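To see which probes of these two types are defined on a server, you can filter the ProbeDefinition channel on TypeName, following the same Get-WinEvent pattern used throughout these posts (a sketch):
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.TypeName -like "*GenericServiceProbe" -or $_.TypeName -like "*EventLogProbe"} | ft -a Name,TypeName,TargetResource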
The basics of a recurrent probe are as follows: start every RecurrenceIntervalSeconds and check (or probe)
some aspect of component health. If the component is healthy, the probe passes and writes an
informational event to the Microsoft.Exchange.ActiveMonitoring\ProbeResult channel with
a ResultType of 3. If the check fails or times out, the probe fails and writes an error event to the same
channel. A ResultType of 4 means the check failed and a ResultType of 1 means that it timed out. Many
probes will re-run if they time out, up to the MaxRetryAttempts property.
The ProbeResult channel gets very busy with hundreds of probes running every few minutes and logging
an event, so there can be a real impact on the performance of your Exchange server if you perform
expensive queries against this event channel in a production environment.
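If you do need to look at this channel on a busy server, constrain the query as much as you can, for example with an XPath filter and a cap on the number of events returned (a sketch; the probe name is just an illustration reused from earlier in this document):
Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -MaxEvents 200 -FilterXPath "*[UserData[EventXML[ResultName='OnPremisesInboundProxy'][ResultType='4']]]"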
Notifications
Notifications are probes that are not run by the health manager framework, but by some other service
on the server. These services perform their own monitoring, and then feed data into the Managed
Availability framework by directly writing probe results. You will not see these probes in the
ProbeDefinition channel, as this channel only describes probes that are run within the Managed
Availability framework.
For example, the ServerOneCopyMonitor Monitor is triggered by Probe results written by the
MSExchangeDagMgmt service. This service performs its own monitoring, determines whether there is a
problem, and logs a probe result. Most Notification probes have the capability to log both a red event
that turns the Monitor Unhealthy and a green event that makes the Monitor healthy once more.
Checks
Checks are probes that only log events when a performance counter passes above or below a defined
threshold. They are really a special type of Notification probe, as there is a service monitoring the
performance counters on the server and logging events to the ProbeResult channel when the configured
threshold is met.
To find the counter and threshold that is considered unhealthy, you can look at Monitor Definitions with a Type property of:
· Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueAboveThresholdMonitor or
· Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueBelowThresholdMonitor
This means that the probe the Monitor watches is a Check probe.
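One way to list these monitors and the thresholds they use is to filter the MonitorDefinition channel on those type names (a sketch along the lines of the earlier queries):
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.TypeName -like "*ConsecutiveSampleValue*ThresholdMonitor"} | ft -a Name,SampleMask,MonitoringThreshold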
How this works with Monitors
From the Monitor’s perspective, all three probe types are the same as they each log to the ProbeResult
channel. Every Monitor has a SampleMask property in its definition. As the Monitor executes, it looks for
events in the ProbeResult channel that have a ResultName that matches the Monitor’s SampleMask.
These events could be from recurrent probes, notifications, or checks. If the Monitor’s thresholds are
reached or exceeded, it becomes Unhealthy.
It is worth noting that a single probe failure does not necessarily indicate that something is wrong with
the server. It is the design of Monitors to correctly identify when there is a real problem that needs fixing
versus a transient issue that resolves itself or was anomalous. This is why many Monitors have thresholds
of multiple probe failures before becoming Unhealthy. Even many of these problems can be fixed
automatically by Responders, so the best place to look for problems that require manual intervention is
in the Microsoft.Exchange.ManagedAvailability\Monitoring crimson channel. These events sometimes
also include the most recent probe error message (if the developers of that Health Set view it as relevant
when they get paged with that event’s text in Office 365).
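To read that channel from the shell rather than Event Viewer, a simple query like the following can be used (a sketch; only standard Get-WinEvent record properties are shown rather than the event's XML payload):
Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/Monitoring -MaxEvents 50 | Select-Object TimeCreated,Id,Message | Format-List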
There are more details on how Monitors work, and how they can be overridden to use different
thresholds in the Managed Availability Monitors article.
Abram Jackson
Program Manager, Exchange Server
From <https://blogs.technet.microsoft.com/exchange/2014/08/11/managed-availability-probes/>
Managed Availability Monitors
Monitors are the central component of Managed Availability. They define what data to collect, what
constitutes the health of a feature, and what actions to take to restore a feature to good health. Because
there are several different aspects to Monitors, it can be hard to figure out how a specific Monitor works.
All of the properties discussed in this article can be found in the Monitor's definition event in the Microsoft.Exchange.ActiveMonitoring\MonitorDefinition crimson channel of the Windows event log.
See this article for how these definitions can be easily collected.
What Data is Collected?
Nearly all Monitors collect one of three types of data: direct notifications, Probe results, or performance
counters. Monitors that change states based on a direct notification only get data from the notification.
Monitors based on Probe results become unhealthy when some Probes fail. There are two main types
of these Monitors, those based on a number of consecutive Probe failures, and those based on a number
of Probes failing over an interval.
Monitors based on performance counters simply determine if a counter is higher or lower than the built-in defined threshold for the required time.
The TypeName property of a Monitor definition indicates what data it is collecting and the kind of threshold that must be reached before it is considered Unhealthy. Here are the most common types with what they use:
 OverallPercentSuccessMonitor: Looks at the results of all probes matching the SampleMask property and calculates the aggregate percent success over the past MonitoringIntervalSeconds. Becomes Unhealthy if the calculated percent success is less than the MonitoringThreshold.
 OverallConsecutiveProbeFailuresMonitor: Looks at the last X probe results as configured in MonitoringThreshold that match the SampleMask. Becomes Unhealthy if all of those results are failures.
 OverallXFailuresMonitor: Looks at the results of all probes matching the SampleMask property over the past MonitoringIntervalSeconds. Becomes Unhealthy if at least X results as configured in MonitoringThreshold are failures.
 OverallConsecutiveSampleValueAboveThresholdMonitor: Looks at the last X performance counter results as configured in SecondaryMonitoringThreshold matching SampleMask over the past MonitoringIntervalSeconds. Becomes Unhealthy if at least X performance counters are above the threshold configured in MonitoringThreshold.
Healthy or Not
One more thing must happen before the Monitor will become Unhealthy. The code for individual
Monitors that checks the threshold only runs every X seconds, where X is specified by
the RecurrenceIntervalSeconds property. The threshold is checked only when the Monitor runs.
As soon as the Monitor runs while the threshold is met, the Monitor becomes Unhealthy. Get-ServerHealth will report that the Monitor is Degraded for the first 60 seconds, but the functional
behavior of the Monitor does not have a concept of being Degraded; it is either Healthy or Unhealthy.
The Health Set that a Monitor is part of is defined by the Monitor’s ServiceName property. If any Monitor
is Unhealthy, the entire Health Set will be marked as Unhealthy as viewed from Get-HealthReport or
via System Center Operations Manager (SCOM).
Responder Timeline
The StateTransitionXML property of a Monitor definition indicates which Responders execute and when,
as each Responder is tied to a transition state of the Monitor. Let’s consider a Monitor that has this value
for its StateTransitionXML property:
<StateTransitions>
  <Transition ToState="Unhealthy" TimeoutInSeconds="0" />
  <Transition ToState="Unhealthy1" TimeoutInSeconds="30" />
  <Transition ToState="Unhealthy2" TimeoutInSeconds="330" />
  <Transition ToState="Unrecoverable" TimeoutInSeconds="1500" />
</StateTransitions>
As soon as the Monitor runs while its defined threshold is met, it will transition to the “Unhealthy” state.
These transition states are only used for internal consumption. Although they share a term, the Monitor
can only be Healthy or Unhealthy from an external perspective. Any Responders set to execute when
this Monitor is in this transition state will now execute. After 30 more seconds, any Responders set to
execute when the Monitor is in the “Unhealthy1” state will now execute. The next Responder will be 300
seconds later (for a total of 330 seconds) when the Monitor is set to the “Unhealthy2” state. The
transition state each Responder is tied to is set by the TargetHealthState property on a Responder
definition, which is an integer. Here are the transition states that the integer indicates:
0 = None
1 = Healthy
2 = Degraded
3 = Unhealthy
4 = Unrecoverable
5 = Degraded1
6 = Degraded2
7 = Unhealthy1
8 = Unhealthy2
9 = Unrecoverable1
10 = Unrecoverable2
We call the set of Responders tied to a Monitor's transition states a Responder chain. As a Monitor's
threshold continues to be met, stronger and stronger Responders execute until the Monitor determines
it is Healthy or an administrator is notified via event log escalation. If the code for this Monitor runs while
it is in the “Unhealthy1” state and the threshold is no longer met, the Monitor will immediately transition
to None. No more Responders will execute. Get-ServerHealth would again report this Monitor as
Healthy.
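If you want to see the Responder chain for a particular Monitor yourself, one rough approach (a sketch, not a definitive method; the Monitor name below is only a placeholder) is to match the Responders' AlertMask against the Monitor name and sort by TargetHealthState:

$monitorName = '<MonitorName>'   # placeholder: substitute a real Monitor name

# Parse the Responder definitions and keep those whose AlertMask points at the Monitor.
Get-WinEvent -LogName 'Microsoft-Exchange-ActiveMonitoring/ResponderDefinition' |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Where-Object { $_.AlertMask -like "*$monitorName*" } |
    Sort-Object { [int]$_.TargetHealthState } |
    Select-Object Name, TypeName, TargetHealthState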
Abram Jackson
Program Manager, Exchange Server
From <https://blogs.technet.microsoft.com/exchange/2013/07/16/managed-availability-monitors/>
Managed Availability Responders
Responders are the final critical part of Managed Availability. Recall that Probes are
how Monitors obtain accurate information about the experience your users are receiving.
Responders are what the Monitors use to attempt to fix the situation. Once they pass throttling,
they launch a recovery action such as restarting a service, resetting an IIS app pool, or anything else
the developers of Exchange have found often resolves the symptoms. Refer to the Responder
Timeline section of the Managed Availability Monitors article for information about when the
Responders are executed.
Definitions and Results
Just like Probes and Monitors, Responders have an event log channel for their definitions and
another for their results. The definitions can be found in Microsoft-Exchange-ActiveMonitoring/ResponderDefinition. Some of the important properties are:
 TypeName: The full code name of the recovery action that will be taken when this Responder executes.
 Name: The name of the Responder.
 ServiceName: The HealthSet this Responder is part of.
 TargetResource: The object this Responder will act on.
 AlertMask: The Monitor for this Responder.
 ThrottlePolicyXml: How often this Responder is allowed to execute. I’ll go into more details in the next section.
The results can be found in Microsoft-Exchange-ActiveMonitoring/ResponderResult. Responders
output a result on a recurring basis whether or not the Monitor indicates they should take a
recovery action. If a ResponderResult event has a RecoveryResult of 2 and IsRecoveryAttempted of
1, the Responder attempted a recovery action. Usually, you will want to skip the Responder results
and go straight to Microsoft-Exchange-ManagedAvailability/RecoveryActionResults, but let’s first
discuss the events in the Microsoft-Exchange-ManagedAvailability/RecoveryActionLogs event log channel.
Throttling
When a recovery action is attempted by a Responder, it is first checked against throttling limits.
This will result in one of two events in the RecoveryActionLogs channel: 2050, throttling has allowed
the operation, or 2051, throttling rejected the operation. Here’s a sample of a 2051 event:
In the details, you will see:
ActionId: RestartService
ResourceName: MSExchangeRepl
RequesterName: ServiceHealthMSExchangeReplEndpointRestart
ExceptionMessage: Active Monitoring Recovery action failed. An operation was rejected during local throttling. (ActionId=RestartService, ResourceName=MSExchangeRepl, Requester=ServiceHealthMSExchangeReplEndpointRestart, FailedChecks=LocalMinimumMinutes, LocalMaxInDay)
LocalThrottleResult: <LocalThrottlingResult IsPassed="false" MinimumMinutes="60" TotalInOneHour="1" MaxAllowedInOneHour="-1" TotalInOneDay="1" MaxAllowedInOneDay="1" IsThrottlingInProgress="true" IsRecoveryInProgress="false" ChecksFailed="LocalMinimumMinutes, LocalMaxInDay" TimeToRetryAfter="2015-02-11T14:29:57.9448377-08:00"> <MostRecentEntry Requester="ServiceHealthMSExchangeReplEndpointRestart" StartTime="2015-02-10T14:29:55.9920032-08:00" EndTime="2015-02-10T14:29:57.9448377-08:00" State="Finished" Result="Succeeded" /> </LocalThrottlingResult>
GroupThrottleResult: <not attempted>
TotalServersInGroup: 0
TotalServersInCompatibleVersion: 0
Hopefully, you recognize the first few fields. This is the RestartService recovery action, which
restarts a service. The ResourceName is used by the recovery action to pick a target; for
the RestartService recovery action, it is the name of the service to restart. The RequesterName is
the name of the Responder, as listed in the ResponderDefinition or ResponderResult channels.
The LocalThrottleResult property is more interesting. Recovery actions are throttled per server,
where the same recovery action cannot run too often on the same server, and per group, where
the same recovery action cannot run too often on the same DAG (for the Mailbox role) or AD site
(for the Client Access role). If a value is -1, this level of throttling is not used; for
example, MaxAllowedInOneHour is not interesting if only 1 action is allowed per day. In this
example, the MSExchangeRepl resource was already the target of a recovery action within the last
60 minutes, and so the recovery action did not pass the LocalMinimumMinutes throttling. As this
recovery action attempt was blocked by local throttling, the group throttling was not attempted.
Each of the limits mentioned in this event is described below, along with the corresponding local and group throttle configuration attribute names where applicable:
 IsPassed: True if throttling will allow the recovery action; otherwise, false.
 LocalMinimumMinutes, MinimumMinutes, GroupMinimumMinutes (local config attribute LocalMinimumMinutesBetweenAttempts; group config attribute GroupMinimumMinutesBetweenAttempts): The time that must elapse before this recovery action may act upon the same resource on this server or in this group.
 TotalInOneHour: The number of times this recovery action has acted upon this resource on this server or in this group in the last hour.
 MaxAllowedInOneHour, LocalMaxInHour (local config attribute LocalMaximumAllowedAttemptsInOneHour; no group config attribute): The number of times this recovery action is allowed to act upon this resource on this server or in this group in one hour.
 TotalInOneDay: The number of times this recovery action has acted upon this resource on this server or in this group in the last 24 hours.
 MaxAllowedInOneDay, LocalMaxInDay, GroupMaxInDay (local config attribute LocalMaximumAllowedAttemptsInADay; group config attribute GroupMaximumAllowedAttemptsInADay): The number of times this recovery action is allowed to act upon this resource on this server or in this group in 24 hours.
 IsRecoveryInProgress, RecoveryInProgress, GroupRecoveryInProgress: Whether this recovery action is already acting upon this resource and has not completed. If True, the new action will be aborted.
 TimeToRetryAfter: The time after which this recovery action would be allowed to act on this resource on this server or in this group.
The GroupThrottleResult has the same fields, and also gives details about the recovery actions
that have taken place on the other servers in the group.
If the action is not throttled, event 500 will be logged in the Microsoft-Exchange-ManagedAvailability/RecoveryActionResults channel, indicating that the recovery action is
beginning. If it succeeds, event 501 is logged. This is the most common case and where you’ll
usually want to start. These events also have details about the recovery action that was taken and
the throttling it passed. Recovery actions that start and then fail are still counted against throttling
limits. For more information about recovery actions, read the What Did Managed Availability Just
Do to This Service? article.
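A simple way to check which recovery actions have recently started or succeeded on a server is to pull those 500/501 events directly. A minimal sketch, run locally on the server (Get-WinEvent will report an error if no matching events exist yet):

# Most recent recovery action events: 500 = action started, 501 = action succeeded.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Exchange-ManagedAvailability/RecoveryActionResults'
    Id      = 500, 501
} -MaxEvents 20 |
    Select-Object TimeCreated, Id, Message |
    Format-List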
Viewing Throttling Limits
So what is the best way to find out what recovery action throttling is in place? You could wait for
the Responder to begin a recovery action and view the throttling settings in
the RecoveryActionLogs channel, but there are two places that will be more timely. The first is
the Microsoft-Exchange-ManagedAvailability/ThrottlingConfig event log channel. The second is
the Microsoft-Exchange-ActiveMonitoring/ResponderDefinition channel, introduced in the first
section of this article. The advantage of the ThrottlingConfig channel is that you can see all the
Responders that can take a particular recovery action grouped together, instead of having to check
every Responder definition. Here’s a sample event from the ThrottlingConfig event log channel:
Identity: RestartService/Default/*/*/msexchangefastsearch
RecoveryActionId: RestartService
ResponderCategory: Default
ResponderTypeName: *
ResponderName: *
ResourceName: msexchangefastsearch
PropertiesXml: <ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="4" GroupMinimumMinutesBetweenAttempts="-1" GroupMaximumAllowedAttemptsInADay="-1" />
The Identity of a throttling configuration is a concatenation of the next five fields, so let’s discuss
each. The RecoveryActionId is the Responder’s throttling type. You can find this as the name of
the ThrottleEntries node in the Responder definition’s ThrottlePolicyXml property.
The ResponderCategory is unused and is always Default right now. The ResponderTypeName is
the Responder’s TypeName property. The ResourceName is the object the Responder acts on. In
this example, Responders that use the RestartService recovery action to restart the
MSExchangeFastSearch process are allowed to do so on any server up to 4 times a day, as long as it
has been 60 minutes since this recovery action last restarted it on that server. The group throttling is
not used.
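To pull these entries with PowerShell rather than Event Viewer, a hedged sketch (again assuming the hyphen/slash form of the channel name) could look like this:

# Dump the throttling configuration entries for one recovery action type.
Get-WinEvent -LogName 'Microsoft-Exchange-ManagedAvailability/ThrottlingConfig' |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Where-Object { $_.RecoveryActionId -eq 'RestartService' } |
    Select-Object Identity, ResourceName, PropertiesXml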
The second method to view throttling limits is via the Microsoft-Exchange-ActiveMonitoring/ResponderDefinition events. This will include any overrides you have in place. Here
is the value of the ThrottlePolicyXml property from a ResponderDefinition event:
<ThrottleEntries>
  <RestartService ResourceName="MSExchangeFastSearch">
    <ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="4" GroupMinimumMinutesBetweenAttempts="-1" GroupMaximumAllowedAttemptsInADay="-1" />
  </RestartService>
</ThrottleEntries>
You can see that these attribute names and values match
the ThrottlingConfig event’s PropertiesXml values.
Changing Throttling Limits
There may be times when you want recovery actions to occur more frequently or less frequently.
For example, you have a customer report of an outage and you find that a service restart would
have fixed it but was throttled, or you have a third-party application that does particularly poorly
with application pool resets. To change the throttling configuration, you can use the same Add-ServerMonitoringOverride and Add-GlobalMonitoringOverride cmdlets that work for other
Managed Availability overrides. The Customizing Managed Availability article gives a good
summary on using these cmdlets. For the PropertyName parameter, the cmdlet supports a special
syntax for modifying the throttling configuration. Instead of specifying the entire XML blob as the
override (which will work, but will be harder to read later), you can
use ThrottleAttributes.LocalMinimumMinutesBetweenAttempts,
the PropertyName. Here’s an example:
or
the
other
properties,
as
Add-GlobalMonitoringOverride -ItemType Responder -Identity
Search\SearchIndexFailureRestartSearchService –PropertyName
ThrottleAttributes.LocalMinimumMinutesBetweenAttempts -PropertyValue 240 -ApplyVersion
"15.00.1044.025"
To only allow app pool resets by the ActiveSyncSelfTestRestartWebAppPool Responder every 2
hours instead of 1, you could use the command:
Add-GlobalMonitoringOverride -ItemType Responder -Identity
ActiveSync.Protocol\ActiveSyncSelfTestRestartWebAppPool -PropertyName
ThrottleAttributes.LocalMinimumMinutesBetweenAttempts -PropertyValue 120 -ApplyVersion
"Version 15.0 (Build 1044.25)"
If you want your servers to reboot when the MSExchangeIS service crashes and cannot start, with each
server rebooting at most once a day and no more than one server in the DAG rebooting every 60
minutes, you could use the commands:
Add-GlobalMonitoringOverride -ItemType Responder -Identity Store\StoreServiceKillServer -PropertyName ThrottleAttributes.GroupMinimumMinutesBetweenAttempts -PropertyValue 60 -ApplyVersion "15.00.1044.025"
Add-GlobalMonitoringOverride -ItemType Responder -Identity Store\StoreServiceKillServer -PropertyName ThrottleAttributes.GroupMaximumAllowedAttemptsInADay -PropertyValue -1 -ApplyVersion "15.00.1044.025"
The LocalMaximumAllowedAttemptsInADay value is already 1, so each server would still reboot
at most once per day. If the override was entered correctly, the ResponderDefinition
event’s ThrottlePolicyXml value will be updated, and there will be a new entry in the
ThrottlingConfig channel.
These may be poor examples, but it is hard to pick good ones as the Exchange developers pick
values for the throttling configuration based on our experience running Exchange in Office 365. We
don’t expect that changing these values is going to be something you’ll want to do very often, but
it is usually a better idea than disabling a monitor or a recovery action altogether. If you do have a
scenario where you need to keep a throttling limit override in place, we would love to hear about
it.
Abram Jackson
Program Manager, Exchange Server
From <https://blogs.technet.microsoft.com/exchange/2015/03/02/managed-availability-responders/>
Server Component States in Exchange 2013
Introduction
In Exchange 2013 we introduced the concept of “Server Component States”. Server Component State
provides granular control over the state of the components that make up an Exchange Server from the
point of view of the environment it is running in. This is useful when you want to take an Exchange 2013
Server out of operation partially or completely for a limited time, but still need the Exchange services on
the server to be up and running.
One example of such a situation is when “Managed Availability” (MA) comes to the conclusion that a
specific server is not healthy in some respects and therefore should be bypassed temporarily until the
bottleneck has been identified and removed. MA does so by utilizing so-called “Offline Responders”.
They are explained in some detail in Lessons from the Datacenter: Managed Availability. Their
counterpart are “Online Responders”, which bring the server back online when it is determined as being
healthy again.
Another example is when a server is being updated, for example with a new CU.
In both situations, the server cannot be taken offline completely, but it also should not be considered
a fully operational member of its Exchange organization.
The primary purpose of Managed Availability is to make the life of Exchange Administrators easier so
that they usually do not have to bother themselves with the details. However, in some situations, a certain
level of knowledge about the basic concepts behind “Server Component States” might prove to be
useful.
Verify the current State of the Server Components
A first overview of the current State of all Server Components can be displayed in the Exchange
Management Shell with the Get-ServerComponentState -Identity <ServerID> cmdlet:
You can see that the Server Components listed here do not map 1:1 to Exchange Services or processes
running on the server. Instead, they provide an abstraction-layer and display “Components” which
together make up the interfaces an Exchange Server provides to its environment. The majority of the
components have a name like “*Proxy”. These are specific to the CAS role, while other components like
“HubTransport” and “UMCallRouter” are part of the Mailbox server role and “Monitoring” and
“RecoveryActionsEnabled” belong to both roles.
In addition to the single components which can be managed individually, there’s also a component
called “ServerWideOffline”, which is used to manage the state of all components together, with the
exception of “Monitoring” and “RecoveryActionsEnabled”. For this purpose, “ServerWideOffline”
overrides individual settings for all other components. It doesn’t touch “Monitoring” and
“RecoveryActionsEnabled” because these two components need to stay active in order to keep MA
going. Without them, no “OnlineResponder” could bring “ServerWideOffline” back to “Active”
automatically.
About States and Requesters
Usually, Server Components are in one of two States: “Active” or “Inactive”. A third state, called
“Draining”, is only relevant for the Components “FrontendTransport” and “HubTransport”.
Whenever the state of a component is supposed to be changed, it has to be done by a “Requester”. For
example, the parameter –Requester is mandatory when you run the cmdlet Set-ServerComponentState:
There are altogether five “Requesters” defined:
 HealthAPI
 Maintenance
 Sidelined
 Functional
 Deployment
Requesters are labels – you can choose any of them when running Set-ServerComponentState. But each
Requester is treated and stored individually (more on this in the next section) and you should select the
Requester that best matches your intention. For example, when you need to set “ServerWideOffline” to
“Inactive” for maintenance purposes, it makes no sense to use “HealthAPI” as Requester. You might get
what you want that way in terms of functionality; but such a choice will make troubleshooting
unnecessarily complicated in case something does not work as expected, and you might get into a
conflict with MA. Whenever MA triggers an OfflineResponder or an OnlineResponder it uses “HealthAPI”
as Requester; therefore it is a good idea to consider “HealthAPI” as reserved for use by MA.
Interaction between States and Requesters
As stated above, each Requester is handled and stored individually. There’s no relationship or hierarchy
amongst them. However, in case of a conflict between two or more Requesters, “Inactive” has a higher
priority than “Active”.
Here’s a practical example:
Imagine that “ServerWideOffline” has been set to “Inactive” by two different Requesters, say “Functional”
and “Maintenance”:
Then, you set “ServerWideOffline” back to “Active” with one of the two Requesters:
As a result, “ServerWideOffline” and all dependent Components still remain in the state “Inactive”:
In order to set them back to “Active” again, Set-ServerComponentState … -State Active needs to be
executed with the second Requester as well.
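Expressed as commands, the combination above looks roughly like this (EX01 is a placeholder server name; only run this against a server you actually intend to take offline):

# Two different Requesters set ServerWideOffline to Inactive.
Set-ServerComponentState EX01 -Component ServerWideOffline -State Inactive -Requester Functional
Set-ServerComponentState EX01 -Component ServerWideOffline -State Inactive -Requester Maintenance

# Setting it back to Active with only one Requester is not enough;
# both Requesters must set it back to Active before the components leave the Inactive state.
Set-ServerComponentState EX01 -Component ServerWideOffline -State Active -Requester Functional
Set-ServerComponentState EX01 -Component ServerWideOffline -State Active -Requester Maintenance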
Obviously, administrators will rarely configure such combinations purposefully. However, we have seen
them happening as the result of a mix of processes running in the background and manual configuration.
Where is the data stored and how can it be retrieved?
Information about “Server Components”, “Requesters” and “States” is stored in two different places:
Active Directory and the server’s Registry. Storage in Active Directory facilitates running Set-ServerComponentState against a remote server.
In order to determine precedence in case of a divergence between the two places, a timestamp is used.
The newer setting is considered as the intended one.
In Active Directory, the settings are stored in the “msExchComponentStates” attribute of the Exchange
Server object in the Configuration-namespace:
In the Registry, the settings are stored under
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ServerComponentStates:
You can use the Get-ServerComponentState cmdlet from the Shell to retrieve these settings:
<variable>.LocalStates displays the settings in the local Registry and <variable>.RemoteStates displays
the settings in the Active Directory.
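A minimal sketch of that comparison (EX01 is a placeholder server name):

# Retrieve one component and compare the registry copy with the Active Directory copy.
$state = Get-ServerComponentState -Identity EX01 -Component HubTransport
$state.LocalStates    # settings stored in the server's registry
$state.RemoteStates   # settings stored in Active Directory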
Peculiarities of the Components “FrontendTransport” and “HubTransport”
Most Exchange Server components pick up changes in their Component State “on the fly”. However, this
is not the case for the two Components “FrontendTransport” (mapped to the “Microsoft Exchange
Frontend Transport” service on CAS Servers) and “HubTransport” (mapped to “Microsoft Exchange
Transport” on Mailbox Servers). They pick up changes upon their next restart only.
This can cause confusion about their actual state. For example, they might be displayed as “Inactive”
in Get-ServerComponentState but functionally still be active since they haven’t been restarted since their
state has changed.
In order to alert the Administrator about such an inconsistency, they write warnings into the Application
Event Log. These warnings (7011 for MSExchangeTransport and 7012 for MSExchangeFrontEndTransport)
inform the Administrator about the current and the expected state:
MA has Responders which take care of such inconsistencies and resolve them after a while by a forced
crash and restart of the affected services. These Responders are named
“FrontendTransport.ServiceInconsistentState.Restart.Responder” and
“HubTransport.ServiceInconsistentState.Restart.Responder”. They can be identified in
the Microsoft.Exchange.ActiveMonitoring/ResponderDefinition crimson channel using the
methodology described in the What Did Managed Availability Just Do To This Service? blog post.
In Exchange 2013 CU2 and CU3 they are only triggered once per day for standalone Mailbox Servers
and up to four times per day for DAG members. This might change in future versions, though.
After the inconsistencies have been cleared, Events 7009 from Source “MSExchangeTransport” (or
“MSExchangeFrontendTransport”) and Category “Components” are logged, which show the current
state:
When the state of one or both of these components is set to “Inactive”, each attempt to connect to the
SMTP service on the server (on TCP port 25) triggers the response “421 4.3.2 Service not active” (for the
FrontendTransport component) and “451 4.7.0 Temporary Server error” (for the HubTransport
component) and corresponding entries are written to the respective SmtpReceive Protocol Logs.
As mentioned in KB 2866822, messages sent from internal mailboxes stay in the Outbox or Drafts folder and
“Service not active” entries are logged in the “Connect*”-Logs in the “Submission”-Folder underneath
“TransportRoles\Logs\Mailbox\Connectivity\” when “HubTransport” is set to “Inactive”.
A problem with failing updates to CU2
While an Exchange 2013 Server is being updated with CU2, the setup process sets “Monitoring”,
“RecoveryActionsEnabled” and “ServerWideOffline” to Inactive using the Requester “Functional” at the
beginning, as can be seen in the “ExchangeSetup”-Logfile:
However, when the update exits prematurely because it encounters an unrecoverable error-condition, it
does not restore the original state. Even when the Administrator restarts all stopped Exchange services
or reboots the server, the Exchange components still remain in the Inactive state.
In order to recover from this situation, you must either find the root cause for the error and remove it
so that the setup completes successfully, or manually set the ServerComponentStates back to Active
with the Requester “Functional”.
This issue might be fixed in future CUs and SPs.
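A sketch of that manual recovery, using the same three components and the Requester that Setup used (EX01 is a placeholder server name):

# Return the components that Setup had set to Inactive back to Active.
Set-ServerComponentState EX01 -Component Monitoring -State Active -Requester Functional
Set-ServerComponentState EX01 -Component RecoveryActionsEnabled -State Active -Requester Functional
Set-ServerComponentState EX01 -Component ServerWideOffline -State Active -Requester Functional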
When should you change Server Component States manually?
Perhaps the two most important scenarios where manual changes of Server Component States come
into play are:
1. Planned server maintenance
2. Temporary isolation of some server components, so that they are not targets for proxy requests
from other CAS servers any more.
Server component states and planned maintenance of DAG
members
Scenario 1 is described for DAG-members in some detail in Managing Database Availability Groups on
TechNet. However, in practice, it can prove to be trickier than expected, depending on the details of the
planned maintenance measure.
The article does not mention that “FrontendTransport” and “HubTransport” have to be restarted before
a change in their Component State becomes effective at all. Without a restart, you continue to receive
warning events 7011 and 7012 in the Application Event Log, but the state of the components remains
the same as before, until eventually MA detects the inconsistency and force restarts the services.
Also, the problem with prematurely exiting CU setups mentioned above cannot be resolved by
following the recommendations in the article to the letter, because Setup uses the Requester
“Functional”, while the article talks about the Requester “Maintenance”.
Special thanks to Bhalchandra Atre and Stephen Gilbert for their contributions as well as Abram Jackson and
Bharat Suneja for their review.
From <https://blogs.technet.microsoft.com/exchange/2013/09/26/server-component-states-in-exchange-2013/>
Exchange Server 2013 Managed Availability
When Microsoft introduced Managed Availability in Exchange Server 2013, its appearance sowed quite
a bit of confusion in the Exchange world. The idea behind it—an automated system of health monitoring
that would watch critical components of your Exchange infrastructure and automatically take corrective
action to fix problems as they occurred—sounded great, but at its launch, Managed Availability was
poorly documented and largely misunderstood. Now that Exchange 2013 has been out in the field for a
while, both Microsoft and its customers are getting more operational experience with Managed
Availability. Understanding how it works and why it works that way will help you understand how
Managed Availability will affect your operating procedures and how to manage it to get the desired
outcomes. Note that you'll sometimes see references to Active Monitoring (AM) and Local Active
Monitoring (LAM) in the Managed Availability world. They're functional descriptions of the feature,
not real names, but the acronyms haven't been completely removed from the code base, event log
messages, and so on.
Defining the Data Center Downward
Microsoft, IBM, and many other enterprise-focused companies have long sought to build systems—in
the form of hardware, operating systems, and applications—that are resilient against failure or damage.
The goal of these efforts has been to bring mainframe-quality uptime to enterprise applications without
requiring the overhead and infrastructure required by these traditional systems.
We've all reaped the benefits. The redundant server hardware that's now almost a commodity used to
be found only in extremely demanding, budget-insensitive applications such as spaceflight, telephone
switching, and industrial control systems. Likewise, applications such as Microsoft SQL Server, Oracle's
database applications, and Microsoft Exchange have steadily gained more resiliency-focused features,
including clustering, transactional database logging, and a variety of application-specific protection
methods (e.g., Safety Net and shadow redundancy in Exchange). In general, these features focus on
detecting certain types of failures and automatically taking action to resolve them, such as activating a
database copy on another server. However, the next logical step in building resiliency into Exchange
required a departure from the previous means of doing so. Exchange needed more visibility into more
components of the system, as well as an expanded set of actions that it can take.
Advanced service monitoring relies on three complementary tasks. It has to monitor the state of every
interesting component, decide whether the data returned from monitoring indicates some type of
problem, then act to resolve the problem. This monitor-decide-act process has long been the province
of human administrators. You notice that something is wrong (perhaps as the result of a user report or
your own monitoring), you figure out what the problem is, then take one or more actions to fix the
problem. However, having humans responsible for that process doesn't scale well to very large
environments, such as the Exchange Online portion of Microsoft Office 365. Plus, it's tiresome for the
unlucky administrator who gets stuck having to fix problems on weekends and holidays. To address
these shortcomings and give Exchange more capability to self-diagnose and self-repair, Microsoft
delivered Managed Availability as part of Exchange 2013.
Understanding Managed Availability's Logical Design
Managed Availability is designed around three logical components: probe, monitor, and responder. The
probe runs tests against different aspects of Exchange. These tests can be performance based (e.g., how
long it takes for a logon transaction in Outlook Web App—OWA—to work), health based (e.g., whether
a particular service is currently running), or exception based (e.g., a monitored component generated
a bug check or another unusual event that indicates an immediate problem).
In Microsoft's words, the monitor "contains all of the business logic used by the system based on what
is considered healthy on the data collected." This is a fancy way of saying that the monitor is responsible
for interpreting data gathered by the probe to determine whether any action is required. Because the
monitor's behavior is specified by the same developers who wrote the code for each monitored
component, the monitor has intimate knowledge of how to tell whether a particular component is
healthy.
Managed Availability uses the terms "healthy" and "unhealthy" in pretty much the same sense as people
do. If a component is healthy, its performance and function are normal. An unhealthy component is
one that the monitor has decided isn't working properly. In addition to the basic healthy and unhealthy
states used by the monitor, there are other states that might appear when you check the state of a server.
(I'll discuss those later.)
The responder takes action based on what the monitor finds. These actions can range from restarting a
service to running a bug check on a server (thus forcing it to reboot) to logging an event in the event
log. Although logging an event might not seem like a very forceful step for the responder to take, the
idea is that the logged event will be picked up by a monitoring system such as Microsoft System Center
Operations Manager or Dell OpenManage, thus alerting a human, who can then take some kind of
action.
Understanding Managed Availability's Physical Design
Managed Availability is implemented by a pair of services that you'll find on every Exchange 2013
server: MSExchangeHMWorker.exe and MSExchangeHMHost.exe. MSExchangeHMWorker.exe is the
worker process that does the actual monitoring. MSExchangeHMHost.exe (shown as the Exchange
Health Manager Service in the Services snap-in) manages the worker process. The pairing of a
controller service with one or more worker processes is used throughout other parts of Exchange, such
as the Unified Messaging (UM) feature. You don't necessarily need to monitor the state of the worker
processes on your servers. However, you should monitor MSExchangeHMHost because if
MSExchangeHMHost isn't running, then no health monitoring will be performed on that server.
Microsoft doesn't support disabling the Managed Availability subsystem by stopping the
MSExchangeHMHost process. If Managed Availability is doing something you don't like, you should
tame it with the techniques I'll show you instead of turning it off completely.
The configuration settings for Managed Availability live in multiple places. This can be a little confusing
until you get used to it. The local settings for an individual server are stored in the registry at
HKLM\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides. When you override a
monitor on an individual server (which I'll discuss later), the override settings are stored in that
server's registry. Other settings, notably global overrides, are stored in Active Directory (AD) in
the Monitoring Settings container (CN=Overrides,CN=Monitoring Settings,CN=FM,CN=Microsoft
Exchange,CN=Services,CN=Configuration…). All the standard caveats about the need for healthy AD
replication thus apply to the proper functioning of Managed Availability.
Running Your First Health Check
The simplest way to start understanding Managed Availability is to use the Get-HealthReport cmdlet.
You can use it to learn about the states of all the health sets on the specified server. A health set is a
group of probes, monitors, and responders for a component. For example, if you want to check the
health sets for the server named WBSEXMR02, you'd run the command:
Get-HealthReport -Identity WBSEXMR02
When you run this command, you'll see output similar to that shown in Figure 1. The deluge of
information will probably tell you more than you wanted to know about what Managed Availability
thinks of the target server.
Figure 1: Obtaining the States of All the Health Sets on the Current Server
An Exchange component such as the Exchange ActiveSync (EAS) subsystem might have multiple health
sets associated with it, and each health set might contain multiple probes, monitors, and responders
that assess different aspects of items in the health set. One great example is OutlookRpcSelfTestProbe,
which has one probe for each mailbox database on the server. All of those probes roll up into the
Outlook.Protocol health set. Each reported health set includes a state (indicating whether it's online,
offline, or has no state associated with it), an alert value (indicating whether it's considered healthy or
not), and the last time the health state changed.
Understanding the Health Set States
It's useful to know the states that a health set can be in. Obviously, "healthy" is the preferred state.
When you see this state, it means that Managed Availability hasn't spotted anything wrong with the
components monitored in that health set. When a health set is shown as "unhealthy," it means that one
or more items monitored by that health set are broken in some way. As far as Managed Availability is
concerned, these are the two most important states. If a health set is healthy, it will be left alone. If it's
unhealthy, the responders will be engaged in the order specified by the developers until either the health
set becomes healthy again or the system runs out of responses and escalates the issue by notifying an
administrator about the problem.
There are four additional states that you might see when examining health reports:
 Degraded. The "degraded" state means that the monitored item has been unhealthy for less than 60 seconds. If it's not fixed by the 61st second, its status will change to "unhealthy."
 Disabled. The "disabled" state appears when you manually disable a monitor.
 Unavailable. The "unavailable" state appears when a monitor doesn't respond to queries from the health service. This is seldom a good thing and warrants investigation as soon as you see it.
 Repairing. The "repairing" state only appears when you set it as the state for a health set. It tells Managed Availability that you're aware of, and are fixing, problems with that particular component. For example, a mailbox database that you're reseeding would be labeled as unhealthy, so you'd set its status to "repairing" while you're working on it so that Managed Availability is aware that the failure is being addressed.
To indicate that you're manually fixing a problem, you use the Set-ServerMonitor cmdlet. For example,
if you need to make repairs on a server named HSVEX01, you'd run the command:
Set-ServerMonitor -Server HSVEX01 -Name Maintenance `
-Repairing $true
When you're done with the repairs, you'd run Set-ServerMonitor again, setting the -Repairing
parameter to $false.
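In other words, the closing command would look like this:

Set-ServerMonitor -Server HSVEX01 -Name Maintenance -Repairing $false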
Checking and Setting States
Components are logical objects that might contain multiple services or other objects. For example, the
FrontendTransport component is made up of multiple subcomponents, which you can see using the
Get-MonitoringItemIdentity cmdlet:
Get-MonitoringItemIdentity -Identity FrontendTransport
However, you can't manage those subcomponents as individual items. Instead, you need to use the Get-ServerComponentState and Set-ServerComponentState cmdlets on the component as a whole.
With the Get-ServerComponentState cmdlet, you can get a list of the components that Exchange
recognizes and those components' states. You use the -Identity parameter to specify the server. For
example, the following command lists the components and their states for the server named
WBSEXMR01:
Get-ServerComponentState WBSEXMR01
Figure 2 shows the results. If you want to see the state of a specific component, you can add the -Component parameter followed by the component's name.
Figure 2: Obtaining the States of All the Components on the Specified Server
You can use the Set-ServerComponentState cmdlet to manually change the states of individual
components on a server. There are many reasons why you might want to do this, including preparing
servers for maintenance or upgrade, or disabling a component that you don't want running on a
particular server.
To use Set-ServerComponentState, you must include four parameters:
 The -Component parameter. You use this parameter to specify the component whose state you're changing. You can specify the name of a subsystem or component (e.g., UMCallRouter, HubTransport) or the special value ServerWideOffline, which indicates that you want to change the state of all the components on the specified server.
 The -Identity parameter. You use this parameter to specify the name of the server on which you want to change the component state.
 The -Requester parameter. You use this parameter to specify why you're changing the state. Typically, you'll be specifying the value of Maintenance. The other possible values are HealthAPI, Sidelined, Functional, and Deployment.
 The -State parameter. You use this parameter to specify the desired state. It can be set to Active or Inactive for all components. Some components support a third state, Draining. If you set a component's state to Draining, it indicates that the component should finish processing existing connections, but it shouldn't accept or make new connections. For example, putting the HubTransport component into Draining state allows it to finish handling existing SMTP conversations, but it won't be able to participate in new conversations.
Although you can run Set-ServerComponentState any time, the most common reason to do so is when
you want to tell Managed Availability that you are starting or stopping planned maintenance on a
server. Doing so reduces the risk that Managed Availability will change the state of a component while
you're in the middle of working on it.
Preparing for Maintenance
In Exchange 2010, you typically use the StartDagServerMaintenance.ps1 script to indicate that you're
going to do maintenance on a database availability group (DAG) member server. The process for
performing maintenance on an Exchange 2013 DAG member is a little bit different. Here's how you'd
put a server named DWHDAG01 into maintenance mode:
1. Drain the transport queues by running the command:
Set-ServerComponentState DWHDAG01 `
-Component HubTransport -State Draining `
-Requester Maintenance
2. If your server is being used as a UM server, drain the UM calls by running the command:
Set-ServerComponentState DWHDAG01 `
-Component UMCallRouter -State Draining `
-Requester Maintenance
3. Put the server in maintenance mode by running the command:
Set-ServerComponentState DWHDAG01 `
-Component ServerWideOffline -State Inactive `
-Requester Maintenance
When you're done, you'd run Set-ServerComponentState on the same components but in reverse order
(ServerWideOffline, UMCallRouter, then HubTransport), putting them into the Active state.
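For reference, a sketch of that reverse sequence (same placeholder server, DWHDAG01):

# Take the server out of maintenance mode, then re-enable UM call routing and transport.
Set-ServerComponentState DWHDAG01 -Component ServerWideOffline -State Active -Requester Maintenance
Set-ServerComponentState DWHDAG01 -Component UMCallRouter -State Active -Requester Maintenance
Set-ServerComponentState DWHDAG01 -Component HubTransport -State Active -Requester Maintenance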
Turning Off a Service or Component
Microsoft tests Exchange as a complete set of services, so it doesn't necessarily support turning off
individual services, but sometimes you might need to do so anyway. Perhaps you want to troubleshoot
some aspect of your server's behavior, or you want to turn off services that you know you won't be using.
(The UM services sometimes meet this fate.) If you use the standard Windows service management
tools to stop an Exchange service, Managed Availability will see that as a failure and try to turn the
service back on, working through its responders as designed. The responders for a component might
cause the server to reboot or run a bug check, which could cause problems.
To avoid this situation, you could disable the service using Service Control Manager (SCM), but then
Managed Availability will become unhappy and report that the server's health is poor. The best option
is to turn off the managed service using the Set-ServerComponentState cmdlet. For example, if you
want to turn off the RecoveryActionsEnabled service on the DWHDAG01 server, you'd run the
command:
Set-ServerComponentState -Component RecoveryActionsEnabled `
-Identity DWHDAG01 -State Inactive -Requester Functional
Using Overrides
Managed Availability implements a sort of get-out-of-jail-free card in the form of overrides. An override
allows you to change the thresholds used by the monitor for determining whether a particular
component is healthy or change the action taken by the responder when a component becomes
unhealthy. Typically, you won't have to do this. However, there might be cases when it's necessary. For
example, when Microsoft shipped Cumulative Update 3 (CU3) for Exchange 2013, it added a new probe
for public folder access through Exchange Web Services that would cause the public folder subsystem
to be marked as unhealthy if you didn't have any public folders in Exchange 2013. The fix, as described
in the Microsoft Support article "PublicFolders health set is "Unhealthy" after you install Exchange
Server 2013 Cumulative Update 3," is to add an override to tell Managed Availability to stop caring
about that particular probe.
There are actually two types of overrides: server overrides, which apply to a single server, and global
overrides, which apply to all servers in the Exchange organization. You apply overrides using the Add-ServerMonitoringOverride or Add-GlobalMonitoringOverride cmdlet. Here are some examples from
the Microsoft Support article:
Add-GlobalMonitoringOverride -Identity `
"Publicfolders\PublicFolderLocalEWSLogonEscalate" `
-ItemType "Responder"-PropertyName Enabled `
-PropertyValue 0 -ApplyVersion "15.0.775.38"
Add-GlobalMonitoringOverride -Identity `
"Publicfolders\PublicFolderLocalEWSLogonMonitor" `
-ItemType "Monitor" -PropertyName Enabled `
-PropertyValue 0 -ApplyVersion "15.0.775.38"
Add-GlobalMonitoringOverride -Identity `
"Publicfolders\PublicFolderLocalEWSLogonProbe" `
-ItemType "Probe" -PropertyName Enabled `
-PropertyValue 0 -ApplyVersion "15.0.775.38"
Note that the cmdlet appears three times with the same value for the -Identity parameter: once to
disable the responder, once to disable the monitor, and once to disable the probe for the specified
object.
Dealing with Occasional Oddities in Managed Availability
Managed Availability is a complex subsystem, and you might find that it occasionally behaves in ways
you don't expect. For example, when you set the state of the FrontendTransport and HubTransport
components to Inactive or Draining, you might notice that you're still seeing event log IDs 7011 and
7012, indicating that Managed Availability thinks those services are down. Managed Availability will
eventually trigger a responder that restarts the services for you, or you can manually restart them to
restore the correct behavior. It's also sometimes the case that other operations confuse Managed
Availability so that it isn't aware of, or doesn't report, the correct state of the items it monitors. For
example, installing Exchange 2013 CU3 would sometimes make monitored services incorrectly appear
to be unhealthy. These problems are usually easy to fix by restarting the affected service or using Set-ServerComponentState. You also have the ability to create overrides for cases when you have no other
easy way to fix the problem. Problems like these are pretty rare, and they don't detract from the utility
of Managed Availability over the long term.
The Future of Managed Availability
Managed Availability has a huge amount of potential because of the nature of its design. The people
who wrote the code that makes up Exchange also wrote the Managed Availability probes, monitors, and
responders that monitor it, so as the Exchange code evolves and changes, Managed Availability can
keep pace. The idea of having a self-monitoring, self-healing Exchange server is an attractive one,
although in its current implementation it's limited to watching individual servers. The existing UI for
Managed Availability relies on the Exchange Management Shell (EMS), and that might not change,
although hopefully we'll see better integration between the monitoring tools available in the Exchange
Admin Center (EAC) and the health reports generated by Managed Availability. As Office 365 expands
in scale, Microsoft is likely to continue investing in Managed Availability's ability to correlate health
states across larger parts of the Exchange infrastructure, and those improvements will move over to on-premises implementations, too.
From <http://windowsitpro.com/exchange-server-2013/exchange-server-2013-managed-availability>
New Managed Availability Documentation Available
Ensuring that users have a good email experience has always been the primary objective for messaging system
administrators. To help ensure the availability and reliability of your messaging system, all aspects of the system must
be actively monitored, and any detected issues must be resolved quickly.
In previous versions of Exchange, monitoring critical system components often involved using an external application
such as Microsoft System Center 2012 Operations Manager to collect data, and to provide recovery action for problems
detected as a result of analyzing the collected data. Exchange 2010 and previous versions included health manifests
and correlation engines in the form of management packs. These components enabled Operations Manager to make
a determination as to whether a particular component was healthy or unhealthy. In addition, Operations Manager also
used the diagnostic cmdlet infrastructure built into Exchange 2010 to run synthetic transactions against various aspects
of the system.
Exchange 2013 takes a new approach to monitoring and preserving the end user experience natively using a feature
called Managed Availability that provides built-in monitoring and recovery actions.
Overview
Managed availability, also known as Active Monitoring or Local Active Monitoring, is the integration of built-in
monitoring and recovery actions with the Exchange high availability platform. It’s designed to detect and recover from
problems as soon as they occur and are discovered by the system. Unlike previous external monitoring solutions and
techniques for Exchange, managed availability doesn’t try to identify or communicate the root cause of an issue. It’s
instead focused on recovery aspects that address three key areas of the user experience:
 Availability: Can users access the service?
 Latency: How is the experience for users?
 Errors: Are users able to accomplish what they want?
Managed availability is an internal process that runs on every Exchange 2013 server. It polls and analyzes hundreds of
health metrics every second. If something is found to be wrong, most of the time it will be fixed automatically. But there
will always be issues that managed availability won’t be able to fix on its own. In those cases, managed availability will
escalate the issue to an administrator by means of event logging.
For more information about this new feature, see the newly published topic Managed Availability.
Health Sets
From a reporting perspective, managed availability has two views of health, one internal and one external. The internal
view uses health sets. Each component in Exchange 2013 (for example, Outlook Web App, Exchange ActiveSync, the
Information Store service, content indexing, transport services, etc.) is monitored by managed availability using probes,
monitors, and responders. A group of probes, monitors, and responders for a given component is called a health set; the
health set determines whether that component is healthy. The current state
of a health set (e.g., whether it is healthy or unhealthy) is determined by using the state of the health set’s monitors. If
all of a health set’s monitors are healthy, then the health set is in a healthy state. If any monitor is not in a healthy state,
then the health set state will be determined by its least healthy monitor.
For detailed steps to view server health or health sets state, see the newly published topic Manage Health Sets and
Server Health. For information on troubleshooting health sets, see this topic.
Health Groups
The external view of managed availability is composed of health groups. Health groups are exposed to System Center
Operations Manager 2007 R2 and System Center Operations Manager 2012.
There are four primary health groups:
 Customer Touch Points: Components that affect real-time user interactions, such as protocols, or the Information Store
 Service Components: Components without direct, real-time user interactions, such as the Microsoft Exchange Mailbox Replication service, or the offline address book generation process (OABGen)
 Server Components: The physical resources of the server, such as disk space, memory, and networking
 Dependency Availability: The server’s ability to access necessary dependencies, such as Active Directory, DNS, etc.
When the Exchange 2013 Management Pack is installed, System Center Operations Manager (SCOM) acts as a health
portal for viewing information related to the Exchange environment. The SCOM dashboard includes three views of
Exchange server health:
1. Active Alerts: Escalation Responders write events to the Windows event log that are consumed by the monitor within SCOM. These appear as alerts in the Active Alerts view.
2. Organization Health: A rollup summary of the overall health of the Exchange organization is displayed in this view. These rollups include displaying health for individual database availability groups, and health within specific Active Directory sites.
3. Server Health: Related health sets are combined into health groups and summarized in this view.
Overrides
Overrides provide an administrator with the ability to configure some aspects of the managed availability probes,
monitors, and responders. Overrides can be used to fine tune some of the thresholds used by managed availability.
They can also be used to enable emergency actions for unexpected events that may require configuration settings that
are different from the out-of-box defaults.
Overrides can be created and applied to a single server (this is known as a server override), or they can be applied to a
group of servers (this is known as a global override). Server override configuration data is stored in the Windows registry
on the server on which the override is applied. Global override configuration data is stored in Active Directory.
Overrides can be configured to last indefinitely, or they can be configured for a specific duration. In addition, global
overrides can be configured to apply to all servers, or only servers running a specific version of Exchange.
For detailed steps to view or configure server or global overrides, see Configure Managed Availability Overrides.
When you configure an override, it will not take effect immediately. The Microsoft Exchange Health Manager service
checks for updated configuration data every 10 minutes. In addition, global overrides will be dependent on Active
Directory replication latency.
Below are some examples of adding and removing global and server overrides:
Example 1 – Make Information Store maintenance assistant alerts non-urgent for 60 days:
Add-GlobalMonitoringOverride -Identity Store\MaintenanceAssistantEscalate -ItemType
Responder -PropertyName NotificationServiceClass -PropertyValue 1 -Duration
60.00:00:00
Example 2 – Change the maintenance assistant monitor to look for 32 hours of failures for 30 days:
Add-GlobalMonitoringOverride -Identity Store\DirectoryServiceAndStoreMaintenanceAssistantMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds -PropertyValue 115200 -Duration 30.00:00:00
Example 3 – Remove the maintenance assistant monitor override added in Example 2:
Remove-GlobalMonitoringOverride -Identity Store\DirectoryServiceAndStoreMaintenanceAssistantMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds
Example 4 – Remove the Information Store maintenance assistant alerts non-urgent override added in
Example 1:
Remove-GlobalMonitoringOverride -Identity Store\MaintenanceAssistantEscalate -ItemType
Responder -PropertyName NotificationServiceClass
Example 5 – Apply the database repeatedly mounting threshold override (change to 60 minutes) for a period
of 60 days:
Add-GlobalMonitoringOverride -Identity Store\DatabaseRepeatedMountsMonitor -ItemType
Monitor -PropertyName MonitoringIntervalSeconds -PropertyValue 3600 -Duration
60.00:00:00
Example 6 – Remove the database repeatedly mounting threshold override added in Example 5:
Remove-GlobalMonitoringOverride -Identity Store\DatabaseRepeatedMountsMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds
Example 7 – Change the database dismounted alert from HA to Store for a period of 7 days:
Add-GlobalMonitoringOverride -Identity Store\DatabaseAvailabilityEscalate -ItemType Responder -PropertyName ExtensionAttributes.Microsoft.Mapi.MapiExceptionMdbOffline -PropertyValue Store -Duration 7.00:00:00
Example 8 – Disable VersionBucketsAllocated monitor for a period of 60 days:
Add-GlobalMonitoringOverride -Identity Store\VersionBucketsAllocatedMonitor -ItemType
Monitor -PropertyName Enabled -PropertyValue 0 -Duration 60.00:00:00
Example 9 – Update logs threshold in DatabaseSize monitor for a period of 60 days:
Add-GlobalMonitoringOverride -Identity MailboxSpace\DatabaseSizeMonitor -ItemType Monitor -PropertyName ExtensionAttributes.DatabaseLogsThreshold -PropertyValue 100GB -Duration 60.00:00:00
Example 10 – Applying a server override to disable quarantine monitor across all database copies for a period
of 7 days:
(get-mailboxDatabase <DB Name>).servers | %{Add-ServerMonitoringOverride -Server $_.name -Identity "Store\MailboxQuarantinedMonitor\<DB Name>" -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration:7.00:00:00 -Confirm:$false;}
Management Tasks and Cmdlets
There are three primary operational tasks that administrators will typically perform with respect to managed availability:
 Extracting or viewing system health
 Viewing health sets, and details about probes, monitors, and responders
 Managing overrides
The two primary management tools for managed availability are the Windows Event Log and the Shell. Managed
availability logs a large amount of information in the Exchange ActiveMonitoring and ManagedAvailability crimson
channel event logs, such as:
 Probe, monitor, and responder definitions, which are logged in the respective *Definition event logs.
 Probe, monitor, and responder results, which are logged in the respective *Results event logs.
 Details about responder recovery actions, including when the recovery action is started and when it is considered complete (whether successful or not), which are logged in the RecoveryActionResults event log.
There are 12 cmdlets used for managed availability, which are described in the following list.
Get-ServerHealth: Used to get raw server health information, such as health sets and their current state (healthy or unhealthy), health set monitors, server components, target resources for probes, and timestamps related to probe or monitor start or stop times, and state transition times.
Get-HealthReport: Used to get a summary health view that includes health sets and their current state.
Get-MonitoringItemIdentity: Used to view the probes, monitors, and responders associated with a specific health set.
Get-MonitoringItemHelp: Used to view descriptions about some of the properties of probes, monitors, and responders.
Add-ServerMonitoringOverride: Used to create a local, server-specific override of a probe, monitor, or responder.
Get-ServerMonitoringOverride: Used to view a list of local overrides on the specified server.
Remove-ServerMonitoringOverride: Used to remove a local override from a specific server.
Add-GlobalMonitoringOverride: Used to create a global override for a group of servers.
Get-GlobalMonitoringOverride: Used to view a list of global overrides configured in the organization.
Remove-GlobalMonitoringOverride: Used to remove a global override.
Set-ServerComponentState: Used to configure the state of one or more server components.
Get-ServerComponentState: Used to view the state of one or more server components.
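Of the 12, Get-HealthReport is the quickest way to get a first impression of a server. The following is a minimal sketch; EX01 is a placeholder server name, and on some Exchange 2013 builds the server parameter is -Server rather than -Identity.
# Summary view of every health set on one server (EX01 is a placeholder)
Get-HealthReport -Identity EX01 | Format-Table HealthSet, State, AlertValue, LastTransitionTime -AutoSize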
From <https://blogs.technet.microsoft.com/scottschnoll/2013/11/25/new-managed-availability-documentation-available/>
Responding to Managed Availability
I’ve written a few blog posts now that get into the deep technical details of Managed Availability. I hope
you’ve liked them, and I’m not about to stop! However, I’ve gotten a lot of feedback that we also need
some simpler overview articles. Fortunately, we’ve just completed documentation on TechNet with an
overview of Managed Availability. This was written to address how the feature may be managed day-to-day.
Even that documentation doesn’t address how you respond when Managed Availability cannot resolve
a problem on its own. This is the most common interaction with Managed Availability, but we
haven't described specifically how to do so.
When Managed Availability is unable to recover the health of a server, it logs an event. Exchange Server
has a long history of logging warning, error, and critical events into various channels when things go
wrong. However, there are two things about Managed Availability events that make them more generally
useful than our other error events:
 They all go to the same place on a server, without any clutter
 They will only be logged when the standard recovery actions fail to restore the health of the component
When one of these events is logged on any server in our datacenters, a member of the product group team responsible for that health set gets an immediate phone call.
No one likes to wake up at 2 AM to investigate and fix a problem with a server. This keeps us motivated
to only have Managed Availability alerts for problems that really matter, and also to eliminate the cause
of the alert by fixing underlying code bugs or automating the recovery. At the same time, there is nothing
worse than finding out about incidents from customer calls to support. Every time that happens we have
painful meetings about how we should have detected the condition first and woken someone up. These
two conflicting forces strongly motivate the entire engineering team to keep these events accurate and
useful.
The GUI
Along with a phone call, the on-call engineer receives an email with some information about the failure.
The contents of this email are pulled from the event’s description.
The path in Event Viewer for these events is Microsoft-Exchange-ManagedAvailability/Monitoring. Error
event 4 means that a health set has failed and gives the details of the monitor that has detected the
failure. Information event 1 means that all monitors of a health set have become healthy.
The Exchange 2013 Management Pack for System Center Operations Manager nicely shows only the
health sets that are currently failed instead of the Event Viewer’s method of displaying all health sets
that have ever failed. SCOM will also roll health sets up into four primary health groups, presented in three dashboard views.
The Shell
This wouldn’t be EHLO without some in-depth PowerShell scripts. The event viewer is nice and SCOM is
great, but not everyone has SCOM. It would be pretty sweet to get the same behavior as SCOM to show
only the health sets on a server that are currently failed.
Note: these logs serve a slightly different purpose than Get-HealthReport. Get-HealthReport shows the
current health state of all of a server's monitors. On the other hand, events are only logged in this channel
once all the recovery actions for that monitor have been exhausted without fixing the problem. Also
know that these events detail the failure. If you're only going to take action based on one health metric,
the events in this log are the better choice. Get-HealthReport is still the best tool to show you the up-to-the-minute user experience.
We have a sample script that can help you with this; it is commented in a way that you can see what we
were trying to accomplish. You can get the Get-ManagedAvailabilityAlerts.ps1 script here.
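If you only want a quick look without downloading the script, a bare-bones query against the channel described above is enough; the script goes further and shows only the health sets whose most recent event is still the error. A sketch:
# Escalation events (error 4) and recovery events (informational 1) from the last 24 hours
$log = 'Microsoft-Exchange-ManagedAvailability/Monitoring'
Get-WinEvent -FilterHashtable @{ LogName = $log; Id = 4; StartTime = (Get-Date).AddDays(-1) } -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message
Get-WinEvent -FilterHashtable @{ LogName = $log; Id = 1; StartTime = (Get-Date).AddDays(-1) } -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message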
Either this method or Event Viewer will work pretty well for a handful of servers. If you have tens or
hundreds of servers, we really recommend investing in SCOM or another robust and scalable event-collection system.
My other posts have dug deeply into troubleshooting difficult problems and into the immense amount of
information Managed Availability provides about a server's health. We rarely need to use these
troubleshooting methods when running our datacenters. However, the only thing you need to resolve
Exchange problems the way we do in Office 365 is a little time in Event Viewer or a scheduled script.
Abram Jackson
Program Manager, Exchange Server
From <https://blogs.technet.microsoft.com/exchange/2013/12/16/responding-to-managed-availability/>
Managed Availability
Managed availability, also known as Active Monitoring or Local Active Monitoring, is the integration of built-in monitoring and
recovery actions with the Exchange high availability platform. It's designed to detect and recover from problems as soon as
they occur and are discovered by the system. Unlike previous external monitoring solutions and techniques for Exchange,
managed availability doesn't try to identify or communicate the root cause of an issue. It's instead focused on recovery
aspects that address three key areas of the user experience:
 Availability Can users access the service?
 Latency How is the experience for users?
 Errors Are users able to accomplish what they want?
Managed availability provides a native health monitoring and recovery solution. It moves away from monitoring individual
separate slices of the system to monitoring the end-to-end user experience, and protecting the end user's experience through
recovery-oriented actions.
Managed availability is an internal process that runs on every Exchange 2016 server. It polls and analyzes hundreds of health
metrics every second. If something is found to be wrong, most of the time it will be fixed automatically. But there will always
be issues that managed availability won’t be able to fix on its own. In those cases, managed availability will escalate the issue
to an administrator by means of event logging.
Managed availability is implemented in the form of two services:
 Exchange Health Manager Service (MSExchangeHMHost.exe) This is a controller process used to manage
worker processes. It's used to build, execute, and start and stop the worker process, as needed. It's also used to
recover the worker process in case that process fails, to prevent the worker process from being a single point of
failure.
 Exchange Health Manager Worker process (MSExchangeHMWorker.exe) This is the worker process
responsible for performing run-time tasks within the managed availability framework.
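To quickly confirm that both pieces are alive on a server, something like the following is enough. The MSExchangeHM service name is an assumption; the two process names come from the bullets above.
# Health Manager controller service (service name MSExchangeHM is an assumption)
Get-Service MSExchangeHM
# Controller and worker processes described above
Get-Process MSExchangeHMHost, MSExchangeHMWorker | Select-Object Name, Id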
Managed availability uses persistent storage to perform its functions:
 XML files in the \bin\Monitoring\config folder are used to store configuration settings for some of the probe and
monitor work items.
 Active Directory is used to store global overrides.
 The Windows registry is used to store run-time data, such as bookmarks, and local (server-specific) overrides.
 The Windows crimson channel event log infrastructure is used to store the work item results.
 Health mailboxes are used for probe activity. Multiple health mailboxes will be created on each mailbox database
that exists on the server.
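The health mailboxes in the last bullet can be listed directly; the -Monitoring switch limits Get-Mailbox to them.
# Health mailboxes used for probe activity
Get-Mailbox -Monitoring | Format-Table Name, Database -AutoSize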
Managed Availability Components
As illustrated in the following drawing, managed availability includes three main asynchronous components that are
constantly doing work.
Managed Availability Components
Probes
The first component is called a Probe. Probes are responsible for taking measurements on the server and collecting data.
There are three primary categories of probes: recurrent probes, notifications, and checks. Recurrent probes are synthetic
transactions performed by the system to test the end-to-end user experience. Checks are the infrastructure that perform the
collection of performance data, including user traffic. Checks also measure the collected data against thresholds that are set
to determine spikes in user failures, which enable the checks infrastructure to become aware when users are experiencing
issues. Finally, the notification logic enables the system to take action immediately, based on a critical event, and without
having to wait for the results of the data collected by a probe. These are typically exceptions or conditions that can be
detected and recognized without a large sample set.
Recurrent probes run every few minutes and evaluate some aspect of service health. These probes might transmit an email
via Exchange ActiveSync to a monitoring mailbox, they might connect to an RPC endpoint, or they might verify Client Access-to-Mailbox connectivity.
All probes are defined on Health Manager service startup in the Microsoft.Exchange.ActiveMonitoring\ProbeDefinition
crimson channel. Each probe definition has many properties, but the most relevant properties are:
 Name The name of the probe, which begins with a SampleMask of the probe’s monitor.
 TypeName The code object type of the probe that contains the probe’s logic.
 ServiceName The name of the health set that contains this probe.
 TargetResource The object the probe is validating. This is appended to the name of the probe when it is executed
to become a probe result ResultName.
 RecurrenceIntervalSeconds How often the probe executes.
 TimeoutSeconds How long the probe will wait before failing.
There are hundreds of recurrent probes. Many of these probes are per-database, so as the number of databases increases,
so does the number of probes. Most probes are defined in code and are therefore not directly discoverable.
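Because the definitions are written to the ProbeDefinition channel at startup, you can still enumerate them there. The sketch below assumes the Microsoft-Exchange-ActiveMonitoring/ProbeDefinition log name and the event XML layout (event > userData > eventXml) these definition events normally use.
# List probe definitions with the properties described above
Get-WinEvent -LogName 'Microsoft-Exchange-ActiveMonitoring/ProbeDefinition' |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Select-Object Name, ServiceName, TargetResource, RecurrenceIntervalSeconds, TimeoutSeconds |
    Sort-Object ServiceName, Name |
    Format-Table -AutoSize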
The basics of a recurrent probe are as follows: start every RecurrenceIntervalSeconds and check (or probe) some aspect of
health. If the component is healthy, the probe passes and writes an informational event to the
Microsoft.Exchange.ActiveMonitoring\ProbeResult channel with a ResultType of 3. If the check fails or times out, the probe
fails and writes an error event to the same channel. A ResultType of 4 means the check failed and a ResultType of 1 means
that it timed out. Many probes will re-run if they time out, up to the value of the MaxRetryAttempts property.
Note:
The ProbeResult crimson channel can get very busy with hundreds of probes running every few minutes and logging an
event, so there can be a real impact on the performance of your Exchange server if you try expensive queries against
the event logs in a production environment.
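When you do need to look at probe results, keep the query narrow: filter on a time window, cap the number of events, and only then filter on the result type. A sketch, assuming the Microsoft-Exchange-ActiveMonitoring/ProbeResult log name and the ResultName/ResultType/Error fields in the event XML:
# Probe failures (ResultType 4) from the last hour, capped at 200 events
Get-WinEvent -FilterHashtable @{
        LogName   = 'Microsoft-Exchange-ActiveMonitoring/ProbeResult'
        StartTime = (Get-Date).AddHours(-1)
    } -MaxEvents 200 |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Where-Object { $_.ResultType -eq 4 } |
    Select-Object ResultName, ResultType, Error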
Notifications are probes that are not run by the health manager framework, but by some other service on the server. These
services perform their own monitoring, and then feed their data into the Managed Availability framework by directly writing
probe results. You won’t see these probes in the ProbeDefinition channel, as this channel only describes probes that will be
run by the Managed Availability framework. For example, the ServerOneCopyMonitor Monitor is triggered by probe results
written by the MSExchangeDAGMgmt service. This service performs its own monitoring, determines whether there is a
problem, and logs a probe result. Most notification probes have the capability to log both a red event that turns the monitor
unhealthy and a green event that makes the monitor healthy again.
Checks are probes that only log events when a performance counter passes above or below a defined threshold. They are
really a special case of notification probes, as there is a service monitoring the performance counters on the server and
logging events to the ProbeResult channel when the configured threshold is met.
To find the counter and threshold that is considered unhealthy, you can look at the monitor for this check. Monitors of the
type Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueAboveThresholdMonitor or
Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueBelowThresholdMonitor mean that the probe they watch is a check probe.
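A sketch of that lookup, assuming the Microsoft-Exchange-ActiveMonitoring/MonitorDefinition log name and that the definition events expose SampleMask and MonitoringThreshold fields:
# Monitors that watch check probes, with the threshold they apply
Get-WinEvent -LogName 'Microsoft-Exchange-ActiveMonitoring/MonitorDefinition' |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Where-Object { $_.TypeName -like '*SampleValue*ThresholdMonitor' } |
    Select-Object Name, SampleMask, MonitoringThreshold |
    Format-Table -AutoSize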
Monitors
The results of the measurements collected by probes flow into the second component, the Monitor. The monitor contains all
of the business logic used by the system on the data collected. Similar to a pattern recognition engine, the monitor looks for
the various different patterns on all the collected measurements, and then it decides whether something is considered
healthy.
Monitors query the data to determine if action needs to be taken based on a predefined rule set. Depending on the rule or
the nature of the issue, a monitor can either initiate a responder or escalate the issue to a human via an event log entry. In
addition, monitors define how much time after a failure that a responder is executed, as well as the workflow of the recovery
action. Monitors have various states. From a system state perspective, monitors have two states:
 Healthy The monitor is operating properly and all collected metrics are within normal operating parameters.
 Unhealthy The monitor isn't healthy and has either initiated recovery through a responder or notified an
administrator through escalation.
From an administrative perspective, monitors have additional states that appear in the Shell:
 Degraded When a monitor is in an unhealthy state from 0 through 60 seconds, it's considered Degraded. If a
monitor is unhealthy for more than 60 seconds, it is considered Unhealthy.
 Disabled The monitor has been explicitly disabled by an administrator.
 Unavailable The Exchange Health service periodically queries each monitor for its state. If it doesn't get a response
to the query, the monitor state becomes Unavailable.
 Repairing An administrator sets the Repairing state to indicate to the system that corrective action is in process by
a human, which allows the system and humans to differentiate between other failures that may occur at the same
time corrective action is being taken (such as a database copy reseed operation).
Every monitor has a SampleMask property in its definition. As the monitor executes, it looks for events in the ProbeResult
channel that have a ResultName that matches the monitor’s SampleMask. These events could be from recurrent probes,
notifications, or checks. If the monitor’s thresholds are achieved, it becomes Unhealthy. From the monitor’s perspective, all
three probe types are the same as they each log to the ProbeResult channel.
It is worth noting that a single probe failure does not necessarily indicate that something is wrong with the server. It is the
design of monitors to correctly identify when there is a real problem that needs fixing. This is why many monitors have
thresholds of multiple probe failures before becoming Unhealthy. Even then, many of these problems can be fixed
automatically by responders, so the best place to look for problems that require manual intervention is in the
Microsoft.Exchange.ManagedAvailability\Monitoring crimson channel. This will include the most recent probe error.
Responders
Finally, there are Responders, which are responsible for recovery and escalation actions. As their name implies, responders
execute some sort of response to an alert that was generated by a monitor. When something is unhealthy, the first action is
to attempt to recover that component. This could include multi-stage recovery actions; for example, the first attempt may be
to restart the application pool, the second may be to restart the service, the third attempt may be to restart the server, and
the subsequent attempt may be to take the server offline so that it no longer accepts traffic. If the recovery actions are
unsuccessful, the system escalates the issue to a human through event log notifications.
Responders take a variety of recovery actions, such as resetting an application worker pool or restarting a server. There are
several types of responders:
 Restart Responder Terminates and restarts a service.
 Reset AppPool Responder Stops and restarts an application pool in Internet Information Services (IIS).
 Failover Responder Initiates a database or server failover.
 Bugcheck Responder Initiates a bugcheck of the server, thereby causing a server reboot.
 Offline Responder Takes a protocol on a server out of service (rejects client requests).
 Online Responder Places a protocol on a server back into production (accepts client requests).
 Escalate Responder Escalates the issue to an administrator via event logging.
In addition to the above listed responders, some components also have specialized responders that are unique to their
component.
All responders include throttling behavior, which provide a built-in sequencing mechanism for controlling responder actions.
The throttling behavior is designed to ensure that the system isn’t compromised or made worse as a result of responder
recovery actions. All responders are throttled in some fashion. When throttling occurs, the responder recovery action may be
skipped or delayed, depending on the responder action. For example, when the Bugcheck Responder is throttled, its action
is skipped, and not delayed.
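The recovery actions that responders take are recorded in the RecoveryActionResults log mentioned earlier. A quick look, assuming the full channel name is Microsoft-Exchange-ManagedAvailability/RecoveryActionResults:
# The 20 most recent responder recovery actions on this server
Get-WinEvent -LogName 'Microsoft-Exchange-ManagedAvailability/RecoveryActionResults' -MaxEvents 20 |
    Select-Object TimeCreated, Id, Message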
Health Sets
From a reporting perspective, managed availability has two views of health, one internal and one external.
The internal view uses health sets. Each component in Exchange 2016 (for example, Outlook on the web, Exchange ActiveSync,
the Information Store service, content indexing, transport services, etc.) is monitored by managed availability using probes,
monitors, and responders. A group of probes, monitors and responders for a given component is called a health set. A health
set is a group of probes, monitors, and responders that determine if that component is healthy. The current state of a health
set (e.g., whether it is healthy or unhealthy) is determined by using the state of the health set’s monitors. If all of a health
set’s monitors are healthy, then the health set is in a healthy state. If any monitor is not in a healthy state, then the health set
state will be determined by its least healthy monitor.
For detailed steps to view server health or health set state, see Manage health sets and server health.
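Since a health set takes the state of its least healthy monitor, the practical follow-up to an unhealthy health set is to list the monitors dragging it down. A sketch (EX01 and the OWA health set are placeholders):
# Monitors in one health set that are currently not healthy
Get-ServerHealth -Identity EX01 -HealthSet OWA |
    Where-Object { $_.AlertValue -ne 'Healthy' } |
    Format-Table Name, TargetResource, AlertValue -AutoSize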
Health Groups
The external view of managed availability is composed of health groups. Health groups are exposed to System Center
Operations Manager 2012 R2.
There are four primary health groups:
 Customer Touch Points Components that affect real-time user interactions, such as protocols, or the Information
Store.
 Service Components Components without direct, real-time user interactions, such as the Microsoft Exchange
Mailbox Replication service, or the offline address book generation process (OABGen).
 Server Components The physical resources of the server, such as disk space, memory and networking.
 Dependency Availability The server’s ability to access necessary dependencies, such as Active Directory, DNS, etc.
When the Exchange Management Pack is installed, System Center Operations Manager (SCOM) acts as a health portal for
viewing information related to the Exchange environment. The SCOM dashboard includes three views of Exchange server
health:
 Active Alerts Escalation Responders write events to the Windows event log that are consumed by the monitor within
SCOM. These appear as alerts in the Active Alerts view.
 Organization Health A roll up summary of the overall health of the Exchange organization health is displayed in this
view. These rollups include displaying health for individual database availability groups, and health within specific
Active Directory sites.
 Server Health Related health sets are combined into health groups and summarized in this view.
Overrides
Overrides provide an administrator with the ability to configure some aspects of the managed availability probes, monitors,
and responders. Overrides can be used to fine tune some of the thresholds used by managed availability. They can also be
used to enable emergency actions for unexpected events that may require configuration settings that are different from the
out-of-box defaults.
Overrides can be created and applied to a single server (this is known as a server override), or they can be applied to a group
of servers (this is known as a global override). Server override configuration data is stored in the Windows registry on the
server on which the override is applied. Global override configuration data is stored in Active Directory.
Overrides can be configured to last indefinitely, or they can be configured for a specific duration. In addition, global overrides
can be configured to apply to all servers, or only servers running a specific version of Exchange.
When you configure an override, it will not take effect immediately. The Microsoft Exchange Health Manager service checks
for updated configuration data every 10 minutes. In addition, global overrides will be dependent on Active Directory
replication latency.
For detailed steps to view or configure server or global overrides, see Configure managed availability overrides.
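After creating an override, you can confirm it was recorded with the matching Get- cmdlets, keeping in mind that the Health Manager service only picks the change up on its next 10-minute configuration pass. A sketch (EX01 is a placeholder):
# Overrides stored in Active Directory (apply to groups of servers)
Get-GlobalMonitoringOverride | Format-Table ItemType, PropertyName, PropertyValue -AutoSize
# Overrides stored in the local registry of a single server
Get-ServerMonitoringOverride -Server EX01 | Format-Table ItemType, PropertyName, PropertyValue -AutoSize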
Management Tasks and Cmdlets
There are three primary operational tasks that administrators will typically perform with respect to managed availability:
 Extracting or viewing system health
 Viewing health sets, and details about probes, monitors and responders
 Managing overrides
The two primary management tools for managed availability are the Windows Event Log and the Shell. Managed availability
logs a large amount of information in the Exchange ActiveMonitoring and ManagedAvailability crimson channel event logs,
such as:
 Probe, monitor, and responder definitions, which are logged in the respective *Definition event logs.
 Probe, monitor, and responder results, which are logged in the respective *Results event logs.
 Details about responder recovery actions, including when the recovery action is started and when it is considered complete
(whether successful or not), which are logged in the RecoveryActionResults event log.
There are 12 cmdlets used for managed availability, which are described in the following list.
Get-ServerHealth: Used to get raw server health information, such as health sets and their current state (healthy or unhealthy), health set monitors, server components, target resources for probes, and timestamps related to probe or monitor start or stop times, and state transition times.
Get-HealthReport: Used to get a summary health view that includes health sets and their current state.
Get-MonitoringItemIdentity: Used to view the probes, monitors, and responders associated with a specific health set.
Get-MonitoringItemHelp: Used to view descriptions about some of the properties of probes, monitors, and responders.
Add-ServerMonitoringOverride: Used to create a local, server-specific override of a probe, monitor, or responder.
Get-ServerMonitoringOverride: Used to view a list of local overrides on the specified server.
Remove-ServerMonitoringOverride: Used to remove a local override from a specific server.
Add-GlobalMonitoringOverride: Used to create a global override for a group of servers.
Get-GlobalMonitoringOverride: Used to view a list of global overrides configured in the organization.
Remove-GlobalMonitoringOverride: Used to remove a global override.
Set-ServerComponentState: Used to configure the state of one or more server components.
Get-ServerComponentState: Used to view the state of one or more server components.
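Of the less obvious cmdlets in the list, Get-MonitoringItemIdentity is handy for seeing exactly which probes, monitors, and responders make up a health set. A sketch (the ActiveSync health set and server EX01 are placeholders):
# Enumerate the probes, monitors, and responders that belong to one health set
Get-MonitoringItemIdentity -Identity ActiveSync -Server EX01 |
    Format-Table Name, ItemType, TargetResource -AutoSize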
Make sense of the new Managed Availability feature
The new platform built into Exchange 2013 may ease your monitoring burden. Our Exchange MVP breaks down how
Managed Availability works and how it's designed to do its job.
Managed Availability is one of Exchange 2013's interesting new features. It is essentially Exchange's built-in monitoring and
remediation platform.
It's the latter part -- remediation -- that continues to confuse Exchange admins. Because Managed Availability is tightly
integrated into Exchange 2013, it will take mitigating actions whenever it discovers an issue. Actions depend on the issue,
but rebooting the server -- referred to as a "bugcheck" -- is one example.
Before diving into the nuts and bolts of Managed Availability, let's have a look at how the feature works and how it's
designed to do its job.
How Managed Availability maintains a server's health
Managed Availability uses a multi-layered approach when it comes to evaluating and maintaining a server's health. At
the foundation, a series of tests or polls (counters, for example) are executed; these tests are referred to as probes. Each
of the probes yields a result, which is in turn inspected by the second layer in Managed Availability: monitors.
As the name implies, monitors are the logic that interprets the results that the different probes yield. How a monitor
interprets the result of a probe is programmatically determined; it's based on Microsoft's experience with the product and
likely to be influenced by what the company sees is happening in Office 365.
Depending on the results of the probes, monitors will either do nothing or kick off so-called responders. Responders are
responsible for executing specific actions to try and remediate the potential issue one of the probes discovered. An example
of a responder action is restarting a service.
When the service restarts, a probe might be able to receive different -- and better -- results the next time it runs. The
monitor might then determine the service is healthy again. However, sometimes the results are still negative, which could
occur when the service stops again or because it's not performing well.
In this case of negative results, many scenarios are possible. If the monitor determines the service is still unhealthy, it might
use a different responder. This responder might then take a different action (for instance, restarting the server). The actions
taken and the order of the actions depend on the definition of the monitor and what the component is. For some
components, multiple responders will be tried before the issue escalates, so you don't have to worry that a failure in a
low-impact service will immediately cause a server reboot. Additionally, responders are throttled, so certain actions can only be taken once or
twice a day.
If none of the actions solves the issue, Managed Availability escalates it by raising an alert in the event logs, notifying an
Exchange admin that an issue exists and should be examined. This event can then be picked up by
whatever monitoring option you have, which in turn can alert the admin. The escalation of an issue is simply another responder
that is programmed to create the alert (Figure 1).
Figure 1
Health Manager Service and the worker process
Managed Availability is composed of two different services or processes, much like the new Exchange 2013 store service.
The Health Manager Service is the parent process that controls the Health Manager Worker, the child process. The worker
process is responsible for executing the different tasks Managed Availability has to perform.
The process hierarchy is clearly exposed when using Process Explorer (Figure 2), for example.
Figure 2
The Health Manager Service isn't only responsible for starting or stopping the worker process -- it also ensures the worker
process works correctly. If it finds that the process hangs (or that it didn't start), it will restart the process as needed.
In a multi-site environment, the Health Manager Service itself is also being watched -- not by a process on the server
itself, but by the Health Manager Service on another Exchange server.
A closer look at how Managed Availability stores
information
To learn Managed Availability's inner workings, Exchange admins must understand where it stores information and how to
access it.
There are a number of new features in Exchange 2013, but one feature admins should know about is Managed
Availability....
It's built into Exchange as a monitoring and remediation platform -- and it's the remediation part that can be confusing.
Once you've learned what Managed Availability is and how it looks on Exchange Server, you can understand where
Managed Availability stores information and how to access it.
Managed Availability mainly stores information for configuration and logging in two different places. The better part of its
configuration information is stored in a number of xml files in the <installdrive>:\Program
Files\Microsoft\Exchange\v15\Bin\Monitoring\Config folder.
Whenever the Health Manager Service starts up, it reads these files and uses the settings defined in them. While it's possible to alter
these files, I don't recommend it. Not only is it dangerous -- an error might prevent the service from starting up -- but most
of the settings are undocumented or poorly documented. These files should be a last resort when you need to make a
change.
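If you only want to see what's there without touching anything, listing the folder is harmless. A small sketch, assuming the ExchangeInstallPath environment variable that Exchange setup normally defines:
# Read-only look at the monitoring configuration files
$configPath = Join-Path $env:ExchangeInstallPath 'Bin\Monitoring\Config'
Get-ChildItem -Path $configPath -Filter *.xml |
    Sort-Object LastWriteTime -Descending |
    Select-Object Name, Length, LastWriteTime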
Active Directory & server registry
In addition to the local configuration files, Managed Availability also uses Active Directory and the
server's local registry to store additional configuration information. More specifically, overrides are
either stored in Active Directory -- so they're available to all servers -- or in the server's local
registry, if they only apply to the local server.
Event logs
The event logs' Crimson Channel stores information for logging. As mentioned, Managed Availability
logs every action it takes. At startup, the probe, monitor, and responder definitions are stored in the
corresponding event logs under Microsoft/Exchange/ActiveMonitoring (Figure 1).
Figure 1
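To discover the full set of channels on a server, you can list them by name pattern; this assumes the Microsoft-Exchange-ActiveMonitoring/* and Microsoft-Exchange-ManagedAvailability/* naming used elsewhere in this document.
# Crimson channels used by Managed Availability
Get-WinEvent -ListLog '*Exchange*ActiveMonitoring*', '*Exchange*ManagedAvailability*'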
Managed Availability structures the information it stores into HealthSets, Probes, Monitors and Overrides. Here's a look at each.
HealthSets. Managed Availability uses a set of hundreds of probes, monitors and responders to determine a server's health.
To keep track of all of them, each set that relates to a specific component is grouped into a so-called HealthSet. Running
the Get-ServerHealth cmdlet on an Exchange Server reveals these HealthSets (Figure 2).
Figure 2
For instance, there is a HealthSet named "ActiveSync." This HealthSet groups all of the probes, monitors and responders
responsible for monitoring and mitigating the server's ActiveSync component. To view what monitors are part of this
HealthSet, you can also use the Get-ServerHealth cmdlet and use the -HealthSet parameter to narrow down the
results (Figure 3), as such:
Get-ServerHealth <servername> -HealthSet ActiveSync
Figure 3
Probes. Probes query or test a specific component on the server. There are a number of probe types, ranging from simple
ones, which will fetch the value of a specific performance counter, to more complex ones, which will carry out a battery of
tests like mimicking a user's behavior. These tests are also referred to as "synthetic transactions." Probes just fetch the
information and carry out tests; they don't evaluate the results or values that are returned.
To see what probes run on a specific server, you can have a look at the ProbeDefinition event log in the Crimson Channel.
That's where the Health Manager Service writes the probes that will run on the system when it starts. The easiest way to
get the information is through the GUI, or you can use PowerShell (Figure 4).
Figure 4
The relevant portion of the information is written in XML format, but is a little more readable in Friendly View. Typically,
there are two types of information you would get from probes: what probes are running on the system and what resources
(e.g., Mailbox Database) they run on.
Having that information is the first step to understanding what actions Managed Availability might have taken to remediate
a problem it found.
Monitors. The next step to decipher Managed Availability is looking at the different monitors that exist on the system.
Similar to probes, these monitors are stored in the Crimson Channel's Monitor Definition Event log.
Monitors behave in different ways and have different stages. Every time a monitor fails, it will move to the next stage,
which will call upon a different responder to solve the issue. If the issue isn't solved after a set number of failed attempts,
it escalates to an admin. The TimeOutInSeconds values in a monitor's StateTransitionXml property define when the monitor shifts into a new stage (Figure 5).
By using the Get-WinEvent cmdlet on the MonitorDefinition Event Log, you can see the different stages that are
configured for a specific monitor; this article describes the process. The GUI is also an option, but PowerShell is more
efficient. Note that there are different stages that exist for the AutodiscoverProxyTest monitor (Figure 5).
Figure 5
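A sketch of that Get-WinEvent approach, assuming the Microsoft-Exchange-ActiveMonitoring/MonitorDefinition log name; the exact field name for the transition data may differ slightly between builds, so inspect the raw event XML if the last column comes back empty.
# Stages configured for a specific monitor (the name filter is a placeholder)
Get-WinEvent -LogName 'Microsoft-Exchange-ActiveMonitoring/MonitorDefinition' |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Where-Object { $_.Name -like '*AutodiscoverProxy*' } |
    Select-Object Name, ServiceName, StateTransitionXml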
A monitor can have multiple states, most of which aren't exposed to the admin. Typically, an admin would only see a
monitor being Healthy or Unhealthy. The other states (Degraded, Unhealthy, Unrecoverable) are hidden and only visible
through PowerShell.
Overrides. Sometimes you don't want a specific monitor to run as configured -- because it causes more trouble than it helps,
or because one of the configured defaults doesn't meet your requirements. Using monitoring overrides, you can
reconfigure a monitor to use other threshold values. For example, a popular monitoring override is used to change the
threshold Managed Availability uses to determine if there's enough free disk space left on a database log disk.
You can use the following command to reset the value to 10 GB from the default value and configure it to last for 90 days.
A duration-based override can't last indefinitely, so keep track of when an override expires so you can reconfigure
it afterward.
Add-GlobalMonitoringOverride -Identity MailboxSpace\DatabaseSizeMonitor -ItemType Monitor -PropertyName ExtensionAttributes.DatabaseLogsThreshold -PropertyValue 10GB -Duration 90.00:00:00
This command will add a global monitoring override that applies to all servers in the environment. That's also why global
overrides are stored in Active Directory. If you want the override to apply only to a single server, you should use the Add-ServerMonitoringOverride cmdlet instead.