
Exchange 2013 Managed Availability

Introduction to Managed Availability:
Part I - An Exchange Administrator‘s task?
Microsoft introduced a new built-in monitoring system called Managed Availability in Exchange 2013, which automatically
takes recovery actions for unhealthy services within the Exchange organization.
Microsoft has been operating a cloud version of Exchange since 2007 and has put that operational knowledge into Managed Availability. Managed Availability is a cloud-trained system built around the end user's experience and recovery-oriented computing.
Managed Availability doesn't mean you no longer have to monitor your on-premises or hybrid Exchange environment; in fact, it's just the opposite. The long and complex PowerShell cmdlets used to monitor Exchange (which we will look at in more detail later) are not the best and most effective method to do so.
Exchange 2013, or more precisely the Exchange Diagnostics Service (EDS), collects a lot of performance data by default. Over 3,000 performance counters are compiled over seven days. Data is collected in the folder %ExchangeInstallPath%\Logging\Diagnostics\PerformanceLogsToBeProcessed and merged into the daily performance log on a regular basis by the Microsoft Exchange Diagnostics service. The merged logs are stored under %ExchangeInstallPath%\Logging\Diagnostics\DailyPerformanceLogs as .blg files that can be opened in PerfMon. Managed Availability uses these files, among others, to track the health of system components. The performance
counters are saved for 7 days or until 5 GB of data is reached by default. You can change these settings in the file called
Microsoft.Exchange.Diagnostics.Service.exe.config located in the bin directory of your Exchange
installation path:
<add Name="DailyPerformanceLogs" LogDataLoss="True" MaxSize="5120" MaxSizeDatacenter="2048" MaxAge="7.00:00:00" CheckInterval="08:00:00" />
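To make this concrete, here is a minimal sketch of how you could peek into those .blg files with the built-in Import-Counter cmdlet. The folder path is an assumption for a default installation; adjust it to your own Exchange install path.

# List the counter sets contained in the newest daily performance log (path assumes a default install)
$logFolder = "C:\Program Files\Microsoft\Exchange Server\V15\Logging\Diagnostics\DailyPerformanceLogs"
$latestLog = Get-ChildItem -Path $logFolder -Filter *.blg | Sort-Object LastWriteTime | Select-Object -Last 1
Import-Counter -Path $latestLog.FullName -ListSet * | Select-Object -ExpandProperty CounterSetName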
Managed Availability has multiple HealthSet models that are responsible for different services, such as:
Client Protocols: OWA/ECP, ActiveSync, IMAP/POP, UM, Outlook, Compliance
Storage: DataProtection, Clustering, PublicFolders, SiteMailbox, Store
Mail Flow: FrontEndTransport, HubTransport, MailboxTransport, Deployment
Migration: MigrationMonitor, MRS
Fabric: Diskspace, MailboxSpace, ActiveDirectory, UserThrottling
The main constituents of Managed Availability are Probes, Monitors, and Responders.

Probes run every few minutes against different services, check their health, and collect data from the server. These results flow into the Monitoring component of Managed Availability. An Exchange 2013 multi-role server is defined by hundreds of probes, and in most cases these Probes are not directly discoverable. This means that most of the Probes are defined within the Exchange program code and cannot be changed. For example, customers reported that the AutoDiscoverSelfTestProbe failed when the ExternalUrl for the EWS virtual directory wasn't set, and there was no way to change the probe settings. Microsoft resolved this issue in Cumulative Update 6. The Probes write an informational event to the Microsoft.Exchange.ActiveMonitoring\ProbeResult crimson channel with the following result types (a quick way to tally them is sketched after the list):
1 = Timeout
2 = Poisoned
3 = Succeeded
4 = Failed
5 = Quarantined
6 = Rejected
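As a quick, hedged sketch (the -MaxEvents value is arbitrary because this channel is very busy), you can tally recent probe results by ResultType like this:

# Group the most recent ProbeResult events by ResultType (3 = Succeeded, 4 = Failed, ...)
$probeResults = Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -MaxEvents 500
($probeResults | ForEach-Object { [xml]$_.ToXml() }).event.userData.eventXml | Group-Object ResultType | Sort-Object Name | Format-Table Name, Count -AutoSize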
Probes are divided into three categories:
Recurring Probes: system-performed tests of the end-to-end user experience, such as OWA connectivity.
Notifications: components that perform their own monitoring outside the health manager framework by writing probe results directly. For example, the MSExchangeDAGMgmt service logs probe results without Managed Availability.
Checks: collect data from performance counters and log events if the defined thresholds are exceeded or not met.
Monitors are the central part of Managed Availability. All collected server data is examined to determine if action needs
to be taken based on a predefined rule set within the Monitors. Nearly all Monitors collect three types of data:
Direct notifications: Monitors become Unhealthy if a direct notification, for example from a service, changes the Monitor state
Probe results: Monitors become Unhealthy if a Probe fails
Performance counters: Monitors become Unhealthy if a performance counter is higher or lower than the defined threshold
Depending on the issue, a Monitor can either initiate a Responder or escalate the issue via an event log entry. Monitors can be in the following states:
Healthy: all collected Probe data is within the normal range
Unhealthy: an issue was detected; a recovery process has started or the issue has been escalated
Degraded: the Monitor has been in an unhealthy state for less than 60 seconds
Disabled: the Monitor was manually disabled by an administrator
Unavailable: the Microsoft Exchange Health Manager service doesn't get a query response from the Monitor
Repairing: set to inform Managed Availability (or a monitoring software) that corrective actions are in progress
Many Monitors have high thresholds of multiple Probe failures before becoming Unhealthy, to avoid incorrect recovery actions being taken by Managed Availability and the Responders. For problems that require manual intervention, take a look at the Microsoft.Exchange.ManagedAvailability\Monitoring crimson channel.
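A minimal sketch of that check, assuming you run it in the Exchange Management Shell on the server in question:

# Show the latest entries Managed Availability wrote to the Monitoring crimson channel
Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/Monitoring -MaxEvents 20 | Select-Object TimeCreated, Id, LevelDisplayName, Message | Format-List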
Responders act on the alerts generated by an Unhealthy Monitor and perform recovery actions, such as resetting an IIS application pool, initiating a database failover, or restarting a server. Managed Availability uses the following Responder types:
Restart Responder: terminates and restarts a service
Reset AppPool Responder: recycles an IIS AppPool
Failover Responder: performs a mailbox database or server failover
Bugcheck Responder: initiates a bugcheck of the server (forcing a reboot)
Offline Responder: takes a protocol on a server, such as MAPI/HTTP, out of service, so that it rejects client requests
Online Responder: takes a protocol on a server, such as MAPI/HTTP, back into production, so that it accepts client requests again
Escalate Responder: writes an event log entry to inform an administrator
Specialized Component Responders: specialized Responders that are unique to their component
If you would like to take a look at all recovery actions through the Managed Availability Responders, view the
Microsoft.Exchange.ManagedAvailability\RecoveryActionResults crimson channel.
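As a quick preview of what Part II covers in more depth, a hedged one-off check of the most recent recovery actions could look like this:

# List the most recent recovery actions carried out by Responders on this server
Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults -MaxEvents 10 | ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } | Format-Table Id, RequestorName, State, Result, EndTime -AutoSize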
This concludes part one of this article. In the second part, we will take a more practical approach to Managed Availability.
By using PowerShell we will show you how you can retrieve useful information from the massive amounts of data that
Managed Availability collects about your environment.
Part II - How to Check, Recover, and Maintain your Exchange Organization
Now that you've finished Part I of my three-part Managed Availability blog series, I will go a bit deeper and provide
some examples about the functionality and operability of Managed Availability. My virtual test lab contains a two-member
DAG based on Windows Server 2012 and Exchange 2013 CU6.
Identify Unhealthy Health Sets and their error description
To get the server state, run the following cmdlet within the Exchange Management Shell:
Get-HealthReport -Server <ServerName> | where {$_.AlertValue -ne "Healthy" -and $_.AlertValue -ne "Disabled"}
This cmdlet shows multiple HealthSets, which are Unhealthy. In this example, let’s take a look at the
HealthSet Clustering, which has 5 Monitors.
Note: the property “NotApplicable” shows whether Monitors have been disabled by Set-ServerComponentState for their
component. Most Monitors are not dependent on this, and thereby report “NotApplicable.”
Because the Clustering HealthSet has 5 Monitors, we check which Monitors are in an Unhealthy state:
Get-ServerHealth -Identity <ServerName> -HealthSet Clustering
The Monitor ClusterGroupMonitor is in an Unhealthy state. To get all the information, especially the appropriate Probe, take a look at the Event Viewer data in a readable output with the following cmdlet:
(Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "*ClusterGroupMonitor*"}
This output has two important values that identify the "real problem" of the Clustering HealthSet:
SampleMask: defines the substring that results of the Probe ClusterGroupProbe for the Monitor ClusterGroupMonitor carry in their name, in this case "ClusterGroupProbe\MSExchangeRepl"
ScenarioDescription: shows more information about the issue
From the output above, we learn that the scenario being validated is "Validate HA health is not impacted by cluster related issues," and that is what we want to fix.
You can re-run some Probe checks with the cmdlet Invoke-MonitoringProbe <HealthSet>\<ProbeName> -Server <ServerName> | fl
Note: For reference, you can take a look at the Exchange 2013 Management Pack Health Sets: http://technet.microsoft.com/en-us/library/dn195892(v=exchg.150).aspx
Important: this cmdlet is only available if your Exchange servers are configured for time zones UTC and UTC-. The cmdlet doesn't work with time zones UTC+ (hopefully Microsoft will fix this issue in the near future).
Let’s take a further look at the Probe configuration for the ClusterGroupProbe:
(Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "*ClusterGroupProbe*"}
The next step is to identify the complete error message so that every administrator knows what he or she has to do:
$Errors = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='ClusterGroupProbe/MSExchangeRepl'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
$Errors | select -Property *time,result*,error*,*context
Result: the quorum resource "Cluster Group" is not online on server "xsrvmail2." Database Availability Group "E2K13-TAP" may not be reachable or may have lost redundancy.
Why couldn't Managed Availability resolve this issue on its own?
Managed Availability is a "self-healing" component of Exchange 2013. As described in the steps above, Responders are responsible for trying to repair the Exchange organization on its own, without administrator intervention. Let's take a look at which Responders are relevant for the Unhealthy Clustering HealthSet:
To display all Probes, Monitors, and Responders of the HealthSet Clustering, run the following cmdlet in the Exchange
Management Shell:
Get-MonitoringItemIdentity -Identity Clustering -Server <ServerName> | ft name,itemtype,targetresource -AutoSize
You can see 3 Escalate Responders, based on the “Name” attribute:
ClusterEndpointEscalate
ClusterServiceCrashEscalate
ClusterHangEscalate
To identify the correct Responder for our Monitor ClusterGroupMonitor, run the following cmdlet in the Exchange
Management Shell:
$DefinedResponders = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml
$DefinedResponders | ? {$_.AlertMask -like "*ClusterGroupMonitor*"} | fl Name,AlertMask,EscalationSubject,EscalationMessage,UpdateTime
As you can see in the screenshot above, the appropriate Responder is called ClusterGroupEscalate, shown with the parameters Name, AlertMask, EscalationSubject, EscalationMessage, and UpdateTime.
Remember: Escalate Responders write an entry to the Event Viewer to inform an administrator. This means that issues with the HealthSet Clustering cannot be recovered automatically through Managed Availability.
For completeness, let’s make an example with the HealthSet OWA.Protocol:
As you can see in the screenshot above, there are many more Responder types here than for the HealthSet Clustering.
To identify the correct Responder for our Monitor OwaSelfTestMonitor with all necessary information, run the following
cmdlet:
$DefinedResponders = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml
$DefinedResponders | ? {$_.AlertMask -like "*OwaSelfTestMonitor*"} | fl Name,AlertMask,EscalationSubject,EscalationMessage,UpdateTime
You can see two Responders:
OwaSelfTestEscalate: ping request failed and an administrative intervention is needed (Escalate)
OwaSelfTestRestart: this Responder carried out a recovery action (but what exactly?)
To find out the recovery action from Responder OwaSelfTestRestart let’s grab all information about the Responder
configuration:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "*OwaSelfTestRestart*"}
As you can see in the ThrottlePolicyXml parameter, which is customizable, there are several Responder settings:
RecycleApplicationPool "MSExchangeOWAAppPool": self-explanatory
ThrottleConfig Enabled: whether the ThrottlePolicyXml is enabled (True) or disabled (False)
LocalMinimumMinutesBetweenAttempts: the minimum number of minutes between attempts on this server
LocalMaximumAllowedAttemptsInOneHour: how many actions can be taken on this server within one hour
LocalMaximumAllowedAttemptsInADay: how many actions can be taken on this server within one day
GroupMinimumMinutesBetweenAttempts: the minimum number of minutes between attempts within the DAG or array
GroupMaximumAllowedAttemptsInADay: how many actions can be taken in the DAG or array within one day
Next, you should take a look at the Microsoft-Exchange-ManagedAvailability/RecoveryActionResults crimson channel for entries. Event 500 indicates that a recovery action has begun, and event 501 indicates that the action has completed.
Note: you have a better overview if you go directly into the Event Viewer in the log name Microsoft-Exchange-ManagedAvailability/RecoveryActionResults. For specific troubleshooting, I prefer the Exchange Management Shell.
$RecoveryActionResults = Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults
$XML = ($RecoveryActionResults | Foreach-Object -Process {[XML]$_.toXml()}).event.userData.eventXml
$XML | Where-Object {$_.State -eq "Finished" -and $_.RequestorName -eq "OwaSelfTestRestart"}
Note: you can filter the log to the current day if there are too many items logged. It's easy to filter on the EndTime property, as in the following cmdlet, where RequestorName is your appropriate Responder, such as "OwaSelfTestRestart":
$XML | Where-Object {$_.State -eq "Finished" -and $_.EndTime -like "2014-08-28T18*" -and $_.RequestorName -eq "OwaSelfTestRestart"}
The screenshot above demonstrates two different recovery actions:
- The first recycled the MSExchangeOWAAppPool (dating from last year; yes, Exchange works very well).
- The second additionally created a Watson dump in parallel because the MSExchangeOWAAppPool application pool crashed.
For a general overview of all recovery actions, take a look at the Microsoft-Exchange-ManagedAvailability/RecoveryActionLogs crimson channel:
I prefer to use the Event Viewer because it is clearer. But for those who like to search for specific recovery actions or at an
individual time, feel free to use the Exchange Management Shell and create your own additional filters if you need it:
$RecoveryActionLogs = Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionLogs
$XML = ($RecoveryActionLogs | Foreach-Object -Process {[XML]$_.toXml()}).event.userData.eventXml
$XML | Where-Object {$_.State -eq "Finished" -and $_.RequestorName -eq "OwaSelfTestRestart"}
– Happy Monitoring! =)
Part III - Local Monitoring Files and Overrides
Now that you’ve finished Part I & Part II of my three part Managed Availability blog series, I will now provide some
information about local .xml monitoring files and overrides of Managed Availability.
Local Managed Availability .xml monitoring files
Some HealthSets, such as the FEP HealthSet, are defined in local .xml files. Because FEP is the Forefront Endpoint Protection service, some of you may want to disable this HealthSet on servers where there is no use for it.
Browse to %ExchangeInstallationPath%\Microsoft\Exchange\V15\Bin\Monitoring\Config, search for FEPActiveMonitoringContext.xml and open the file with an editor, such as Notepad.
Change line 12 by replacing Enabled = True with Enabled = False.
Restart the Microsoft Exchange Health Management service on the server where you modified the .xml file.
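For those who prefer to script the change, here is a hedged sketch. It assumes the attribute on line 12 is written as Enabled = "True" and that $env:ExchangeInstallPath points at your Exchange installation folder; check the file in an editor before and after running it.

# Flip the Enabled attribute in the FEP monitoring definition and restart the health manager
$configFile = Join-Path $env:ExchangeInstallPath "Bin\Monitoring\Config\FEPActiveMonitoringContext.xml"
(Get-Content $configFile) -replace 'Enabled\s*=\s*"True"', 'Enabled = "False"' | Set-Content $configFile
Restart-Service MSExchangeHM   # Microsoft Exchange Health Manager service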
Overrides
With overrides, you can change the Managed Availability monitoring thresholds and define your own settings for when Managed Availability should take action in case of errors.
There are two kinds of overrides:
Local overrides: used to customize a component on a specific server, or components that aren't available globally. For example, if you are running multiple datacenters and would like to change server components only in a specific location for individual monitoring. Local overrides are managed with the *-ServerMonitoringOverride set of cmdlets. They are stored in the registry under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides\ and are automatically updated every 10 minutes; the Microsoft Exchange Health Management service reads the changes in that registry path.
Global overrides: used to customize a component for the whole Exchange organization. They are managed with the *-GlobalMonitoringOverride set of cmdlets. Global overrides are stored in Active Directory:
CN=Overrides,CN=Monitoring Settings,CN=FM,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=Xiopia,DC=local
You can set overrides for specific Exchange versions, such as CU6 with version "15.0.995.29". Such an override, set with the ApplyVersion parameter, remains effective until the Exchange version changes.
The other method is to set overrides for a specific timeframe. With Exchange 2013 CU6 you can set overrides for a
maximum of 365 days with the Duration parameter.
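A hedged example of the duration-based variant, using an illustrative identity (substitute the HealthSet\ItemName you actually need to override) and the hypothetical server name EX01:

# Disable a single Responder on one server for 30 days
Add-ServerMonitoringOverride -Server EX01 -Identity "Clustering\ClusterGroupEscalate" -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 30.00:00:00
# Review the overrides on that server, and remove the override once the issue is resolved
Get-ServerMonitoringOverride -Server EX01
Remove-ServerMonitoringOverride -Server EX01 -Identity "Clustering\ClusterGroupEscalate" -ItemType Responder -PropertyName Enabled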
Managed Availability and server reboots
Responders only execute in the event that a monitor is marked in an Unhealthy state and will try to recover that
component. Managed Availability provides multi-stage recovery actions:
1. Restart the application pool
2. Restart the service
3. Restart the server
4. Take the server offline so that it no longer accepts traffic
There are several types of responders available:
Restart Responder, Reset AppPool Responder, Failover Responder, Bugcheck Responder, Offline Responder, Escalate Responder, and Specialized Component Responders. In this article, I will be primarily discussing Restart Responders.
Restart responders are subject to throttling policies. This means, the responder definition contains a section
“ThrottlePolicyXML” which can be overridden if desired. For example, we use the “StoreServiceKillServer” responder. To
view the definitions, use the following cmdlet via EMS:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "StoreServiceKillServer"}
There are many parameters, such as ServiceName, CreatedTime, Enabled, MaxRetryAttempts, AlertMask, and so on. The
following section from the restart responder definition is important:
ThrottlePolicyXml:
<ForceReboot ResourceName="">
<ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="720" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="1" GroupMinimumMinutesBetweenAttempts="600" GroupMaximumAllowedAttemptsInADay="4" />
</ForceReboot>
The thresholds are self-explanatory. The only difference is "Local" versus "Group": Local means a single Exchange server, Group means more than one Exchange server in your organization. You have to check and configure the settings based on your needs.
To prevent a reboot, create a local or global override:
I was looking for a “*ForceReboot*” by Managed Availability and found the following Requester:
(Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/* | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.ActionID -like "*ForceReboot*"} | ft RequesterName
ServiceHealthMSExchangeReplForceReboot
Add-GlobalMonitoringOverride -Identity Exchange\ServiceHealthMSExchangeReplForceReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 90:00:00:00
To check the configuration changes, use the following cmdlet:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "ServiceHealthMSExchangeReplForceReboot"} | ft name,enabled
This prevents the server from a force reboot in case of errors with the “ServiceHealthMSExchangeRepl” health set. Enabled
must be “0” (instead of 1).
Inform Managed Availability about the repairing process
To inform Managed Availability (and your monitoring software too) that you are in a repairing process, use the following
cmdlet, specifying the appropriate Monitor with the Name parameter:
Set-ServerMonitor -Server <ServerName> -Name Maintenance -Repairing $true
After repairing:
Set-ServerMonitor -Server <ServerName> -Name Maintenance -Repairing $false
To avoid automatic recovery actions, you should disable the managed service using Set-ServerComponentState:
Set-ServerComponentState -Component RecoveryActionsEnabled -Identity <ServerName> -State Inactive -Requester Functional
After finishing recovery you have to enable the RecoveryActionsEnabled component with the following cmdlet:
Set-ServerComponentState -Component RecoveryActionsEnabled -Identity <ServerName> -State Active -Requester Functional
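A quick, hedged sanity check after re-enabling (EX01 is a placeholder server name); the State should read Active again:

Get-ServerComponentState -Identity EX01 -Component RecoveryActionsEnabled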
Managed Availability – an administrative overview
1. Configuration of Managed Availability
After installing Exchange 2013 in production, there might be some HealthSets in an Unhealthy state.
2. AlertValue “Unhealthy”
The first step is to get a HealthReport for the entire server:
Get-HealthReport -Server <ServerName> | where {$_.AlertValue -ne "Healthy"}
Note: The property “NotApplicable” shows whether Monitors have been disabled by Set-ServerComponentState
for their component. Most Monitors are not dependent on this, and report “NotApplicable”.
Let’s take a look at HealthSet “Autodiscover.Protocol” and why it’s in an Unhealthy state.
To get all information about the Autodiscover.Protocol HealthSet, we have to analyze the Monitoring Item
Identity:
Get-MonitoringItemIdentity -Identity Autodiscover.Protocol -Server <ServerName> | ft name,itemtype,targetresource -AutoSize
We can see all appropriate Probes, Monitors, and Responders regarding the Autodiscover.Protocol HealthSet.
Because the Autodiscover.Protocol HealthSet has 9 Monitors, we check which Monitors are in an Unhealthy
state:
Get-ServerHealth -Identity <ServerName> -HealthSet Autodiscover.Protocol
The Monitor AutodiscoverSelfTestMonitor is in an Unhealthy state.
To get all the information about the appropriate Monitor (AutodiscoverSelfTestMonitor), we collect the
information directly from the Eventlog in a readable output:
(Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "AutodiscoverSelfTestMonitor"}
This output has two important values for us:
The Monitor transitions to Unhealthy within 600 seconds. The ScenarioDescription parameter tells the administrator more details about the issue: "Validate EWS health is not impacted by any issues".
In most cases you can re-run the test with the following cmdlet:
Invoke-MonitoringProbe Autodiscover.Protocol\AutodiscoverSelfTestProbe -Server <ServerName> | fl
Note that not all Probes can currently be invoked manually with the Invoke-MonitoringProbe cmdlet. We hope Microsoft will fix this in the future.
We know that EWS has some issues and go further to take a look at the Probe definition for the
AutodiscoverSelfTestMonitor:
(Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "AutodiscoverSelfTestProbe"}
You can see how the Probe is configured to test the Autodiscover.Protocol HealthSet.
The next step is to collect the events in the ProbeResult event log crimson channel and filter them. In this
example, you look for failure results related to AutodiscoverSelfTestProbe:
$Errors = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='AutodiscoverSelfTestProbe/'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
$Errors | select -Property *time,result*,error*,*context
ResultTypes:
1 = Timeout
2 = Poisoned
3 = Succeeded
4 = Failed
5 = Quarantined
6 = Rejected
You can see the complete error message at “ExecutionContext”.
Note: to identify the correct ResultName parameter, you can take a look at the event log directly:
Expand Applications and Services Logs - Microsoft - Exchange - ActiveMonitoring - ProbeResult, search for "AutodiscoverSelfTestProbe", and filter to only the Error level events. On the relevant error event, click Details and take a look at the Error line:
System.Exception: Autodiscover Service failed to return the ExternalEWSUrl at Microsoft.Exchange.Monitoring.ActiveMonitoringProbe
Result: The ExternalEWSUrl value is empty. You have to set the value (it does not have to be reachable from the Internet) to avoid Managed Availability error messages.
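A hedged sketch of the fix: set the external URL on the EWS virtual directory and confirm it. The server name and host name are placeholders for your own namespace.

# Set and verify the EWS external URL so the probe no longer finds an empty ExternalEWSUrl
Set-WebServicesVirtualDirectory -Identity "EX01\EWS (Default Web Site)" -ExternalUrl "https://mail.contoso.com/EWS/Exchange.asmx"
Get-WebServicesVirtualDirectory -Server EX01 | Format-List Identity, InternalUrl, ExternalUrl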
It's not recommended to disable the AutodiscoverSelfTestProbe and its associated Monitors and Responders, because they cover many more important tests. So don't set a global or local override against the Probes, Monitors, and Responders of the Autodiscover.Protocol HealthSet.
Update 08/26/2014: Microsoft fixed this "issue" with Exchange Server 2013 Cumulative Update 6 (CU6); it no longer matters whether the ExternalEWSUrl is set or not.
3. Local Managed Availability .xml monitoring files
Some HealthSets, such as the FEP HealthSet, are defined in local .xml files. FEP is the Forefront Endpoint Protection service, and some of you may want to disable this HealthSet on your servers.
Browse to %ExchangeInstallationPath%\Microsoft\Exchange\V15\Bin\Monitoring\Config, search for FEPActiveMonitoringContext.xml and open the file with an editor, such as Notepad.
Change line 12 by replacing Enabled = True with Enabled = False.
Restart the Microsoft Exchange Health Management service on the server where you modified the .xml file.
4. Inform Managed Availability about the repairing process
To inform Managed Availability (and, for example, SCOM) that you are in a repairing process, use the following cmdlet and specify the appropriate Monitor with the -Name parameter:
Set-ServerMonitor -Server <ServerName> -Name Maintenance -Repairing $true
After repairing:
Set-ServerMonitor -Server <ServerName> -Name Maintenance -Repairing $false
To avoid automatic recovery actions, you should disable the managed service using Set-ServerComponentState:
Set-ServerComponentState -Component RecoveryActionsEnabled -Identity <ServerName> -State Inactive -Requester Functional
After finishing recovery you have to enable the RecoveryActionsEnabled component with the following cmdlet:
Set-ServerComponentState -Component RecoveryActionsEnabled -Identity <ServerName> -State Active -Requester Functional
5. Overrides
You can set two kinds of overrides: server overrides, which apply to a single server, and global overrides, which apply to all Exchange servers within your organization. You apply overrides using the Add-ServerMonitoringOverride or Add-GlobalMonitoringOverride cmdlet.
Important: you have to disable all of the appropriate Probes, Monitors, and Responders!
Example to deactivate the Public Folder monitoring from the Microsoft KB database (Exchange 2013 CU3):
Add-GlobalMonitoringOverride -Identity "Publicfolders\PublicFolderLocalEWSLogonEscalate" -ItemType "Responder" -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.775.38"
Add-GlobalMonitoringOverride -Identity "Publicfolders\PublicFolderLocalEWSLogonMonitor" -ItemType "Monitor" -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.775.38"
Add-GlobalMonitoringOverride -Identity "Publicfolders\PublicFolderLocalEWSLogonProbe" -ItemType "Probe" -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.775.38"
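As a hedged follow-up, you can list the global overrides that are currently in place and remove one again once the underlying problem is fixed:

Get-GlobalMonitoringOverride
Remove-GlobalMonitoringOverride -Identity "Publicfolders\PublicFolderLocalEWSLogonProbe" -ItemType Probe -PropertyName Enabled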
You can also change the ExtensionAttributes of any Probe, for example for the OnPremisesInboundProxy probe:
ExtensionAttributes: 127.0.0.1 25 InboundProxyProbe <Data AddAttributions="false"> X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250 Subject:Inbound proxy probe None
Local override:
Add-ServerMonitoringOverride -Server ServerName -Identity FrontEndTransport\OnPremisesInboundProxy -ItemType Probe -PropertyName ExtensionAttributes -PropertyValue 'Exch1.contoso.com 25 InboundProxyProbe X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250 Subject:Inbound proxy probe None' -Duration 45.00:00:00
Global override:
Add-GlobalMonitoringOverride -Identity “EWS.Proxy\EWSProxyTestProbe” -ItemType Probe -PropertyName
TimeoutSeconds -PropertyValue 25 –ApplyVersion “15.0.712.24”
6. Managed Availability and server reboots
Responders only execute in the event that a monitor is marked in an Unhealthy state and will try to recover that
component. Managed Availability provides multi-stage recovery actions:
1. Restart the application pool
2. Restart the service
3. Restart the server
4. Take the server offline so that it no longer accepts traffic
There are several types of responders available:
Restart Responder, Reset AppPool Responder, Failover Responder, Bugcheck Responder, Offline Responder, Escalate Responder, and Specialized Component Responders.
Today we will talk primarily about the Restart Responders. You can also read the entire Managed Availability documentation, which is available at http://blogs.technet.com/b/exchange/archive/2012/09/21/lessons-from-the-datacenter-managed-availability.aspx.
Restart responders are subject to throttling policies. This means, the responder definition contains a section
“ThrottlePolicyXML” and you can override them, if you like. For example, we use the “StoreServiceKillServer”
responder. To view the definitions, use the following cmdlet via EMS:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "StoreServiceKillServer"}
There are many parameters, such as ServiceName, CreatedTime, Enabled, MaxRetryAttempts, AlertMask, and so on. Important for us is the following section from the restart responder definition:
ThrottlePolicyXml:
<ForceReboot ResourceName="">
<ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="720" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="1" GroupMinimumMinutesBetweenAttempts="600" GroupMaximumAllowedAttemptsInADay="4" />
</ForceReboot>
The thresholds are self-explanatory. The only difference is “Local” and “Group”. Local means one Exchange
server, group means more than one Exchange server in your organization. You have to check and configure the
setting for your needs.
To prevent a reboot, create a local or global override:
Example:
I was looking for a “*ForceReboot*” by Managed Availability and found the following Requester:
(Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/* | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.ActionID -like "*ForceReboot*"} | ft RequesterName
ServiceHealthMSExchangeReplForceReboot
Add-GlobalMonitoringOverride -Identity Exchange\ServiceHealthMSExchangeReplForceReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 60:00:00:00
To check the configuration changes, use the following cmdlet:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "ServiceHealthMSExchangeReplForceReboot"} | ft name,enabled
This prevents the server from a force reboot. Enabled must be “0” (instead of 1).
Lessons from the Datacenter: Managed Availability
Monitoring is a key component for any successful deployment of Exchange. In previous releases, we
invested heavily in developing a correlation engine and worked closely with the System Center
Operations Manager (SCOM) product group to provide a comprehensive alerting solution for Exchange
environments.
Traditionally, monitoring involved collecting data and if necessary performing an action based on the
collected data. For example, in the context of SCOM, different mechanisms were used to collect data via
the Exchange Management Pack:
Type of Monitoring | Exchange 2010
Service not running | Health manifest rule
Performance counter | Health manifest counter rule
Exchange events | Health manifest event rule
Non-Exchange events | Health manifest event rule
Scripts, cmdlets | Health manifest script rule
Test cmdlets | Test cmdlets
Exchange Server 2013 Monitoring Goals
When we started the development of Exchange 2013, a key focus area was improving end-to-end monitoring for all Exchange deployments, from the smallest on-premises deployment to the largest deployment in the world, Office 365. We had three major goals in mind:
1. Bring our knowledge and experience from the Office 365 service to on-premises customers. For nearly 6 years, the Exchange product group has been operating Exchange Online. The operations model
we use is known as the developer operations model (DevOps). In this model, issues are escalated directly
to the developer of a feature if their feature is performing badly within the service or when a customer
reports an unknown problem that is escalated. Regardless of the problem origin, escalation directly to the
developer brings accountability to the development of the software by addressing the software defects.
As a result of using this model, we have learned a lot. For example, in the Exchange 2010
Management Pack, there are almost 1100 alerts. But over the years, we found that only around
150 of those alerts are useful in Office 365 (and we disable the rest). In addition, we found that
when a developer receives an escalation, they are more likely to be accountable and work to
resolving the issue (either through a code fix, through new recovery workflows, by fine tuning the
alert, etc.) because the developer does not want to be continuously interrupted or woken up to
troubleshoot the same problem over and over again. Within the DevOps model, we have a process where every week the on-calls hold a hand-off meeting to review the past week's incidents; the result is that in-house recovery workflows are developed, such as resetting IIS
application pools, etc. Before Exchange 2013, we had our own in-house engine that handled these
recovery workflows. This knowledge never made it into the on-premises products. With Exchange
2013, we have componentized the recovery workflow engine so that the learnings can be shared
with our on-premises customers.
2. Monitor based on the end user's experience. We also learned that a lot of the methodologies we used for monitoring really didn't help us to operate the environment. As a result, we are shifting to a customer-focused vision for how we approach monitoring.
In past releases, each component team would develop a health model by articulating all the
various components within their system. For example, transport is made up of SMTP-IN, SMTP-OUT, transport agents, categorizer, routing engine, store driver, etc. The component team would
then build alerts about each of these components. Then in the Management Pack, there are alerts
that let you know that store driver has failed. But what is missing is the fact that the alert doesn’t
tell you about the end-to-end user experience, or what might be broken in that experience. So in
Exchange 2013, we are attempting to turn the model upside down. We aren’t getting rid of our
system level monitoring because that is important. But what is really important, if you want to
manage a service, is what are your users seeing? So we flipped the model, and are endeavoring
to try and monitor the user experience.
3. Protect the user's experience through recovery oriented computing. With the previous releases of
Exchange, monitoring was focused on the system and components, and not how to recover and restore
the end user experience automatically.
Monitoring Exchange Server 2013 - Managed Availability
Managed Availability is a monitoring and recovery infrastructure that is integrated with Exchange’s high
availability solution. Managed Availability detects and recovers from problems as they occur and as they
are discovered.
Managed Availability is user-focused. We want to measure three key aspects – the availability, the
experience (which for most client protocols is measured as latency), and whether errors are occurring.
To understand these three aspects, let’s look at an example, a user using Outlook Web App (OWA).
The availability aspect is simply whether or not the user can access the OWA forms-based
authentication web page. If they can’t, then the experience is broken and that will generate a help desk
escalation. In terms of experience, if they can log into OWA, what is the experience like – does the
interface load, can they access their mail? The last aspect is latency–the user is able to log in and access
the interface, but how fast is the mail rendered in the browser? These are the areas that make up the
end user experience.
One key difference between previous releases and Exchange 2013, is that in Exchange 2013 our
monitoring solution is not attempting to deliver root cause (this doesn’t mean the data isn’t recorded in
the logs, or dumps aren’t generated, or that root cause cannot be discovered). It is important to
understand that with past releases, we were never effective at communicating root cause - sometimes
we were right, often times we were wrong.
Components of Managed Availability
Managed Availability is built into both server roles in Exchange 2013. It includes three main
asynchronous components. The first component is the probe engine. The probe engine’s responsibility
is to take measurements on the server. This flows into the second component, which is the monitor. The
monitor contains the business logic that encodes what we consider to be healthy. You can think of it as
a pattern recognition engine; it is looking for different patterns within the different measurements
we have, and then it can make a decision on whether something is considered healthy or not. Finally,
there is the responder engine, which I have labeled as Recover in the diagram below. When something is unhealthy, its first action is to attempt to recover that component. Managed Availability provides multi-stage recovery actions: the first attempt might be to restart the application pool, the second attempt might be to restart the service, the third attempt might be to restart the server, and the final attempt may be to take the server offline so that it no longer accepts traffic. If these attempts fail, Managed Availability then escalates the issue to a human through event log notification.
You may also notice we have decentralized a few things. In the past we had the SCOM agent on each
server and we’d stream all measurements to a central SCOM server. The SCOM server would have to
evaluate all the measurements to determine if something is healthy or not. In a high scale environment,
complex correlation runs hot. What we noticed is that alerts would take longer to fire, etc. Pushing
everything to one central place didn’t scale. So instead we have each individual server act as an island –
each server executes its own probes, monitors itself, and takes action to self-recover, and of course,
escalate if needed.
Figure 1: Components of Managed Availability
Probes
The probe infrastructure is made up of three distinct frameworks:
1. Probes are synthetic transactions that are built by each component team. They are similar to the test
cmdlets in past releases. Probes measure the perception of the service by executing synthetic end-to-end
user transactions.
2. Checks are a passive monitoring mechanism. Checks measure actual customer traffic.
3. The notify framework allows us to take immediate action rather than wait for a probe to execute. That way, if we detect a failure, we can act on it right away. The notify framework is based on notifications. For example, when a certificate expires, a notification event is triggered, alerting operations that the certificate needs to be renewed.
Monitors
Data collected by probes are fed into monitors. There isn’t necessarily a one-to-one correlation between
probes and monitors; many probes can feed data into a single monitor. Monitors look at the results of
probes and come to a conclusion. The conclusion is binary – either the monitor is healthy or it is
unhealthy.
As mentioned previously, Exchange 2013 monitoring focuses on the end user experience. To accomplish
that we have to monitor at different layers within the environment:
Figure 2: Monitoring at different layers to check user experience
As you can see from the above diagram, we have four different checks. The first check is the Mailbox
self-test; this probe validates that the local protocol or interface can access the database. The second
check is known as a protocol self-test and it validates that the local protocol on the Mailbox server is
functioning. The third check is the proxy self-test and it executes on the Client Access server, validating
that the proxy functionality for the protocol is functioning. The fourth and last check is the all-inclusive
check that validates the end-to-end experience (protocol proxy to the store functions). Each check
performs detection at different intervals.
We monitor at different layers to deal with dependencies. Because there is no correlation engine in
Exchange 2013, we try to differentiate our dependencies with unique error codes that correspond to
different probes and with probes that don’t include touching dependencies. For example, if you see a
Mailbox Self-Test and a Protocol Self-test probe both fail at the same time, what does it tell you? Does
it tell you store is down? Not necessarily; what it does tell you is that the local protocol instance on the
Mailbox server isn’t working. If you see a working Protocol Self-Test but a failed Mailbox Self-Test, what
does that tell you? That scenario tells you that there is a problem in the “storage” layer which may
actually be that the store or database is offline.
What this means from a monitoring perspective, is that we now have finer control over what alerts are
issued. For example, if we are evaluating the health of OWA, we are more likely to delay firing an alert
in the scenario where we have a failed Mailbox Self-Test, but a working Protocol Self-Test; however an
alert would be fired if both the Mailbox Self-Test and Protocol Self-Test monitors are unhealthy.
Responders
Responders execute responses based on alerts generated by a monitor. Responders never execute
unless a monitor is unhealthy.
There are several types of responders available:
Restart Responder: terminates and restarts a service
Reset AppPool Responder: cycles an IIS application pool
Failover Responder: takes an Exchange 2013 Mailbox server out of service
Bugcheck Responder: initiates a bugcheck of the server
Offline Responder: takes a protocol on a machine out of service
Escalate Responder: escalates an issue
Specialized Component Responders
The offline responder is used to remove a protocol from use on the Client Access servers. This responder
has been designed to be load balancer-agnostic. When this responder is invoked, the protocol will not
acknowledge the load balancer health check, thereby enabling the load balancer to remove the server
or protocol from the load balancing pool. Likewise, there is a corresponding online responder that is
automatically initiated once the corresponding monitor becomes healthy again (assuming there are no
other associated monitors in an unhealthy state) – the online responder simply allows the protocol to
respond to the load balancer health check, which enables the load balancer to add the server or protocol
back into the load balancer pool. The offline responder can also be invoked manually via the Set-ServerComponentState cmdlet. This enables administrators to manually put Client Access servers into maintenance mode.
When the escalate responder is invoked, it generates a Windows event that the Exchange 2013
Management Pack recognizes. It isn’t a normal Exchange event. It’s not an event that says OWA is broken
or we've had a hard IO. It's an Exchange event that says a health set is unhealthy or healthy. We use
single instance events like that to manipulate the monitors inside SCOM. And we’re doing this based on
an event generated in the escalate responder as opposed to events spread throughout the product.
Another way to think about it is a level of indirection. Managed Availability decides when we flip a
monitor inside SCOM. Managed Availability makes the decision as to when an escalation should occur,
or in other words, when a human should get engaged.
Responders can also be throttled to ensure that the entire service isn’t compromised. Throttling differs
depending on the responder:
Some responders take into account the minimum number of servers within the DAG or load balanced CAS pool
Some responders take into account the amount of time between executions.
Some responders take into account the number of occurrences that the responder has been initiated.
Some may use any combination of the above.
Depending on the responder, when throttling occurs, the responder’s action may be delayed or simply
skipped.
Recovery Sequences
It is important to understand that the monitors define the types of responders that are executed and the
timeline in which they are executed; this is what we refer to as a recovery sequence for a monitor. For
example, let’s say the probe data for the OWA protocol (the Protocol Self-Test) triggers the monitor to
be unhealthy. At this point the current time is saved (we’ll refer to this as T). The monitor starts a recovery
pipeline that is based on current time. The monitor can define recovery actions at named time intervals
within the recovery pipeline. In the case of the OWA protocol monitor on the Mailbox server, the recovery
sequence is:
1. At T=0, the Reset IIS Application Pool responder is executed.
2. If at T=5 minutes the monitor hasn’t reverted to a healthy state, the Failover responder is initiated and
databases are moved off the server.
3. If at T=8 minutes the monitor hasn’t reverted to a healthy state, the Bugcheck responder is initiated and
the server is forcibly rebooted.
4. If at T=15 minutes the monitor still hasn’t reverted to a healthy state, the Escalate responder is triggered.
The recovery sequence pipeline will stop when the monitor becomes healthy. Note that the last named
time action doesn’t have to complete before the next named time action starts. In addition, a monitor
can have any number of named time intervals.
System Center Operations Manager (SCOM)
System Center Operations Manager (SCOM) is used as a portal to see health information related to the
Exchange environment. Unhealthy states within the SCOM portal are triggered by events written to the
Application log via the Escalate Responder. The SCOM dashboard has been refined and now has three
key areas:
Active Alerts
Organization Health
Server Health
The Exchange Server 2013 SCOM Management Pack will be supported with SCOM 2007 R2 and SCOM
2012.
Overrides
With any environment, defaults may not always be the optimum condition, or conditions may exist that
require an emergency action. Managed Availability includes the ability to enable overrides for the entire
environment or on an individual server. Each override can be set for a specified duration or to apply to
a specific version of the server. The *-ServerMonitoringOverride and *-GlobalMonitoringOverride
cmdlets enable administrators to set, remove, or view overrides.
Health Determination
Monitors that are similar or are tied to a particular component’s architecture are grouped together to
form health sets. The health of a health set is always determined by the “worst of” evaluation of the
monitors within the health set – this means that if you have 9 monitors within a health set and 1 monitor
is unhealthy, then the health set is considered unhealthy. You can determine the collection of monitors
(and associated probes and responders) in a given health set by using the Get-MonitoringItemIdentity
cmdlet.
To view health, you use the Get-ServerHealth and Get-HealthReport cmdlets. Get-ServerHealth is used
to retrieve the raw health data, while Get-HealthReport operates on the raw health data and provides a
current snapshot of the health. These cmdlets can operate at several layers (a short sketch follows this list):
They can show the health for a given server, breaking it down by health set.
They can be used to dive into a particular health set and see the status of each monitor.
They can be used to summarize the health of a given set of servers (DAG members, or a load-balanced array of CAS).
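A short sketch of those three layers, following the Get-HealthReport -Server / Get-ServerHealth -Identity convention used elsewhere in this document (newer builds may expect -Identity on Get-HealthReport as well); EX01, OWA.Protocol, and DAG01 are placeholders.

# Server level: health per health set
Get-HealthReport -Server EX01
# Health-set level: status of each monitor inside one health set
Get-ServerHealth -Identity EX01 -HealthSet OWA.Protocol
# Group level: summarize health across the members of a DAG
Get-DatabaseAvailabilityGroup DAG01 | Select-Object -ExpandProperty Servers | ForEach-Object { Get-HealthReport -Server $_.Name }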
Health sets are further grouped into functional units called Health Groups. There are four Health Groups
and they are used for reporting within the SCOM Management Portal:
1. Customer Touch Points – components with direct, real-time customer interactions (e.g., OWA).
2. Service Components – components without direct, real-time customer interaction (e.g., OAB generation).
3. Server Components – physical resources of a server (e.g., disk, memory).
4. Dependency Availability – the server's ability to call out to dependencies (e.g., Active Directory).
Conclusion
Managed Availability performs a variety of health assessments within each server. These regular, periodic
tests determine the viability of various components on the server, which establish the health of the server
(or set of servers) before and during user load. When issues are detected, multi-step corrective actions
are taken to bring the server back into a functioning state; in the event that the server is not returned to
a healthy state, Managed Availability can alert operators that attention is needed.
The end result is that Managed Availability focuses on the user experience and ensures that while issues may occur, the user experience is minimally impacted, if at all.
Ross Smith IV, Principal Program Manager, Exchange Customer Experience
Greg "The All-father of Exchange HA" Thiel, Principal Program Manager Architect, Exchange Server
Exchange 2013 Health and Server Reports
Microsoft Exchange Server 2013 introduced new monitoring of the Exchange subsystem, which was also improved by
the release of CU1. This feature is known as Managed Availability. A good description of this feature can be found on the Exchange Team blog in this post: http://blogs.technet.com/b/exchange/archive/2012/09/21/lessons-from-the-datacenter-managed-availability.aspx. What my blog post concentrates on is how to access this information and make it usable to the Exchange Admin/Engineer.
In the RTM version of Exchange 2013 the commands get-serverhealth and get-healthreport were connected more closely
with one piping into the other. CU1 for Exchange 2013 now allows these PowerShell commands to be run independently
of each other.
Get-HealthReport provides the relative health of various components in Exchange and what state it is in – online, partially
online, offline, sidelined, functional, or unavailable. The healthsets available are:
Autodiscover.Protocol, ActiveSync, ActiveSync.Protocol, Autodiscover, Autodiscover.Proxy, ActiveSync.Proxy, ECP.Proxy, EWS.Proxy, OAB.Proxy, OWA.Proxy, RPS.Proxy, RWS.Proxy, Outlook.Proxy, Antimalware, AD, ECP, Ediscovery.Protocol, EDS, EventAssistants, EWS.Protocol, DataProtection, EWS, FIPS, FrontendTransport, HubTransport, HDPhoto, Monitoring, Clustering, DiskController, AntiSpam, FfoQuarantine, MailboxTransport, MSExchangeCertificateDeployment, OWA.Protocol.Dep, MailboxSpace, MailboxMigration, MRS, Network, Search, OWA.Protocol, OWA, PublicFolders, Transport, RPS, SiteMailbox, Outlook, Outlook.Protocol, Store, UM.CallRouter, UM.Protocol, UserThrottling, DAL, Security, IMAP.Protocol, Datamining, Provisioning, POP.Protocol, ProcessIsolation, TransportSync, MessageTracing, CentralAdmin, OAB, Calendaring, PushNotifications.Protocol and RemoteMonitoring.
Now let’s take a look at the output of the command:
What does this report tell us?
The report gives the admin or engineer a quick view of the Exchange 2013 server's relative health. In the sample report that was attached you can see that there were several Health Sets that show as Unhealthy. However, we don't know exactly what that means. Let's dig a bit deeper into each health set that does not show as healthy. First, let's get just that subset of Health Sets. Just run this command:
get-healthreport -server ds-l1-e2k13-01 | where {$_.alertvalue -ne "healthy"} | ft -auto
Let’s go a step further and get more information on each of these Health Sets and which tests triggered the non-Healthy
state:
$test = get-healthreport -server ds-l1-e2k13-01 | where {$_.alertvalue -ne "healthy"}
foreach ($line in $test) {$line.entries | where {$_.alertvalue -ne "healthy"} | ft -auto}
Now we have more information to work with, but still we don’t know why these tests failed. How do we get this
information?
foreach ($line in $test) {$line.entries | where {$_.alertvalue -ne "healthy"} | ft server, name, healthgroupname, alertvalue, lastexecutionresult -auto}
Now that we have all the information we can squeeze out of the Get-HealthReport command, the next step would be to
see why any of these fail. I would begin typical Exchange troubleshooting by looking at event logs as well as turning up
diagnostic logging.
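A hedged example of the "turning up diagnostic logging" step, using an illustrative component name; raise the level, reproduce the issue, review the Application log, then set it back:

Set-EventLogLevel -Identity "MSExchangeTransport\SmtpReceive" -Level High
# ...reproduce the problem and review the Application event log...
Set-EventLogLevel -Identity "MSExchangeTransport\SmtpReceive" -Level Lowest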
Related Information
Get-HealthReport
Get-ServerHealth
Technet article 1
Technet article 2
From <https://justaucguy.wordpress.com/2013/06/06/exchange-2013-health-and-server-reports-ps-part-1/>
Exchange 2013 Health and Server Reports (PS) – Part 2
In the second part of the series we’ll explore the get-serverhealth PowerShell command and what information can be
delved in Exchange 2013 with it.
So what is the get-serverhealth command good for?
For starters, let’s look at the command and its output:
get-serverhealth -identity ds-l1-e2k13-01 |ft
server,currenthealthsetstate,name,healthsetname,alertvalue,healthgroupname -auto
This produces the following output – Server Health XLSX file.
What you will notice is that there are several Health Groups listed: Customer Touch Points, Service Components, Server
Resources and Key Dependencies. Within each Health Group is a grouping of Health Sets.
Customer Touch Points
Service Components
Server Resources
Key Dependencies
Now what can we learn from these Health Sets?
From the above screenshot you can see that there are two issues with the server, having to do with AutoDiscover, and both
are set to Unhealthy. Microsoft has a set of troubleshooting steps for errors that come up in these health checks:
AutodiscoverSelfTestMonitor is unhealthy
AutoDiscover is unhealthy
If we were to examine one of these, the AutoDiscoverCTPMonitor, we can see that the 'AutoDiscover is unhealthy' article directly references our exact issue.
If we were to run the "Invoke-MonitoringProbe Autodiscover\AutodiscoverCtpProbe -Server server1.contoso.com | Format-List" command, we would hope to generate the error that this probe has detected.
From the error it seems that I have not configured my external EWS URL. This can be confirmed in PowerShell:
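A short sketch of that confirmation (the server name is taken from the examples above):

Get-WebServicesVirtualDirectory -Server ds-l1-e2k13-01 | Format-List Identity, InternalUrl, ExternalUrl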
Indeed, the error seems to be generated because I had not configured an external URL for EWS. I then set the external URL on my EWS virtual directory.
After this change was made, I’ve generated a new error:
As this server was meant for testing, multiple URLs are not configured and thus would lead me down a rabbit hole so to
speak to remove all the issues. So as you can see, these monitors are very useful for determining what is wrong with your
server.
Related Resources
Health Sets
All Health Set Troubleshooters
What Did Managed Availability Just Do To This Service?
We in the Exchange product group get this question from time to time. The first thing we ask in
response is always, “What was the customer impact?” In some cases, there is customer impact;
these may indicate bugs that we are motivated to fix. However, in most cases there was no customer
impact: a service restarted, but no one noticed. We have learned while operating the world’s largest
Exchange deployment that it is fantastic when something is fixed before customers even notice.
This is so desirable that we are willing to have a few extra service restarts as long as no customers
are impacted.
You can see this same philosophy at work in our approach to database failovers since Exchange
2007. The mantra we have come to repeat is, “Stuff breaks, but the user experience doesn’t!” User
experience is our number one priority at all times. Individual service uptime on a server is a less
important goal, as long as the user experience remains satisfactory.
However, there are cases where Managed Availability cannot fix the problem. In cases like these,
Exchange provides a huge amount of information about what the problem might be. Hundreds of
things are checked and tested every minute.
Usually, Get-HealthReport and Get-ServerHealth will be sufficient to find the problem, but this
blog post will walk you through getting the full details from an automatic recovery action to the
results of all the probes by:
1. Finding the Managed Availability Recovery Actions that have been executed for a given
service.
2. Determining the Monitor that triggered the Responder.
3. Retrieving the Probes that the Monitor uses.
4. Viewing any error messages from the Probes.
Finding Recovery Actions
Every time Managed Availability takes a recovery action, such as restarting a service or failing over a database, it logs an event in the Microsoft.Exchange.ManagedAvailability/RecoveryActionResults crimson channel. Event 500
indicates that a recovery action has begun. Event 501 indicates that the action that was taken has
completed. These can be collected via the MMC Event Viewer, but we usually find it more useful to
use PowerShell. All of these Managed Availability recovery actions can be collected in PowerShell
with a simple command:
$RecoveryActionResultsEvents = Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults
We can use the events in this format, but it is easier to work with the event properties if we use
PowerShell’s native XML format:
$RecoveryActionResultsXML = ($RecoveryActionResultsEvents | Foreach-object -Process
{[XML]$_.toXml()}).event.userData.eventXml
Some of the useful properties for this Recovery Action event are:
 Id: The action that was taken. Common values are RestartService, RecycleAppPool, ComponentOffline, or ServerFailover.
 State: Whether the action has started (event 500) or finished (event 501).
 ResourceName: The object that was affected by the action. This will be the name of a service for RestartService actions, or the name of a server for server-level actions.
 EndTime: The time the action completed.
 Result: Whether the action succeeded or not.
 RequestorName: The name of the Responder that took the action.
So for example, if you wanted to know why MSExchangeRepl was restarted on your server around 9:30PM, you could run a command like this:
$RecoveryActionResultsXML | Where-Object {$_.State -eq "Finished" -and $_.ResourceName -eq "MSExchangeRepl" -and $_.EndTime -like "2013-06-12T21*"} | ft -AutoSize StartTime,RequestorName
This results in the following output:
StartTime                    RequestorName
---------                    -------------
2013-05-12T21:49:18.2113618Z ServiceHealthMSExchangeReplEndpointRestart
The RequestorName property indicates the name of the Responder that took the action. In this
case, it was ServiceHealthMSExchangeReplEndpointRestart. Often, the responder name will
give you an indication of the problem. Other times, you will want more details.
Finding the Monitor that Triggers a Responder
Monitors are the central part of Managed Availability. They are the primary means, through Get-ServerHealth and Get-HealthReport, by which an administrator can learn the health of a server.
Recall that a Health Set is a grouping of related Monitors. This is why much of our
troubleshooting documentation is focused on these objects. It will often be useful to know what
Monitors and Health Sets are repeatedly unhealthy in your environment.
Every time the Health Manager service starts, it logs events to the Microsoft.Exchange.ActiveMonitoring/ResponderDefinition crimson channel, which we can use to get the properties of the Responders we found in the last step by the RequestorName property. First, we need to collect the Responders that are defined:
$DefinedResponders = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml
One of these Responder Definitions will match the Recovery Action’s RequestorName. The Monitor
that controls the Responder we are interested in is defined by the AlertMask property of that
Definition. Here are some of the useful Responder Definition properties:
 TypeName: The full code name of the recovery action that will be taken when this Responder executes.
 Name: The name of the Responder.
 TargetResource: The object this Responder will act on.
 AlertMask: The Monitor for this Responder.
 WaitIntervalSeconds: The minimum amount of time to wait before this Responder can be executed again. There are other forms of throttling that will also affect this Responder.
To get the Monitor for the ServiceHealthMSExchangeReplEndpointRestart Responder, you run:
$DefinedResponders | ? {$_.Name -eq "ServiceHealthMSExchangeReplEndpointRestart"} | ft -a Name,AlertMask
This results in the following output:
Name                                       AlertMask
----                                       ---------
ServiceHealthMSExchangeReplEndpointRestart ServiceHealthMSExchangeReplEndpointMonitor
Many Monitor names will give you an idea of what to look for. In this case,
the ServiceHealthMSExchangeReplEndpointMonitor Monitor does not tell you much more than
the Responder name did. The Technet article on Troubleshooting DataProtection Health Set lists
this Monitor and suggests running Test-ReplicationHealth. However, you can also get the exact
error messages of the Probes for this Monitor with a couple more commands.
Finding the Probes for a Monitor
Remember that Monitors have their definitions written to the Microsoft.Exchange.ActiveMonitoring/MonitorDefinition crimson channel. Thus, you can get these in a similar way as the Responder definitions in the last step. You can run:
$DefinedMonitors = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[xml]$_.toXml()}).event.userData.eventXml
Some useful properties of a Monitor definition are:
 Name: The name of this Monitor. This is the same name reported by Get-ServerHealth.
 ServiceName: The name of the Health Set for this Monitor.
 SampleMask: The substring that all Probes for this Monitor will have in their names.
 IsHaImpacting: Whether this Monitor should be included when HaImpactingOnly is specified by Get-ServerHealth or Get-HealthReport.
To get the SampleMask for the identified Monitor, you can run:
($DefinedMonitors | ? {$_.Name -eq 'ServiceHealthMSExchangeReplEndpointMonitor'}).SampleMask
This results in the following output:
ServiceHealthMSExchangeReplEndpointProbe
Now that we know what Probes to look for, we can search the Probes' definition channel. Useful properties for Probe Definitions are:
 Name: The name of the Probe. This will begin with the SampleMask of the Probe's Monitor.
 ServiceName: The Health Set for this Probe.
 TargetResource: The object this Probe is validating. This is appended to the Name of the Probe when it is executed to become a Probe Result ResultName.
 RecurrenceIntervalSeconds: How often this Probe executes.
 TimeoutSeconds: How long this Probe should wait before failing.
To get definitions of this Monitor's Probes, you can run:
(Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "ServiceHealthMSExchangeReplEndpointProbe*"} | ft -a Name,TargetResource
This results in the following output:
Remember, not all Monitors use synthetic transactions via Probes. See this blog post for the other
ways Monitors collect their information.
This Monitor has three Probes that can cause it to become Unhealthy. You'll see that each is named with the Monitor's SampleMask but then differentiated. When getting the Probe Results in the next step, the Probes will also have the TargetResource in their ResultName.
Now we know all the Probes that could have failed, but we don't yet know which did or why.
Getting Probe Error Messages
There are many Probes and they execute often, so the channel where they are logged
(Microsoft.Exchange.ActiveMonitoring/ProbeResult) generates a lot of data. There will often
only be a few hours of data, but the Probes we are interested in will probably have a few hundred
Result entries. Here are some of the Probe Result properties you may be interested in for
troubleshooting:
 ServiceName: The Health Set of this Probe.
 ResultName: The Name of this Probe, including the Monitor's SampleMask, an identifier of the code this Probe executes, and the resource it verifies. The target resource is appended to the Probe's name we found in the previous step. In this example, we append /MSExchangeRepl to get ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl.
 Error: The error returned by this Probe, if it failed.
 Exception: The callstack of the error, if it failed.
 ResultType: An integer that indicates one of these values:
   1: Timeout
   2: Poisoned
   3: Succeeded
   4: Failed
   5: Quarantined
   6: Rejected
 ExecutionStartTime: When the Probe started.
 ExecutionEndTime: When the Probe completed.
 ExecutionContext: Additional information about the Probe's execution.
 FailureContext: Additional information about the Probe's failure.
Some Probes may use some of the other available fields to provide additional data about failures.
We can use XPath to filter the large number of events to just the ones we are interested in: those with the ResultName we identified in the last step and with a ResultType of 4, indicating that they failed:
$replEndpointProbeResults = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
To get a nice graphical view of the Probe’s errors, you can run:
$replEndpointProbeResults | select -Property *Time,Result*,Error*,*Context,State* |
Out-GridView
In this case, the full error message for both Probe Results suggests making sure the
MSExchangeRepl service is running. This actually is the problem, as for this scenario I restarted the
service manually.
Summary
This article is a detailed look at how you have access to an incredible amount of information about
the health of Exchange Servers. Hopefully, you will not often need it! In most cases, the alerts will
be enough notification and the included cmdlets will be sufficient for investigation.
Managed Availability is built and hardened at scale, and we continuously analyze these same events
collected in this article so that we can either fix root causes or write Responders to fix more
problems before users are impacted. In those cases where you do need to investigate a problem in
detail, we hope this post is a good starting point.
From <https://blogs.technet.microsoft.com/exchange/2013/06/13/what-did-managed-availability-just-do-to-this-service/>
Customizing Managed Availability
Exchange Server 2013 introduces a new feature called Managed Availability, which is a built-in monitoring system with self-recovery capabilities.
If you're not familiar with Managed Availability, it's a good idea to read these posts:
 Lessons from the Datacenter: Managed Availability
 What Did Managed Availability Just Do To This Service?
As described in the above posts, Managed Availability performs continuous probing to detect possible problems with
Exchange components or their dependencies, and it performs recovery actions to make sure the end user experience is not
impacted due to a problem with any of these components.
However, there may be scenarios where the out-of-box settings may not be suitable for your environment. This blog post
guides you on how to examine the default settings and modify them to suit your environment.
Managed Availability Components
Let’s start by finding out which health sets are on an Exchange server:
Get-HealthReport -Identity Exch2
This produces output similar to the following:
Next, use Get-MonitoringItemIdentity to list out the probes, monitors, and responders related to a health set. For
example, the following command lists the probes, monitors, and responders included in the FrontendTransport health set:
Get-MonitoringItemIdentity -Identity FrontendTransport -Server exch1 | ft name,itemtype -AutoSize
This produces output similar to the following:
You might notice multiple probes with the same name for some components. That's because Managed Availability creates a probe for each resource. In the following example, you can see that OutlookRpcSelfTestProbe is created multiple times (one for each mailbox database present on the server).
Use Get-MonitoringItemIdentity to list the monitoring Item Identities along with the resource for which they are created:
Get-MonitoringItemIdentity -Identity Outlook.Protocol -Server exch1 | ft name,itemtype,targetresource -AutoSize
Customize Managed Availability
Managed Availability components (probes, monitors and responders) can be customized by creating an override.
There are two types of override: local override and global override. As their names imply, a local override is available only
on the server where it is created, and a global override is used to deploy an override across multiple servers.
Either override can be created for a specific duration or for a specific version of servers.
Local Overrides
Local overrides are managed with the *-ServerMonitoringOverride set of cmdlets. Local overrides are stored under
following registry path:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides\
The Microsoft Exchange Health Management service reads this registry path every 10 minutes and loads configuration
changes. Alternatively, you can restart the service to make the change effective immediately.
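For example, to restart the service from the shell (a minimal sketch, assuming the default service name MSExchangeHM for the Microsoft Exchange Health Manager service):
Restart-Service MSExchangeHM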
You would usually create a local override to:
 Customize a managed availability component that is server-specific and not available globally; or
 Customize a managed availability component on a specific server.
Global Overrides
Global overrides are managed with the *-GlobalMonitoringOverride set of cmdlets. Global overrides are stored in the
following container in Active Directory:
CN=Overrides,CN=Monitoring Settings,CN=FM,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=Contoso,DC=com
Get Configuration Details
The configuration details of most probes, monitors, and responders are stored in the respective crimson channel event log for each monitoring item identity; you should examine these first before deciding what to change.
In this example, we will explore the properties of a probe named "OnPremisesInboundProxy", which is part of the FrontendTransport health set. The following command lists the details of the OnPremisesInboundProxy probe:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | %
{[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "OnPremisesInboundProxy"}
You can also use Event Viewer to get the details of a probe definition. The configuration details of most probes are stored in the ProbeDefinition channel:
1. Open Event Viewer, and then expand Applications and Services Logs\Microsoft\Exchange\ActiveMonitoring\ProbeDefinition.
2. Click on Find, and then enter OnPremisesInboundProxy.
3. The General tab does not show much detail, so click on the Details tab; it has the configuration details specific to this probe. Alternatively, you can copy the event details as text and paste them into Notepad or your favorite editor to see the details.
Override Scenarios
Let's look at a couple of real-life scenarios and apply what we have learned so far to customize Managed Availability to our liking, starting with local overrides.
Creating a Local Override
In this example, an administrator has customized one of the Inbound Receive connectors by removing the binding of
loopback IP address. Later, they discover that the FrontEndTransport health set is unhealthy. On further digging, they
determine that the OnPremisesInboundProxy probe is failing.
To figure out why the probe is failing, you can first list the configuration details of OnPremisesInboundProxy probe.
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | %
{[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "OnPremisesInboundProxy"}
Name : OnPremisesInboundProxy
WorkItemVersion : [null]
ServiceName : FrontendTransport
DeploymentId : 0
ExecutionLocation : [null]
CreatedTime : 2013-08-06T12:54:29.7571195Z
Enabled : 1
TargetPartition : [null]
TargetGroup : [null]
TargetResource : [null]
TargetExtension : [null]
TargetVersion : [null]
RecurrenceIntervalSeconds : 300
TimeoutSeconds : 60
StartTime : 2013-08-06T12:54:36.7571195Z
UpdateTime : 2013-08-06T12:48:27.1418660Z
MaxRetryAttempts : 1
ExtensionAttributes :
<ExtensionAttributes><WorkContext><SmtpServer>127.0.0.1</SmtpServer><Port>25</Port><HeloDomain>InboundProxyProbe</HeloDomain><MailFrom Username="inboundproxy@contoso.com"/><MailTo Select="All" Username="HealthMailboxdd618748368a4935b278e884fb41fd8a@FM.com"/><Data AddAttributions="false">X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250
Subject:Inbound proxy probe</Data><ExpectedConnectionLostPoint>None</ExpectedConnectionLostPoint></WorkContext></ExtensionAttributes>
The ExtensionAttributes property above shows that the probe is using 127.0.0.1 to connect to port 25. As that is the loopback address, the administrator needs to change the SMTP server in the ExtensionAttributes property to enable the probe to succeed.
Use the following command to create a local override and change the SMTP server to the host name instead of the loopback IP address.
Add-ServerMonitoringOverride -Server ServerName -Identity FrontEndTransport\OnPremisesInboundProxy -ItemType Probe -PropertyName ExtensionAttributes -PropertyValue '<ExtensionAttributes><WorkContext><SmtpServer>Exch1.contoso.com</SmtpServer><Port>25</Port><HeloDomain>InboundProxyProbe</HeloDomain><MailFrom Username="inboundproxy@contoso.com" /><MailTo Select="All" Username="HealthMailboxdd618748368a4935b278e884fb41fd8a@FM.com" /><Data AddAttributions="false">X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250
Subject:Inbound proxy probe</Data><ExpectedConnectionLostPoint>None</ExpectedConnectionLostPoint></WorkContext></ExtensionAttributes>' -Duration 45.00:00:00
The override will be created on the specified server under the following registry path:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides\Probe
You can use the following command to verify that the override has taken effect:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | %
{[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "OnPremisesInboundProxy"}
Creating a Global Override
In this example, the organization has an EWS application that is keeping the EWS app pools busy with complex queries. The administrator discovers that the EWS app pool is recycled during long-running queries, and that the EWSProxyTestProbe probe is failing.
To find out the details of EWSProxyTestProbe, run the following:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | %
{[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "EWSProxyTestProbe"}
Next, change the timeout interval for EWSProxyTestProbe to 25 seconds on all servers running Exchange Server 2013 RTM CU2.
Use the following command to get version information for the Exchange 2013 RTM CU2 servers:
Get-ExchangeServer | ft name,admindisplayversion
Then use the following command to create a new global override:
Add-GlobalMonitoringOverride -Identity "EWS.Proxy\EWSProxyTestProbe" -ItemType Probe -PropertyName TimeoutSeconds -PropertyValue 25 -ApplyVersion "15.0.712.24"
Override Durations
Either of the above-mentioned overrides can be created for a specific duration or for a specific version of Exchange servers.
An override created with the Duration parameter is effective only for the period specified, and the maximum duration that can be specified is 60 days. For example, an override created with Duration 45.00:00:00 will be effective for 45 days from the time of creation.
A version-specific override is effective as long as the Exchange server version matches the value specified. For example, an override created for Exchange 2013 CU1 with version "15.0.620.29" will be effective until the Exchange server version changes. The override becomes ineffective if the Exchange server is upgraded to a different Cumulative Update or Service Pack.
Hence, if you need an override to remain in effect for a longer period, create it using the ApplyVersion parameter.
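Before adding new overrides, it can also be helpful to review the overrides that already exist. A minimal sketch using the corresponding Get cmdlets (the server name is a placeholder):
Get-GlobalMonitoringOverride
Get-ServerMonitoringOverride -Server ServerName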
Removing an Override
Finally, this last example shows how to remove the local override that was created for the OnPremisesInboundProxy probe.
Remove-ServerMonitoringOverride -Server ServerName -Identity
FrontEndTransport\OnPremisesInboundProxy -ItemType Probe -PropertyName
ExtensionAttributes
Conclusion
Managed Availability performs gradual recovery actions to automatically recover from failure scenarios. Overrides help you customize the configuration of the Managed Availability components to suit your environment. The steps described in this document can be used to customize Monitors and Probes as required.
Special thanks to Abram Jackson, Scott Schnoll, Ben Winzenz, and Nino Bilic for reviewing this post.
Bhalchandra Atre
From <http://blogs.technet.com/b/exchange/archive/2013/08/13/customizing-managed-availability.aspx>
Managed availability in Exchange 2013
The release of Exchange 2013 added another gem to the set of Exchange functionalities. Managed Availability is also known as Active Monitoring or Local Active Monitoring (LAM). Briefly speaking, it is a built-in Exchange monitoring system which automatically analyzes mail server components and, if it detects errors or corruption, attempts to fix them (e.g. by switching a mailbox database to another server).
Monitoring Exchange server health and performance using
managed availability
The structure of Managed Availability consists of three components:
1. Probing
2. Monitoring
3. Responder
Probing carries out multiple tests on particular mail server services (e.g. client protocols, storage, services responsible for mail flow, data migration or storage). The tests are carried out on the basis of:
 performance tests – they verify the response value of a particular service against predefined performance thresholds (e.g. what is the response time of the Exchange ActiveSync service),
 health tests – they check the status of a service (active or not responding),
 exception tests – they verify whether there are any exceptional events in running services.
The probe configuration cannot be changed. Probing may itself generate an analysis report (e.g. in the form of an entry in the event logs) or forward the results to the monitoring component. The results of probing can be found in Event Viewer:
Event Viewer -> Applications and Services Logs -> Microsoft -> Exchange -> ActiveMonitoring -> ProbeResult
Monitoring is the core component and holds a decisive role in the Managed Availability structure. It is responsible for analyzing the data gathered by the probe component and determines the action to be taken on a monitored service or Exchange component, which results in creating notifications in the event logs, or in sending information to the Responder component to execute a command (e.g. a service restart). A monitor surveys the state of a particular Exchange component and may indicate the following states:
 healthy state – indicated when the gathered data shows no anomalies for a monitored Exchange component,
 unhealthy state – indicated when there is a problem concerning a monitored component,
 degraded state – shown when the monitor indicates inappropriate behavior of a service within a time limit of 60 seconds,
 disabled state – when a monitor is disabled as a result of administrative actions,
 unavailable – a monitor is unable to analyze a component or service,
 repairing – happens when Managed Availability is attempting to repair a component.
A responder is responsible for taking actions on components analyzed by the monitoring component. Such actions include: a service or server restart, entries in event logs, an IIS reset, switching a mailbox to a different database or databases to a different server, and turning a service offline or online, which may result in rejection or acceptance of client requests by a service.
The physical structure
In a strictly technical sense Managed Availability is based on two processes:
1. msexchangehmworker.exe – this process monitors the state of Exchange 2013 components
2. msexchangehmhost.exe (Exchange Health Manager Service) – it manages worker processes
The second process (msexchangehmhost.exe) is more important; if it goes down, the whole Managed Availability component will also go down. The screenshot below presents both processes in Task Manager:
Microsoft doesn't recommend turning off any of the Managed Availability components, as it may limit the availability of some elements or affect the whole Exchange 2013 server system. However, there may be situations in which we would like to turn off one of Managed Availability's functionalities (e.g. if we suspect that it somehow affects the performance and stability of our server). We shouldn't do this by terminating the Exchange Health Manager Service, but by using the Set-ServerComponentState cmdlet. For example, to turn off the monitoring feature in Managed Availability, we need to execute the command below:
Set-ServerComponentState -Identity <server_name> -Component Monitoring -Requester Functional -State Inactive
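To confirm the change, you can query the component state afterwards (a sketch using the same placeholder server name):
Get-ServerComponentState -Identity <server_name> -Component Monitoring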
Overrides
As mentioned before, the Monitoring component analyzes the data gathered during probing. The analysis compares the gathered results with predefined thresholds (for certain service checks), which mark the line between correct and incorrect service behavior. If a component is recognized as working improperly on the basis of this analysis, an appropriate entry is recorded in the event logs, or a specific action is forwarded to the Responder, which attempts to restore the malfunctioning service to a healthy state. However, it is possible to change the predefined thresholds and the actions that are sent to the Responder; we can set values that fit our Exchange 2013 environment. These changed values are called Overrides. For example, installing a Cumulative Update on an Exchange server may cause some services to report their current state incorrectly during probing. Usually, the simplest way to restore correct probing is to restart all monitored services; in this case, setting non-standard override values will restart services when the monitoring component receives information about their improper behavior.
Override values can be set globally for the entire Exchange organization, or locally for a single server. The local override configuration is held in the local server registry:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides\
The global override configuration can be found in Active Directory:
CN=Overrides,CN=Monitoring Settings,CN=FM,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=Example,DC=com
Whether we want to configure overrides at the global or local level, we use the two following cmdlets:
Add-GlobalMonitoringOverride
Add-ServerMonitoringOverride
Log entries
Apart from its basic function of repairing Exchange services, Managed Availability also keeps logs of probing, monitoring, and responder actions. These logs can be found here:
Event Viewer -> Applications and Services Logs -> Microsoft -> Exchange -> Active Monitoring
Event Viewer -> Applications and Services Logs -> Microsoft -> Exchange -> Managed Availability
Under ActiveMonitoring we can find all the information on probe, monitor, and responder configuration, as well as the results of their activity. Under Managed Availability we can find information about all repair attempts undertaken by this component.
Health Mailboxes
In Exchange 2013, Managed Availability uses so-called Health Mailboxes to simulate user actions such as sending and receiving messages. Each of these mailboxes is associated with an Active Directory account. The Health Mailbox implementation has evolved together with Exchange 2013: before Cumulative Update 6 there was only one Health Mailbox per database and per Client Access Server (CAS); since Cumulative Update 6 there is one Health Mailbox per database on a Mailbox server, but 10 Health Mailboxes per Client Access Server. These mailboxes are kept in the Active Directory container:
In order to display Health Mailboxes in PowerShell, we type in the following cmdlet:
Get-mailbox -monitoring | ft name,database
A Health Mailbox is a simple mailbox which is associated with an Active Directory user. The display name of a user whose Health Mailbox belongs to a CAS server follows this pattern:
The screenshot below illustrates an example:
The display name for an Active Directory user with a Health Mailbox associated with a database:
HealthMailbox-server_name_MBX-data_base_name
An example:
In everyday work with Health Mailboxes there are two scenarios that may require an administrator's intervention. The first is a corrupted Health Mailbox. It may appear when the database associated with the mailbox is deleted by an administrator; the user account that refers to such a mailbox becomes "orphaned" because it is no longer connected to any object. The best solution is to delete the orphaned account in Active Directory and restart the Health Manager service.
The second scenario requiring an administrator's attention is lockouts of the Active Directory user accounts associated with Health Mailboxes. Whenever such an account is locked out, Managed Availability is unable to perform any tests that simulate Exchange user actions. A lockout results from the organization's Password and Account Lockout Policies being applied to the accounts associated with Health Mailboxes in the Monitoring Mailboxes container. Passwords for these accounts are changed by the Health Mailbox worker and are 128 characters long, which in some cases may not fulfill the password policy and will result in lockouts of these accounts (in accordance with the Account Lockout Policies). That is why Microsoft recommends not including the accounts in the Monitoring Mailboxes container in password policies. What's more, it is better not to:
 move users from the Monitoring Mailboxes container to other containers or organizational units,
 change account properties in the Monitoring Mailboxes container,
 disable accounts (in Monitoring Mailboxes) through the organization's Password and Account Lockout Policies,
 change inheritance on AD objects,
 move Health Mailboxes between databases,
 put quotas on Health Mailboxes,
 in the case of retention policies, delete data in Health Mailboxes before at least 30 days have passed.
The usage of Managed Availability
Type the following command in the Exchange Management Shell (EMS) to verify the status of particular components in the Exchange organization:
Get-HealthReport -Identity Exchange_server_name
As we can observe in the screenshot below, Get-HealthReport displays the status of some of the HealthSets. A single health set is a list of probes, monitors and responders, organized into a logical set which addresses a particular service or component of the Exchange server.
In order to show all health sets with the Unhealthy status, execute the following command:
Get-HealthReport -Identity server_name | Where-Object {$_.AlertValue -eq 'Unhealthy'}
The displayed HealthSet called MailboxTransport is shown as Unhealthy. We want to check which one of the monitors
reports this status using the command below:
Get-ServerHealth -Identity server_name -HealthSet healthSet_name
The monitor called Mapi.Submit.Monitor is the one responsible for the status of the health set which refers
to MailboxTransport.
To verify the configuration of Mapi.Submit.Monitor, we should display the records in the event log called ActiveMonitoring/MonitorDefinition. We can look for this data through the Event Viewer graphical interface or simply use the following command (recommended):
(Get-WinEvent -ComputerName server_name -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -eq "monitor_name"}
Now we want to check which probe feeds data into this monitor. To do that, we should look at the value of the property called SampleMask; in this case it is Mapi.Submit.Probe. Next, using the event logs, we extract all error entries concerning this particular probe (Mapi.Submit.Probe). To achieve this we will use this command:
$errRecords = (Get-WinEvent -ComputerName domainA-mail -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='Name/ResourceType'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
We will need the Name/ResourceType value for the above command. Let's use:
Get-MonitoringItemIdentity -Identity HealthSet_name -server server_name | select
HealthSetName,Name,TargetResource,ItemType
The Get-MonitoringItemIdentity cmdlet displays the probes, monitors and responders associated with a particular health set.
The screenshot above shows that the Name/ResourceType value becomes simply Mapi.Submit.Probe, as this probe is not associated with any ResourceType. Therefore, the command that gathers all error entries from the event logs connected with Mapi.Submit.Probe will look like this:
$errRecords = (Get-WinEvent -ComputerName domainA-mail -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='Mapi.Submit.Probe'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
In order to display the error that caused the MailboxTransport health set to become Unhealthy, we should filter $errRecords with the following command:
$errRecords | select -Property *time,result*,error*,*context
The above screenshot indicates that the issues are caused by delays between the Store and Submission components during the test sending of a message.
Let's check what repair method Managed Availability undertakes. It is important to check which responder is connected with Mapi.Submit.Monitor. In this case let's use the command:
(Get-WinEvent -ComputerName server_name -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml | ? {$_.AlertMask -like "*monitor_name*"} | fl Name,AlertMask,EscalationSubject,EscalationMessage,UpdateTime
The responder we were looking for is Mapi.Submit.EscalateResponder, as suggested by the screenshot above. This type of responder (Escalate) doesn't make Managed Availability undertake any automatic repairs, but is responsible for logging notifications in the event logs.
The bottom line
Managed Availability is a powerful component that provides automatic monitoring, appropriate log entries, and repair of improperly working components and services in Exchange 2013. After Exchange server is installed, Managed Availability doesn't require any configuration to work. However, there are situations which require administrators to change the default settings in order to neutralize incorrect reporting and automatic attempts to repair services and components that are actually working properly. Managed Availability processes a huge amount of data, which makes it hard for an administrator to extract specific information. As we have shown, it can be done (though not easily). Such analysis helps in better understanding the monitoring processes, what they consist of, and most importantly what to do when Exchange 2013 starts to work improperly.
Suggested reading:
 Read about repairing mailboxes in Exchange with New-MailboxRepairRequest
 How to manage Role Based Access Control in Exchange 2013?
 User mailbox and shared mailbox auditing in Exchange 2013
 Check out CodeTwo software for Exchange and Office 365
Managed availability in Exchange 2013 by Adam the 32-bit Aardvark
From <https://www.codetwo.com/admins-blog/managed-availability-in-exchange-2013/>
Managed Availability HealthSet Troubleshooting
Knowing how to deal with annoying HealthSets was scary at the beginning of my experience with Exchange 2013.
The newly introduced feature called Managed Availability is a built-in monitoring system that can take recovery actions, and those actions can cause serious issues while trying to solve a small one.
There are 3 types of components that can be related to HealthSets:
 Probe: used to determine whether Exchange components are active.
 Monitor: when probes signal a different state than the one stored in the patterns of the monitoring engine, the monitoring engine determines whether a component or feature is unhealthy.
 Responder: takes action when a monitor alerts the responder about an unhealthy state. Responders take different actions depending on the type of component or feature; actions can start with just recycling an application pool and can go as far as restarting the server or, even worse, putting the server offline so it won't accept any connections.
In this blog post I will talk about troubleshooting different HealthSets in Exchange 2013.
Troubleshooting Exchange HealthSet MailboxSpace
We should start by getting a health report for the Exchange 2013 server using the Get-HealthReport cmdlet:
Get-HealthReport -Identity EXCH2K13
Image 1
If you want to list only those HealthSets that are Unhealthy, Degraded, or Disabled, you can use this command:
Get-HealthReport -Server EXCH2K13 | where { $_.alertvalue -ne "Healthy" }
Let's list a couple of these components for the MailboxSpace HealthSet
Get-MonitoringItemIdentity -Identity MailboxSpace -Server EXCH2K13 | ft
Identity,ItemType,TargetResource -autosize
Image 2
As you can see, the HealthSet has multiple Probes, Monitors, and Responders.
What if you have a HealthSet with a status of Unhealthy or Repairing, like MailboxSpace for a test DB?
We need to investigate further to check which monitors are causing the HealthSet to go into an Unhealthy state.
Get-ServerHealth -Identity EXCH2K13 -HealthSet "MailboxSpace"
Image 3
As you can see above, a lot of Monitors are Unhealthy.
Assume that in your production environment you have a test DB located on the C: drive that you don't want to move or delete, but because of the limited space available on C: you are getting these Unhealthy monitors.
We can use Add-ServerMonitoringOverride to disable these monitors.
Add-ServerMonitoringOverride
http://technet.microsoft.com/en-us/library/jj218628(v=exchg.150).aspx
The limitation to be aware of is the 60-day maximum duration for a server override:
Add-ServerMonitoringOverride -Duration 60.00:00:00 -Identity ProbeMonitorResponderName -ItemType Monitor -PropertyName Enabled -PropertyValue 0
Using the result we got in Image 2 with Get-MonitoringItemIdentity and combining that with Get-ServerHealth, we can identify the Monitors that need to be overridden.
We have the following Monitors in an Unhealthy or Repairing state:
MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationNotification\DB01
MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationNotification\DB02
MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationProcessingMonitor
MailboxSpace\DatabaseSizeMonitor\DB01
MailboxSpace\DatabaseSizeMonitor\DB02
MailboxSpace\StorageLogicalDriveSpaceMonitor\C:
Add-ServerMonitoringOverride -ItemType Monitor -Identity
"MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationNotification\DB01" -PropertyValue
0 -PropertyName Enabled -Duration "60.00:00:00" -Server EXCH2K13
Add a server override for each of the Monitors above (a sketch of a loop is shown below); please make sure the ItemType is correct, whether it's a Probe, Monitor, or Responder.
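A loop like the following saves typing each override by hand (a sketch; the identities are copied from the list above, 60 days is the maximum duration, and the server name matches the earlier examples):
$monitorsToDisable = @(
    "MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationNotification\DB01",
    "MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationNotification\DB02",
    "MailboxSpace\DatabaseLogicalPhysicalSizeRatioEscalationProcessingMonitor",
    "MailboxSpace\DatabaseSizeMonitor\DB01",
    "MailboxSpace\DatabaseSizeMonitor\DB02",
    "MailboxSpace\StorageLogicalDriveSpaceMonitor\C:"
)
# Adjust -ItemType per entry (Probe, Monitor or Responder) as noted above
foreach ($monitor in $monitorsToDisable) {
    Add-ServerMonitoringOverride -Server EXCH2K13 -Identity $monitor -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration "60.00:00:00"
}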
At the end you can verify your Server Overrides with Get-ServerMonitoringOverride
Image 4
Now we should check ServerHealth to see if the Monitors have been disabled
Get-ServerHealth -Identity EXCH2K13 -HealthSet "MailboxSpace" | ft -Autosize
Image 5
MailboxSpace HealthSet is Healthy now.
Image 6
Troubleshooting FEP HealthSet
Some of you don't have Forefront installed, so you may want to disable this HealthSet on the server.
We will achieve this simply by changing the XML file that corresponds to the FEP Health Set.
Browse to C:\Program Files\Microsoft\Exchange\V15\Bin\Monitoring\
Search for FEPActiveMonitoringContext and open the file with Notepad.
Change line 12: Enabled = "True"
Replace True with False to disable FEP monitoring.
The file should look something like this :
<?xml version="1.0" encoding="iso-8859-1"?>
<Definition xsi:noNamespaceSchemaLocation="..\..\WorkItemDefinition.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<!--FEPService Maintenance definition section-->
<MaintenanceDefinition
AssemblyPath="Microsoft.Exchange.Monitoring.ActiveMonitoring.Local.Components.dll"
TypeName="Microsoft.Exchange.Monitoring.ActiveMonitoring.FEP.FEPDiscovery"
Name="FEP.Maintenance.Workitem"
ServiceName="FEP"
RecurrenceIntervalSeconds="0"
TimeoutSeconds="30"
MaxRetryAttempts="0"
Enabled = "false">
After you modify the above line, you should restart the Microsoft Exchange Health Manager service on the server where you modified the XML file.
Troubleshooting CAS Proxy HealthSets
What if you have TMG in your organization and you need to set OWA/ECP to Basic Authentication?
You will probably disable Forms Authentication on OWA and ECP.
Soon after you have disabled Forms Authentication, you will start seeing that some server components go into an Inactive state, such as OWA.Proxy, ECP.Proxy, and RWS.Proxy.
You can check with : Get-ServerComponentState -Identity EXCH2K13
Image 7
We can set the component back to Active manually by running this cmdlet :
Set-ServerComponentState -Identity EXCH2K13 -Component EcpProxy -State Active -Requester
HealthAPI
After 1 hour the components will return to an Inactive state.
If we continue with the troubleshooting and check the crimson logs on the server, you will find events related to the ECP.Proxy probe.
More information about Crimson channel event logging can be found here
http://technet.microsoft.com/en-us/library/dd351258(v=exchg.150).aspx#Crimson
Event Viewer > Application and Services Logs > Microsoft > Exchange > ActiveMonitoring > ProbeResult
Find the event related to the Probe Result (Name=ECPProxyTestProbe/MSExchangeECPAppPool), select Details, and at StateAttribute3 you will see:
"FailurePoint=FrontEnd,HttpStatusCode=401,Error=Unauthorized,Details=,HttpProxySubErrorCode=,WebExceptionStatus=,LiveIdAuthResult="
The ECP.Proxy probe is failing with a 401 Unauthorized error; the credential used can be seen at StateAttribute2.
Verify the HealthSets for ECP and OWA:
Get-HealthReport -Server EXCH2K13
You will see that the ECP, OWA, ECP.Proxy, OWA.Proxy, and RWS.Proxy HealthSets are Unhealthy.
To remove this behavior we can disable the monitoring probes for OWA, ECP, and RWS.
Open Windows Explorer and browse to :
C:\Program Files\Microsoft\Exchange Server\V15\Bin\Monitoring\Config\
Open ClientAccessProxyTest.xml with Notepad
Change the "true" value of the following Monitoring Probes
ECPProbeEnabled = "false"
OWAProbeEnabled = "false"
ReportingProbeEnabled = "false"
Save the ClientAccessProxyTest.xml and close it
Restart the Microsoft Exchange Health Manager service on the server where you modified the XML file
Disabling the Monitoring Probes has no impact on the Exchange Servers Proxy functionality.
If you want to modify any other settings in the XML files located in Bin\Monitoring\Config\, please consult a Microsoft Exchange support engineer before making any modifications to those files.
To conclude, the problem is with the authentication method used on the ECP and OWA IIS sites: the monitoring probes can only use Forms-Based Authentication and Windows Authentication to test ECP, OWA, and RWS functionality.
I hope the information provided was helpful for you.
If you have any questions please feel free to send an email to a-crtimo@microsoft.com.
Tags ECP.Proxy Exchange 2013 HealthSet Managed Availability OWA.Proxy
From <https://blogs.technet.microsoft.com/ehlro/2014/02/20/exchange-2013-managed-availability-healthset-troubleshooting/>
Managed Availability and Server Health
Every second on every Exchange 2013 server, Managed Availability polls and analyzes hundreds of health
metrics. If something is found to be wrong, most of the time it will be fixed automatically. But of course
there will always be issues that Managed Availability won’t be able to fix on its own. In those cases,
Managed Availability will escalate the issue to an administrator by means of event logging, and perhaps
alerting if System Center Operations Manager is used in tandem with Exchange 2013. When an
administrator needs to get involved and investigate the issue, they can begin by using the Get-HealthReport and Get-ServerHealth cmdlets.
Server Health Summary
Start with Get-HealthReport to find out the status of every Health Set on the server:
Get-HealthReport -Identity <ServerName>
This will result in the following output (truncated for brevity):
Server  State         HealthSet       AlertValue LastTransitionTime MonitorCount
------  -----         ---------       ---------- ------------------ ------------
Server1 NotApplicable AD              Healthy    5/21/2013 12:23    14
Server1 NotApplicable ECP             Unhealthy  5/26/2013 15:40    2
Server1 NotApplicable EventAssistants Healthy    5/29/2013 17:51    40
Server1 NotApplicable Monitoring      Healthy    5/29/2013 17:21    9
...
In the above example, you can see that the ECP (Exchange Control Panel) Health Set is Unhealthy. And based on the value for MonitorCount, you can also see that the ECP Health Set relies on two Monitors. Let's find out if both of those Monitors are Unhealthy.
Monitor Health
The next step would be to use Get-ServerHealth to determine which of the ECP Health Set Monitors are
in an unhealthy state.
Get-ServerHealth -Identity <ServerName> -HealthSet ECP
This results in the following output:
Server  State         Name               TargetResource HealthSetName AlertValue ServerComponent
------  -----         ----               -------------- ------------- ---------- ---------------
Server1 NotApplicable EacSelfTestMonitor                ECP           Unhealthy  None
Server1 NotApplicable EacDeepTestMonitor                ECP           Unhealthy  None
As you can see above, both Monitors are Unhealthy. As an aside, if you pipe the above command to
Format-List, you can get even more information about these Monitors.
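For example (using the same placeholder server name):
Get-ServerHealth -Identity <ServerName> -HealthSet ECP | Format-List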
Troubleshooting Monitors
Most Monitors are one of these four types:
The EacSelfTestMonitor Probes along the "1" path, while the EacDeepTestMonitor Probes along the "4"
path. Since both are unhealthy, it indicates that the problem lies on the Mailbox server in either the
protocol stack or the store. It could also be a problem with a dependency, such as Active Directory, which
is common when multiple Health Sets are unhealthy. In this case, the Troubleshooting ECP Health
Set topic would be the best resource to help diagnose and resolve this issue.
From <https://blogs.technet.microsoft.com/exchange/2013/06/26/managed-availability-and-server-health/>
Managed Availability Probes
Probes are one of the three critical parts of the Managed Availability framework (monitors and
responders are the other two). As I wrote previously, monitors are the central components, and you can
query monitors to find an up-to-the-minute view of your users’ experience. Probes are how monitors
obtain accurate information about that experience.
There are three major categories of probes: recurrent probes, notifications, and checks.
Recurrent Probes
The most common probes are recurrent probes. Each probe runs every few minutes and checks some
aspect of service health. They may transmit an e-mail to a monitoring mailbox using Exchange
ActiveSync, connect to an RPC endpoint, or establish CAS-to-Mailbox server connectivity. All of these
probes are defined in the Microsoft.Exchange.ActiveMonitoring\ProbeDefinition event log channel each
time the Exchange Health Manager service is started. The most interesting properties for these events
are:
 Name: The name of the Probe. This will begin with the SampleMask of the Probe's Monitor.
 TypeName: The code object type of the probe that contains the probe's logic.
 ServiceName: The name of the Health Set for this Probe.
 TargetResource: The object this Probe is validating. This is appended to the Name of the Probe when it is executed to become a Probe Result ResultName.
 RecurrenceIntervalSeconds: How often this Probe executes.
 TimeoutSeconds: How long this Probe should wait before failing.
On a typical Exchange 2013 multi-role server, there are hundreds of these probes defined. Many probes
are per-database, so this number will increase quickly as you add databases. In most cases, the logic in
these probes is defined in code, and not directly discoverable. However, there are two probe types that
are common enough to describe in detail, based on the TypeName of the probe:
 Microsoft.Exchange.Monitoring.ActiveMonitoring.ServiceStatus.Probes.GenericServiceProbe: Determines whether the service specified by TargetResource is running.
 Microsoft.Exchange.Monitoring.ActiveMonitoring.ServiceStatus.Probes.EventLogProbe: Logs an error result if the event specified by ExtensionAttributes.RedEventIds has occurred in the ExtensionAttributes.LogName. Success results are logged if the ExtensionAttributes.GreenEventIds is logged. These probes will not work if you override them to watch for a different event.
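To see which probes of these two types are defined on a server, you can filter the ProbeDefinition channel on TypeName, following the same Get-WinEvent pattern used throughout these posts (a sketch):
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.TypeName -like "*GenericServiceProbe" -or $_.TypeName -like "*EventLogProbe"} | ft -a Name,TypeName,TargetResource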
The basics of a recurrent probe are as follows: start every RecurrenceIntervalSeconds and check (or probe)
some aspect of component health. If the component is healthy, the probe passes and writes an
informational event to the Microsoft.Exchange.ActiveMonitoring\ProbeResult channel with
a ResultType of 3. If the check fails or times out, the probe fails and writes an error event to the same
channel. A ResultType of 4 means the check failed and a ResultType of 1 means that it timed out. Many
probes will re-run if they time out, up to the MaxRetryAttempts property.
The ProbeResult channel gets very busy with hundreds of probes running every few minutes and logging
an event, so there can be a real impact on the performance of your Exchange server if you perform
expensive queries against this event channel in a production environment.
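If you do need to look at this channel on a busy server, constrain the query as much as you can, for example with an XPath filter and a cap on the number of events returned (a sketch; the probe name is just an illustration reused from earlier in this document):
Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -MaxEvents 200 -FilterXPath "*[UserData[EventXML[ResultName='OnPremisesInboundProxy'][ResultType='4']]]"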
Notifications
Notifications are probes that are not run by the health manager framework, but by some other service
on the server. These services perform their own monitoring, and then feed data into the Managed
Availability framework by directly writing probe results. You will not see these probes in the
ProbeDefinition channel, as this channel only describes probes that are run within the Managed
Availability framework.
For example, the ServerOneCopyMonitor Monitor is triggered by Probe results written by the
MSExchangeDagMgmt service. This service performs its own monitoring, determines whether there is a
problem, and logs a probe result. Most Notification probes have the capability to log both a red event
that turns the Monitor Unhealthy and a green event that makes the Monitor healthy once more.
Checks
Checks are probes that only log events when a performance counter passes above or below a defined
threshold. They are really a special type of Notification probe, as there is a service monitoring the
performance counters on the server and logging events to the ProbeResult channel when the configured
threshold is met.
To find the counter and threshold that is considered unhealthy, you can look at Monitor Definitions with a Type property of:
· Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueAboveThresholdMonitor or
· Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueBelowThresholdMonitor
This means that the probe the Monitor watches is a Check probe.
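One way to list these monitors and the thresholds they use is to filter the MonitorDefinition channel on those type names (a sketch along the lines of the earlier queries):
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.TypeName -like "*ConsecutiveSampleValue*ThresholdMonitor"} | ft -a Name,SampleMask,MonitoringThreshold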
How this works with Monitors
From the Monitor’s perspective, all three probe types are the same as they each log to the ProbeResult
channel. Every Monitor has a SampleMask property in its definition. As the Monitor executes, it looks for
events in the ProbeResult channel that have a ResultName that matches the Monitor’s SampleMask.
These events could be from recurrent probes, notifications, or checks. If the Monitor’s thresholds are
reached or exceeded, it becomes Unhealthy.
It is worth noting that a single probe failure does not necessarily indicate that something is wrong with
the server. It is the design of Monitors to correctly identify when there is a real problem that needs fixing
versus a transient issue that resolves itself or was anomalous. This is why many Monitors have thresholds
of multiple probe failures before becoming Unhealthy. Even many of these problems can be fixed
automatically by Responders, so the best place to look for problems that require manual intervention is
in the Microsoft.Exchange.ManagedAvailability\Monitoring crimson channel. These events sometimes
also include the most recent probe error message (if the developers of that Health Set view it as relevant
when they get paged with that event’s text in Office 365).
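To read that channel from the shell rather than Event Viewer, a simple query like the following can be used (a sketch; only standard Get-WinEvent record properties are shown rather than the event's XML payload):
Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/Monitoring -MaxEvents 50 | Select-Object TimeCreated,Id,Message | Format-List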
There are more details on how Monitors work, and how they can be overridden to use different
thresholds in the Managed Availability Monitors article.
Abram Jackson
Program Manager, Exchange Server
From <https://blogs.technet.microsoft.com/exchange/2014/08/11/managed-availability-probes/>
Managed Availability Monitors
Monitors are the central component of Managed Availability. They define what data to collect, what
constitutes the health of a feature, and what actions to take to restore a feature to good health. Because
there are several different aspects to Monitors, it can be hard to figure out how a specific Monitor works.
All of the properties discussed in this article can be found in the Monitor's definition event in the Microsoft.Exchange.ActiveMonitoring\MonitorDefinition crimson channel of the Windows event log.
See this article for how these definitions can be easily collected.
What Data is Collected?
Nearly all Monitors collect one of three types of data: direct notifications, Probe results, or performance
counters. Monitors that change states based on a direct notification only get data from the notification.
Monitors based on Probe results become unhealthy when some Probes fail. There are two main types
of these Monitors, those based on a number of consecutive Probe failures, and those based on a number
of Probes failing over an interval.
Monitors based on performance counters simply determine if a counter is higher or lower than the built-in defined threshold for the required time.
The TypeName property of a Monitor definition indicates what data it is collecting and the kind of threshold that must be reached before it is considered Unhealthy. Here are the most common types with what they use:
 OverallPercentSuccessMonitor: Looks at the results of all probes matching the SampleMask property and calculates the aggregate percent success over the past MonitoringIntervalSeconds. Becomes Unhealthy if the calculated percent success is less than the MonitoringThreshold.
 OverallConsecutiveProbeFailuresMonitor: Looks at the last X probe results as configured in MonitoringThreshold that match the SampleMask. Becomes Unhealthy if all of those results are failures.
 OverallXFailuresMonitor: Looks at the results of all probes matching the SampleMask property over the past MonitoringIntervalSeconds. Becomes Unhealthy if at least X results as configured in MonitoringThreshold are failures.
 OverallConsecutiveSampleValueAboveThresholdMonitor: Looks at the last X performance counter results as configured in SecondaryMonitoringThreshold matching SampleMask over the past MonitoringIntervalSeconds. Becomes Unhealthy if at least X performance counters are above the threshold configured in MonitoringThreshold.
Healthy or Not
One more thing must happen before the Monitor will become Unhealthy. The code for individual
Monitors that checks the threshold only runs every X seconds, where X is specified by
the RecurrenceIntervalSeconds property. The threshold is checked only when the Monitor runs.
As soon as the Monitor runs while the threshold is met, the Monitor becomes Unhealthy. Get-ServerHealth will report that the Monitor is Degraded for the first 60 seconds, but the functional
behavior of the Monitor does not have a concept of being Degraded; it is either Healthy or Unhealthy.
The Health Set that a Monitor is part of is defined by the Monitor’s ServiceName property. If any Monitor
is Unhealthy, the entire Health Set will be marked as Unhealthy as viewed from Get-HealthReport or
via System Center Operations Manager (SCOM).
Responder Timeline
The StateTransitionXML property of a Monitor definition indicates which Responders execute and when,
as each Responder is tied to a transition state of the Monitor. Let’s consider a Monitor that has this value
for its StateTransitionXML property:
<StateTransitions>
  <Transition ToState="Unhealthy" TimeoutInSeconds="0" />
  <Transition ToState="Unhealthy1" TimeoutInSeconds="30" />
  <Transition ToState="Unhealthy2" TimeoutInSeconds="330" />
  <Transition ToState="Unrecoverable" TimeoutInSeconds="1500" />
</StateTransitions>
As soon as the Monitor runs while its defined threshold is met, it will transition to the “Unhealthy” state.
These transition states are only used for internal consumption. Although they share a term, the Monitor
can only be Healthy or Unhealthy from an external perspective. Any Responders set to execute when
this Monitor is in this transition state will now execute. After 30 more seconds, any Responders set to
execute when the Monitor is in the “Unhealthy1” state will now execute. The next Responder will be 300
seconds later (for a total of 330 seconds) when the Monitor is set to the “Unhealthy2” state. The
transition state each Responder is tied to is set by the TargetHealthState property on a Responder
definition, which is an integer. Here are the transition states that the integer indicates:
0 = None
1 = Healthy
2 = Degraded
3 = Unhealthy
4 = Unrecoverable
5 = Degraded1
6 = Degraded2
7 = Unhealthy1
8 = Unhealthy2
9 = Unrecoverable1
10 = Unrecoverable2
We call the set of Responders tied to a Monitor's transition states a Responder chain. As a Monitor's
threshold continues to be met, stronger and stronger Responders execute until the Monitor determines
it is Healthy or an administrator is notified via event log escalation. If the code for this Monitor runs while
it is in the “Unhealthy1” state and the threshold is no longer met, the Monitor will immediately transition
to None. No more Responders will execute. Get-ServerHealth would again report this Monitor as
Healthy.
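If you want to see the Responder chain for a particular Monitor yourself, one rough approach (a sketch, not a definitive method; the Monitor name below is only a placeholder) is to match the Responders' AlertMask against the Monitor name and sort by TargetHealthState:

$monitorName = '<MonitorName>'   # placeholder: substitute a real Monitor name

# Parse the Responder definitions and keep those whose AlertMask points at the Monitor.
Get-WinEvent -LogName 'Microsoft-Exchange-ActiveMonitoring/ResponderDefinition' |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Where-Object { $_.AlertMask -like "*$monitorName*" } |
    Sort-Object { [int]$_.TargetHealthState } |
    Select-Object Name, TypeName, TargetHealthState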
Abram Jackson
Program Manager, Exchange Server
From <https://blogs.technet.microsoft.com/exchange/2013/07/16/managed-availability-monitors/>
Managed Availability Responders
Responders are the final critical part of Managed Availability. Recall that Probes are
how Monitors obtain accurate information about the experience your users are receiving.
Responders are what the Monitors use to attempt to fix the situation. Once they pass throttling,
they launch a recovery action such as restarting a service, resetting an IIS app pool, or anything else
the developers of Exchange have found often resolves the symptoms. Refer to the Responder
Timeline section of the Managed Availability Monitors article for information about when the
Responders are executed.
Definitions and Results
Just like Probes and Monitors, Responders have an event log channel for their definitions and
another for their results. The definitions can be found in Microsoft-Exchange-ActiveMonitoring/ResponderDefinition. Some of the important properties are:
 TypeName: The full code name of the recovery action that will be taken when this Responder executes.
 Name: The name of the Responder.
 ServiceName: The HealthSet this Responder is part of.
 TargetResource: The object this Responder will act on.
 AlertMask: The Monitor for this Responder.
 ThrottlePolicyXml: How often this Responder is allowed to execute. I’ll go into more details in the next section.
The results can be found in Microsoft-Exchange-ActiveMonitoring/ResponderResult. Responders
output a result on a recurring basis whether or not the Monitor indicates they should take a
recovery action. If a ResponderResult event has a RecoveryResult of 2 and IsRecoveryAttempted of
1, the Responder attempted a recovery action. Usually, you will want to skip the Responder results
and go straight to Microsoft-Exchange-ManagedAvailability/RecoveryActionResults, but let’s first
discuss the events in the Microsoft-Exchange-ManagedAvailability/RecoveryActionLogs event log channel.
Throttling
When a recovery action is attempted by a Responder, it is first checked against throttling limits.
This will result in one of two events in the RecoveryActionLogs channel: 2050, throttling has allowed
the operation, or 2051, throttling rejected the operation. Here’s a sample of a 2051 event:
In the details, you will see:
ActionId: RestartService
ResourceName: MSExchangeRepl
RequesterName: ServiceHealthMSExchangeReplEndpointRestart
ExceptionMessage: Active Monitoring Recovery action failed. An operation was rejected during local throttling. (ActionId=RestartService, ResourceName=MSExchangeRepl, Requester=ServiceHealthMSExchangeReplEndpointRestart, FailedChecks=LocalMinimumMinutes, LocalMaxInDay)
LocalThrottleResult: <LocalThrottlingResult IsPassed="false" MinimumMinutes="60" TotalInOneHour="1" MaxAllowedInOneHour="-1" TotalInOneDay="1" MaxAllowedInOneDay="1" IsThrottlingInProgress="true" IsRecoveryInProgress="false" ChecksFailed="LocalMinimumMinutes, LocalMaxInDay" TimeToRetryAfter="2015-02-11T14:29:57.9448377-08:00"> <MostRecentEntry Requester="ServiceHealthMSExchangeReplEndpointRestart" StartTime="2015-02-10T14:29:55.9920032-08:00" EndTime="2015-02-10T14:29:57.9448377-08:00" State="Finished" Result="Succeeded" /> </LocalThrottlingResult>
GroupThrottleResult: <not attempted>
TotalServersInGroup: 0
TotalServersInCompatibleVersion: 0
Hopefully, you recognize the first few fields. This is the RestartService recovery action, which
restarts a service. The ResourceName is used by the recovery action to pick a target; for
the RestartService recovery action, it is the name of the service to restart. The RequesterName is
the name of the Responder, as listed in the ResponderDefinition or ResponderResult channels.
The LocalThrottleResult property is more interesting. Recovery actions are throttled per server,
where the same recovery action cannot run too often on the same server, and per group, where
the same recovery action cannot run too often on the same DAG (for the Mailbox role) or AD site
(for the Client Access role). If a value is -1, this level of throttling is not used; for
example, MaxAllowedInOneHour is not interesting if only 1 action is allowed per day. In this
example, the MSExchangeRepl resource was already the target of a recovery action within the last
60 minutes, and so the recovery action did not pass the LocalMinimumMinutes throttling. As this
recovery action attempt was blocked by local throttling, the group throttling was not attempted.
Each of the limits mentioned in this event is described below, along with the corresponding local and group throttle configuration attribute names where applicable:
 IsPassed: True if throttling will allow the recovery action; otherwise, false.
 LocalMinimumMinutes, MinimumMinutes, GroupMinimumMinutes (local config attribute LocalMinimumMinutesBetweenAttempts; group config attribute GroupMinimumMinutesBetweenAttempts): The time that must elapse before this recovery action may act upon the same resource on this server or in this group.
 TotalInOneHour: The number of times this recovery action has acted upon this resource on this server or in this group in the last hour.
 MaxAllowedInOneHour, LocalMaxInHour (local config attribute LocalMaximumAllowedAttemptsInOneHour; no group config attribute): The number of times this recovery action is allowed to act upon this resource on this server or in this group in one hour.
 TotalInOneDay: The number of times this recovery action has acted upon this resource on this server or in this group in the last 24 hours.
 MaxAllowedInOneDay, LocalMaxInDay, GroupMaxInDay (local config attribute LocalMaximumAllowedAttemptsInADay; group config attribute GroupMaximumAllowedAttemptsInADay): The number of times this recovery action is allowed to act upon this resource on this server or in this group in 24 hours.
 IsRecoveryInProgress, RecoveryInProgress, GroupRecoveryInProgress: Whether this recovery action is already acting upon this resource and has not completed. If True, the new action will be aborted.
 TimeToRetryAfter: The time after which this recovery action would be allowed to act on this resource on this server or in this group.
The GroupThrottleResult has the same fields, and also gives details about the recovery actions
that have taken place on the other servers in the group.
If the action is not throttled, event 500 will be logged in the Microsoft-Exchange-ManagedAvailability/RecoveryActionResults channel, indicating that the recovery action is
beginning. If it succeeds, event 501 is logged. This is the most common case and where you’ll
usually want to start. These events also have details about the recovery action that was taken and
the throttling it passed. Recovery actions that start and then fail are still counted against throttling
limits. For more information about recovery actions, read the What Did Managed Availability Just
Do to This Service? article.
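A simple way to check which recovery actions have recently started or succeeded on a server is to pull those 500/501 events directly. A minimal sketch, run locally on the server (Get-WinEvent will report an error if no matching events exist yet):

# Most recent recovery action events: 500 = action started, 501 = action succeeded.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Exchange-ManagedAvailability/RecoveryActionResults'
    Id      = 500, 501
} -MaxEvents 20 |
    Select-Object TimeCreated, Id, Message |
    Format-List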
Viewing Throttling Limits
So what is the best way to find out what recovery action throttling is in place? You could wait for
the Responder to begin a recovery action and view the throttling settings in
the RecoveryActionLogs channel, but there are two places that will be more timely. The first is
the Microsoft-Exchange-ManagedAvailability/ThrottlingConfig event log channel. The second is
the Microsoft-Exchange-ActiveMonitoring/ResponderDefinition channel, introduced in the first
section of this article. The advantage of the ThrottlingConfig channel is that you can see all the
Responders that can take a particular recovery action grouped together, instead of having to check
every Responder definition. Here’s a sample event from the ThrottlingConfig event log channel:
Identity: RestartService/Default/*/*/msexchangefastsearch
RecoveryActionId: RestartService
ResponderCategory: Default
ResponderTypeName: *
ResponderName: *
ResourceName: msexchangefastsearch
PropertiesXml: <ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="4" GroupMinimumMinutesBetweenAttempts="-1" GroupMaximumAllowedAttemptsInADay="-1" />
The Identity of a throttling configuration is a concatenation of the next five fields, so let’s discuss
each. The RecoveryActionId is the Responder’s throttling type. You can find this as the name of
the ThrottleEntries node in the Responder definition’s ThrottlePolicyXml property.
The ResponderCategory is unused and is always Default right now. The ResponderTypeName is
the Responder’s TypeName property. The ResourceName is the object the Responder acts on. In
this example, Responders that use the RestartService recovery action to restart the
MSExchangeFastSearch process are allowed to do so on any server up to 4 times a day, as long as it
has been 60 minutes since this recovery action last restarted it on that server. The group throttling is
not used.
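To pull these entries with PowerShell rather than Event Viewer, a hedged sketch (again assuming the hyphen/slash form of the channel name) could look like this:

# Dump the throttling configuration entries for one recovery action type.
Get-WinEvent -LogName 'Microsoft-Exchange-ManagedAvailability/ThrottlingConfig' |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Where-Object { $_.RecoveryActionId -eq 'RestartService' } |
    Select-Object Identity, ResourceName, PropertiesXml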
The second method to view throttling limits is via the Microsoft-Exchange-ActiveMonitoring/ResponderDefinition events. This will include any overrides you have in place. Here
is the value of the ThrottlePolicyXml property from a ResponderDefinition event:
<ThrottleEntries>
  <RestartService ResourceName="MSExchangeFastSearch">
    <ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="4" GroupMinimumMinutesBetweenAttempts="-1" GroupMaximumAllowedAttemptsInADay="-1" />
  </RestartService>
</ThrottleEntries>
You can see that these attribute names and values match
the ThrottlingConfig event’s PropertiesXml values.
Changing Throttling Limits
There may be times when you want recovery actions to occur more frequently or less frequently.
For example, you have a customer report of an outage and you find that a service restart would
have fixed it but was throttled, or you have a third-party application that does particularly poorly
with application pool resets. To change the throttling configuration, you can use the same Add-ServerMonitoringOverride and Add-GlobalMonitoringOverride cmdlets that work for other
Managed Availability overrides. The Customizing Managed Availability article gives a good
summary on using these cmdlets. For the PropertyName parameter, the cmdlet supports a special
syntax for modifying the throttling configuration. Instead of specifying the entire XML blob as the
override (which will work, but will be harder to read later), you can
use ThrottleAttributes.LocalMinimumMinutesBetweenAttempts,
the PropertyName. Here’s an example:
or
the
other
properties,
as
Add-GlobalMonitoringOverride -ItemType Responder -Identity
Search\SearchIndexFailureRestartSearchService –PropertyName
ThrottleAttributes.LocalMinimumMinutesBetweenAttempts -PropertyValue 240 -ApplyVersion
"15.00.1044.025"
To only allow app pool resets by the ActiveSyncSelfTestRestartWebAppPool Responder every 2
hours instead of 1, you could use the command:
Add-GlobalMonitoringOverride -ItemType Responder -Identity
ActiveSync.Protocol\ActiveSyncSelfTestRestartWebAppPool -PropertyName
ThrottleAttributes.LocalMinimumMinutesBetweenAttempts -PropertyValue 120 -ApplyVersion
"Version 15.0 (Build 1044.25)"
If you want your servers to reboot when the MSExchangeIS service crashes and cannot start, with each
server rebooting at most once a day and no more than one server in the DAG rebooting every 60
minutes, you could use the commands:
Add-GlobalMonitoringOverride -ItemType Responder -Identity Store\StoreServiceKillServer -PropertyName ThrottleAttributes.GroupMinimumMinutesBetweenAttempts -PropertyValue 60 -ApplyVersion "15.00.1044.025"
Add-GlobalMonitoringOverride -ItemType Responder -Identity Store\StoreServiceKillServer -PropertyName ThrottleAttributes.GroupMaximumAllowedAttemptsInADay -PropertyValue -1 -ApplyVersion "15.00.1044.025"
The LocalMaximumAllowedAttemptsInADay value is already 1, so each server would still reboot
at most once per day. If the override was entered correctly, the ResponderDefinition
event’s ThrottlePolicyXml value will be updated, and there will be a new entry in the
ThrottlingConfig channel.
These may be poor examples, but it is hard to pick good ones as the Exchange developers pick
values for the throttling configuration based on our experience running Exchange in Office 365. We
don’t expect that changing these values is going to be something you’ll want to do very often, but
it is usually a better idea than disabling a monitor or a recovery action altogether. If you do have a
scenario where you need to keep a throttling limit override in place, we would love to hear about
it.
Abram Jackson
Program Manager, Exchange Server
From <https://blogs.technet.microsoft.com/exchange/2015/03/02/managed-availability-responders/>
Server Component States in Exchange 2013
Introduction
In Exchange 2013 we introduced the concept of “Server Component States”. Server Component State
provides granular control over the state of the components that make up an Exchange Server from the
point of view of the environment it is running in. This is useful when you want to take an Exchange 2013
Server out of operation partially or completely for a limited time, but still need the Exchange services on
the server to be up and running.
One example of such a situation is when “Managed Availability” (MA) comes to the conclusion that a
specific server is not healthy in some respects and therefore should be bypassed temporarily until the
bottleneck has been identified and removed. MA does so by utilizing so-called “Offline Responders”.
They are explained in some detail in Lessons from the Datacenter: Managed Availability. Their
counterpart are “Online Responders”, which bring the server back online when it is determined as being
healthy again.
Another example is when a server is being updated, for example with a new CU.
In both situations, the server cannot be taken offline completely, but it also should not be considered
a fully operational member of its Exchange organization.
The primary purpose of Managed Availability is to make the life of Exchange Administrators easier so
that they usually do not have to bother themselves with the details. However, in some situations, a certain
level of knowledge about the basic concepts behind “Server Component States” might prove to be
useful.
Verify the current State of the Server Components
A first overview of the current State of all Server Components can be displayed in the Exchange
Management Shell with the Get-ServerComponentState -Identity <ServerID> cmdlet:
You can see that the Server Components listed here do not map 1:1 to Exchange Services or processes
running on the server. Instead, they provide an abstraction-layer and display “Components” which
together make up the interfaces an Exchange Server provides to its environment. The majority of the
components have a name like “*Proxy”. These are specific to the CAS role, while other components like
“HubTransport” and “UMCallRouter” are part of the Mailbox server role and “Monitoring” and
“RecoveryActionsEnabled” belong to both roles.
In addition to the single components which can be managed individually, there’s also a component
called “ServerWideOffline”, which is used to manage the state of all components together, with the
exception of “Monitoring” and “RecoveryActionsEnabled”. For this purpose, “ServerWideOffline”
overrides individual settings for all other components. It doesn’t touch “Monitoring” and
“RecoveryActionsEnabled” because these two components need to stay active in order to keep MA
going. Without them, no “OnlineResponder” could bring “ServerWideOffline” back to “Active”
automatically.
About States and Requesters
Usually, Server Components are in one of two States: “Active” or “Inactive”. A third state, called
“Draining”, is only relevant for the Components “FrontendTransport” and “HubTransport”.
Whenever the state of a component is supposed to be changed, it has to be done by a “Requester”. For
example, the parameter –Requester is mandatory when you run the cmdlet Set-ServerComponentState:
There are altogether five “Requesters” defined:
 HealthAPI
 Maintenance
 Sidelined
 Functional
 Deployment
Requesters are labels – you can choose any of them when running Set-ServerComponentState. But each
Requester is treated and stored individually (more on this in the next section) and you should select the
Requester that best matches your intention. For example, when you need to set “ServerWideOffline” to
“Inactive” for maintenance purposes, it makes no sense to use “HealthAPI” as Requester. You might get
what you want that way in terms of functionality; but such a choice will make troubleshooting
unnecessarily complicated in case something does not work as expected, and you might get into a
conflict with MA. Whenever MA triggers an OfflineResponder or an OnlineResponder it uses “HealthAPI”
as Requester; therefore it is a good idea to consider “HealthAPI” as reserved for use by MA.
Interaction between States and Requesters
As stated above, each Requester is handled and stored individually. There’s no relationship or hierarchy
amongst them. However, in case of a conflict between two or more Requesters, “Inactive” has a higher
priority than “Active”.
Here’s a practical example:
Imagine that “ServerWideOffline” has been set to “Inactive” by two different Requesters, say “Functional”
and “Maintenance”:
Then, you set “ServerWideOffline” back to “Active” with one of the two Requesters:
As a result, “ServerWideOffline” and all dependent Components still remain in the state “Inactive”:
In order to set them back to “Active” again, Set-ServerComponentState … -State Active needs to be
executed with the second Requester as well.
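Expressed as commands, the combination above looks roughly like this (EX01 is a placeholder server name; only run this against a server you actually intend to take offline):

# Two different Requesters set ServerWideOffline to Inactive.
Set-ServerComponentState EX01 -Component ServerWideOffline -State Inactive -Requester Functional
Set-ServerComponentState EX01 -Component ServerWideOffline -State Inactive -Requester Maintenance

# Setting it back to Active with only one Requester is not enough;
# both Requesters must set it back to Active before the components leave the Inactive state.
Set-ServerComponentState EX01 -Component ServerWideOffline -State Active -Requester Functional
Set-ServerComponentState EX01 -Component ServerWideOffline -State Active -Requester Maintenance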
Obviously, administrators will rarely configure such combinations purposefully. However, we have seen
them happening as the result of a mix of processes running in the background and manual configuration.
Where is the data stored and how can it be retrieved?
Information about “Server Components”, “Requesters” and “States” is stored in two different places:
Active Directory and the server’s Registry. Storage in Active Directory facilitates running Set-ServerComponentState against a remote server.
In order to determine precedence in case of a divergence between the two places, a timestamp is used.
The newer setting is considered as the intended one.
In Active Directory, the settings are stored in the “msExchComponentStates” attribute of the Exchange
Server object in the Configuration-namespace:
In the Registry, the settings are stored under
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ServerComponentStates:
You can use the Get-ServerComponentState cmdlet from the Shell to retrieve these settings:
<variable>.LocalStates displays the settings in the local Registry and <variable>.RemoteStates displays
the settings in the Active Directory.
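A minimal sketch of that comparison (EX01 is a placeholder server name):

# Retrieve one component and compare the registry copy with the Active Directory copy.
$state = Get-ServerComponentState -Identity EX01 -Component HubTransport
$state.LocalStates    # settings stored in the server's registry
$state.RemoteStates   # settings stored in Active Directory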
Peculiarities of the Components “FrontendTransport” and “HubTransport”
Most Exchange Server components pick up changes in their Component State “on the fly”. However, this
is not the case for the two Components “FrontendTransport” (mapped to the “Microsoft Exchange
Frontend Transport” service on CAS Servers) and “HubTransport” (mapped to “Microsoft Exchange
Transport” on Mailbox Servers). They pick up changes upon their next restart only.
This can cause confusion about their actual state. For example, they might be displayed as “Inactive”
in Get-ServerComponentState but functionally still be active since they haven’t been restarted since their
state has changed.
In order to alert the Administrator about such an inconsistency, they write warnings into the Application
Event Log. These warnings (7011 for MSExchangeTransport and 7012 for MSExchangeFrontEndTransport)
inform the Administrator about the current and the expected state:
MA has Responders which take care of such inconsistencies and resolve them after a while by a forced
crash and restart of the affected services. These Responders are named
“FrontendTransport.ServiceInconsistentState.Restart.Responder” and
“HubTransport.ServiceInconsistentState.Restart.Responder”. They can be identified in
the Microsoft.Exchange.ActiveMonitoring/ResponderDefinition crimson channel using the
methodology described in the What Did Managed Availability Just Do To This Service? blog post.
In Exchange 2013 CU2 and CU3 they are only triggered once per day for standalone Mailbox Servers
and up to four times per day for DAG members. This might change in future versions, though.
After the inconsistencies have been cleared, Events 7009 from Source “MSExchangeTransport” (or
“MSExchangeFrontendTransport”) and Category “Components” are logged, which show the current
state:
When the state of one or both of these components is set to “Inactive”, each attempt to connect to the
SMTP service on the server (on TCP port 25) triggers the response “421 4.3.2 Service not active” (for the
FrontendTransport component) and “451 4.7.0 Temporary Server error” (for the HubTransport
component) and corresponding entries are written to the respective SmtpReceive Protocol Logs.
As mentioned in KB 2866822, messages sent from internal mailboxes stay in the Outbox or Drafts folder and
“Service not active” entries are logged in the “Connect*”-Logs in the “Submission”-Folder underneath
“TransportRoles\Logs\Mailbox\Connectivity\” when “HubTransport” is set to “Inactive”.
A problem with failing updates to CU2
While an Exchange 2013 Server is being updated with CU2, the setup process sets “Monitoring”,
“RecoveryActionsEnabled” and “ServerWideOffline” to Inactive using the Requester “Functional” at the
beginning, as can be seen in the “ExchangeSetup”-Logfile:
However, when the update exits prematurely because it encounters an unrecoverable error-condition, it
does not restore the original state. Even when the Administrator restarts all stopped Exchange services
or reboots the server, the Exchange components still remain in the Inactive state.
In order to recover from this situation, you must either find the root cause for the error and remove it
so that the setup completes successfully, or manually set the ServerComponentStates back to Active
with the Requester “Functional”.
This issue might be fixed in future CUs and SPs.
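A sketch of that manual recovery, using the same three components and the Requester that Setup used (EX01 is a placeholder server name):

# Return the components that Setup had set to Inactive back to Active.
Set-ServerComponentState EX01 -Component Monitoring -State Active -Requester Functional
Set-ServerComponentState EX01 -Component RecoveryActionsEnabled -State Active -Requester Functional
Set-ServerComponentState EX01 -Component ServerWideOffline -State Active -Requester Functional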
When should you change Server Component States manually?
Perhaps the two most important scenarios where manual changes of Server Component States come
into play are:
1. Planned server maintenance
2. Temporary isolation of some server components, so that they are not targets for proxy requests
from other CAS servers any more.
Server component states and planned maintenance of DAG
members
Scenario 1 is described for DAG-members in some detail in Managing Database Availability Groups on
TechNet. However, in practice, it can prove to be trickier than expected, depending on the details of the
planned maintenance measure.
The article does not mention that “FrontendTransport” and “HubTransport” have to be restarted before
a change in their Component State becomes effective at all. Without a restart, you continue to receive
warning events 7011 and 7012 in the Application Event Log, but the state of the components remains
the same as before, until eventually MA detects the inconsistency and force restarts the services.
Also, the problem with prematurely exiting CU setups mentioned above cannot be resolved by
following the recommendations in the article to the letter, because Setup uses the Requester
“Functional”, while the article talks about the Requester “Maintenance”.
Special thanks to Bhalchandra Atre and Stephen Gilbert for their contributions as well as Abram Jackson and
Bharat Suneja for their review.
From <https://blogs.technet.microsoft.com/exchange/2013/09/26/server-component-states-in-exchange-2013/>
Exchange Server 2013 Managed Availability
When Microsoft introduced Managed Availability in Exchange Server 2013, its appearance sowed quite
a bit of confusion in the Exchange world. The idea behind it—an automated system of health monitoring
that would watch critical components of your Exchange infrastructure and automatically take corrective
action to fix problems as they occurred—sounded great, but at its launch, Managed Availability was
poorly documented and largely misunderstood. Now that Exchange 2013 has been out in the field for a
while, both Microsoft and its customers are getting more operational experience with Managed
Availability. Understanding how it works and why it works that way will help you understand how
Managed Availability will affect your operating procedures and how to manage it to get the desired
outcomes. Note that you'll sometimes see references to Active Monitoring (AM) and Local Active
Monitoring (LAM) in the Managed Availability world. They're functional descriptions of the feature,
not real names, but the acronyms haven't been completely removed from the code base, event log
messages, and so on.
Defining the Data Center Downward
Microsoft, IBM, and many other enterprise-focused companies have long sought to build systems—in
the form of hardware, operating systems, and applications—that are resilient against failure or damage.
The goal of these efforts has been to bring mainframe-quality uptime to enterprise applications without
requiring the overhead and infrastructure required by these traditional systems.
We've all reaped the benefits. The redundant server hardware that's now almost a commodity used to
be found only in extremely demanding, budget-insensitive applications such as spaceflight, telephone
switching, and industrial control systems. Likewise, applications such as Microsoft SQL Server, Oracle's
database applications, and Microsoft Exchange have steadily gained more resiliency-focused features,
including clustering, transactional database logging, and a variety of application-specific protection
methods (e.g., Safety Net and shadow redundancy in Exchange). In general, these features focus on
detecting certain types of failures and automatically taking action to resolve them, such as activating a
database copy on another server. However, the next logical step in building resiliency into Exchange
required a departure from the previous means of doing so. Exchange needed more visibility into more
components of the system, as well as an expanded set of actions that it can take.
Advanced service monitoring relies on three complementary tasks. It has to monitor the state of every
interesting component, decide whether the data returned from monitoring indicates some type of
problem, then act to resolve the problem. This monitor-decide-act process has long been the province
of human administrators. You notice that something is wrong (perhaps as the result of a user report or
your own monitoring), you figure out what the problem is, then take one or more actions to fix the
problem. However, having humans responsible for that process doesn't scale well to very large
environments, such as the Exchange Online portion of Microsoft Office 365. Plus, it's tiresome for the
unlucky administrator who gets stuck having to fix problems on weekends and holidays. To address
these shortcomings and give Exchange more capability to self-diagnose and self-repair, Microsoft
delivered Managed Availability as part of Exchange 2013.
Understanding Managed Availability's Logical Design
Managed Availability is designed around three logical components: probe, monitor, and responder. The
probe runs tests against different aspects of Exchange. These tests can be performance based (e.g., how
long it takes for a logon transaction in Outlook Web App—OWA—to work), health based (e.g., whether
a particular service is currently running), or exception based (e.g., a monitored component generated
a bug check or another unusual event that indicates an immediate problem).
In Microsoft's words, the monitor "contains all of the business logic used by the system based on what
is considered healthy on the data collected." This is a fancy way of saying that the monitor is responsible
for interpreting data gathered by the probe to determine whether any action is required. Because the
monitor's behavior is specified by the same developers who wrote the code for each monitored
component, the monitor has intimate knowledge of how to tell whether a particular component is
healthy.
Managed Availability uses the terms "healthy" and "unhealthy" in pretty much the same sense as people
do. If a component is healthy, its performance and function are normal. An unhealthy component is
one that the monitor has decided isn't working properly. In addition to the basic healthy and unhealthy
states used by the monitor, there are other states that might appear when you check the state of a server.
(I'll discuss those later.)
The responder takes action based on what the monitor finds. These actions can range from restarting a
service to running a bug check on a server (thus forcing it to reboot) to logging an event in the event
log. Although logging an event might not seem like a very forceful step for the responder to take, the
idea is that the logged event will be picked up by a monitoring system such as Microsoft System Center
Operations Manager or Dell OpenManage, thus alerting a human, who can then take some kind of
action.
Understanding Managed Availability's Physical Design
Managed Availability is implemented by a pair of services that you'll find on every Exchange 2013
server: MSExchangeHMWorker.exe and MSExchangeHMHost.exe. MSExchangeHMWorker.exe is the
worker process that does the actual monitoring. MSExchangeHMHost.exe (shown as the Exchange
Health Manager Service in the Services snap-in) manages the worker process. The pairing of a
controller service with one or more worker processes is used throughout other parts of Exchange, such
as the Unified Messaging (UM) feature. You don't necessarily need to monitor the state of the worker
processes on your servers. However, you should monitor MSExchangeHMHost because if
MSExchangeHMHost isn't running, then no health monitoring will be performed on that server.
Microsoft doesn't support disabling the Managed Availability subsystem by stopping the
MSExchangeHMHost process. If Managed Availability is doing something you don't like, you should
tame it with the techniques I'll show you instead of turning it off completely.
The configuration settings for Managed Availability live in multiple places. This can be a little confusing
until you get used to it. The local settings for an individual server are stored in the registry at
HKLM\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides. When you override a
monitor on an individual server (which I'll discuss later), the override settings are stored in that
server's registry. Other settings, notably global overrides, are stored in Active Directory (AD) in
the Monitoring Settings container (CN=Overrides,CN=Monitoring Settings,CN=FM,CN=Microsoft
Exchange,CN=Services,CN=Configuration…). All the standard caveats about the need for healthy AD
replication thus apply to the proper functioning of Managed Availability.
Running Your First Health Check
The simplest way to start understanding Managed Availability is to use the Get-HealthReport cmdlet.
You can use it to learn about the states of all the health sets on the specified server. A health set is a
group of probes, monitors, and responders for a component. For example, if you want to check the
health sets for the server named WBSEXMR02, you'd run the command:
Get-HealthReport -Identity WBSEXMR02
When you run this command, you'll see output similar to that shown in Figure 1. The deluge of
information will probably tell you more than you wanted to know about what Managed Availability
thinks of the target server.
Figure 1: Obtaining the States of All the Health Sets on the Current Server
An Exchange component such as the Exchange ActiveSync (EAS) subsystem might have multiple health
sets associated with it, and each health set might contain multiple probes, monitors, and responders
that assess different aspects of items in the health set. One great example is OutlookRpcSelfTestProbe,
which has one probe for each mailbox database on the server. All of those probes roll up into the
Outlook.Protocol health set. Each reported health set includes a state (indicating whether it's online,
offline, or has no state associated with it), an alert value (indicating whether it's considered healthy or
not), and the last time the health state changed.
Understanding the Health Set States
It's useful to know the states that a health set can be in. Obviously, "healthy" is the preferred state.
When you see this state, it means that Managed Availability hasn't spotted anything wrong with the
components monitored in that health set. When a health set is shown as "unhealthy," it means that one
or more items monitored by that health set are broken in some way. As far as Managed Availability is
concerned, these are the two most important states. If a health set is healthy, it will be left alone. If it's
unhealthy, the responders will be engaged in the order specified by the developers until either the health
set becomes healthy again or the system runs out of responses and escalates the issue by notifying an
administrator about the problem.
There are four additional states that you might see when examining health reports:
 Degraded. The "degraded" state means that the monitored item has been unhealthy for less than 60 seconds. If it's not fixed by the 61st second, its status will change to "unhealthy."
 Disabled. The "disabled" state appears when you manually disable a monitor.
 Unavailable. The "unavailable" state appears when a monitor doesn't respond to queries from the health service. This is seldom a good thing and warrants investigation as soon as you see it.
 Repairing. The "repairing" state only appears when you set it as the state for a health set. It tells Managed Availability that you're aware of, and are fixing, problems with that particular component. For example, a mailbox database that you're reseeding would be labeled as unhealthy, so you'd set its status to "repairing" while you're working on it so that Managed Availability is aware that the failure is being addressed.
To indicate that you're manually fixing a problem, you use the Set-ServerMonitor cmdlet. For example,
if you need to make repairs on a server named HSVEX01, you'd run the command:
Set-ServerMonitor -Server HSVEX01 -Name Maintenance `
-Repairing $true
When you're done with the repairs, you'd run Set-ServerMonitor again, setting the -Repairing
parameter to $false.
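In other words, the closing command would look like this:

Set-ServerMonitor -Server HSVEX01 -Name Maintenance -Repairing $false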
Checking and Setting States
Components are logical objects that might contain multiple services or other objects. For example, the
FrontendTransport component is made up of multiple subcomponents, which you can see using the
Get-MonitoringItemIdentity cmdlet:
Get-MonitoringItemIdentity -Identity FrontendTransport
However, you can't manage those subcomponents as individual items. Instead, you need to use the Get-ServerComponentState and Set-ServerComponentState cmdlets on the component as a whole.
With the Get-ServerComponentState cmdlet, you can get a list of the components that Exchange
recognizes and those components' states. You use the -Identity parameter to specify the server. For
example, the following command lists the components and their states for the server named
WBSEXMR01:
Get-ServerComponentState WBSEXMR01
Figure 2 shows the results. If you want to see the state of a specific component, you can add the -Component parameter followed by the component's name.
Figure 2: Obtaining the States of All the Components on the Specified Server
You can use the Set-ServerComponentState cmdlet to manually change the states of individual
components on a server. There are many reasons why you might want to do this, including preparing
servers for maintenance or upgrade, or disabling a component that you don't want running on a
particular server.
To use Set-ServerComponentState, you must include four parameters:
 The -Component parameter. You use this parameter to specify the component whose state you're changing. You can specify the name of a subsystem or component (e.g., UMCallRouter, HubTransport) or the special value ServerWideOffline, which indicates that you want to change the state of all the components on the specified server.
 The -Identity parameter. You use this parameter to specify the name of the server on which you want to change the component state.
 The -Requester parameter. You use this parameter to specify why you're changing the state. Typically, you'll be specifying the value of Maintenance. The other possible values are HealthAPI, Sidelined, Functional, and Deployment.
 The -State parameter. You use this parameter to specify the desired state. It can be set to Active or Inactive for all components. Some components support a third state, Draining. If you set a component's state to Draining, it indicates that the component should finish processing existing connections, but it shouldn't accept or make new connections. For example, putting the HubTransport component into Draining state allows it to finish handling existing SMTP conversations, but it won't be able to participate in new conversations.
Although you can run Set-ServerComponentState any time, the most common reason to do so is when
you want to tell Managed Availability that you are starting or stopping planned maintenance on a
server. Doing so reduces the risk that Managed Availability will change the state of a component while
you're in the middle of working on it.
Preparing for Maintenance
In Exchange 2010, you typically use the StartDagServerMaintenance.ps1 script to indicate that you're
going to do maintenance on a database availability group (DAG) member server. The process for
performing maintenance on an Exchange 2013 DAG member is a little bit different. Here's how you'd
put a server named DWHDAG01 into maintenance mode:
1. Drain the transport queues by running the command:
Set-ServerComponentState DWHDAG01 `
-Component HubTransport -State Draining `
-Requester Maintenance
2. If your server is being used as a UM server, drain the UM calls by running the command:
Set-ServerComponentState DWHDAG01 `
-Component UMCallRouter -State Draining `
-Requester Maintenance
3. Put the server in maintenance mode by running the command:
Set-ServerComponentState DWHDAG01 `
-Component ServerWideOffline -State Inactive `
-Requester Maintenance
When you're done, you'd run Set-ServerComponentState on the same components but in reverse order
(ServerWideOffline, UMCallRouter, then HubTransport), putting them into the Active state.
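For reference, a sketch of that reverse sequence (same placeholder server, DWHDAG01):

# Take the server out of maintenance mode, then re-enable UM call routing and transport.
Set-ServerComponentState DWHDAG01 -Component ServerWideOffline -State Active -Requester Maintenance
Set-ServerComponentState DWHDAG01 -Component UMCallRouter -State Active -Requester Maintenance
Set-ServerComponentState DWHDAG01 -Component HubTransport -State Active -Requester Maintenance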
Turning Off a Service or Component
Microsoft tests Exchange as a complete set of services, so it doesn't necessarily support turning off
individual services, but sometimes you might need to do so anyway. Perhaps you want to troubleshoot
some aspect of your server's behavior, or you want to turn off services that you know you won't be using.
(The UM services sometimes meet this fate.) If you use the standard Windows service management
tools to stop an Exchange service, Managed Availability will see that as a failure and try to turn the
service back on, working through its responders as designed. The responders for a component might
cause the server to reboot or run a bug check, which could cause problems.
To avoid this situation, you could disable the service using Service Control Manager (SCM), but then
Managed Availability will become unhappy and report that the server's health is poor. The best option
is to turn off the managed service using the Set-ServerComponentState cmdlet. For example, if you
want to turn off the RecoveryActionsEnabled service on the DWHDAG01 server, you'd run the
command:
Set-ServerComponentState -Component RecoveryActionsEnabled `
-Identity DWHDAG01 -State Inactive -Requester Functional
Using Overrides
Managed Availability implements a sort of get-out-of-jail-free card in the form of overrides. An override
allows you to change the thresholds used by the monitor for determining whether a particular
component is healthy or change the action taken by the responder when a component becomes
unhealthy. Typically, you won't have to do this. However, there might be cases when it's necessary. For
example, when Microsoft shipped Cumulative Update 3 (CU3) for Exchange 2013, it added a new probe
for public folder access through Exchange Web Services that would cause the public folder subsystem
to be marked as unhealthy if you didn't have any public folders in Exchange 2013. The fix, as described
in the Microsoft Support article "PublicFolders health set is "Unhealthy" after you install Exchange
Server 2013 Cumulative Update 3," is to add an override to tell Managed Availability to stop caring
about that particular probe.
There are actually two types of overrides: server overrides, which apply to a single server, and global
overrides, which apply to all servers in the Exchange organization. You apply overrides using the Add-ServerMonitoringOverride or Add-GlobalMonitoringOverride cmdlet. Here are some examples from
the Microsoft Support article:
Add-GlobalMonitoringOverride -Identity `
"Publicfolders\PublicFolderLocalEWSLogonEscalate" `
-ItemType "Responder"-PropertyName Enabled `
-PropertyValue 0 -ApplyVersion "15.0.775.38"
Add-GlobalMonitoringOverride -Identity `
"Publicfolders\PublicFolderLocalEWSLogonMonitor" `
-ItemType "Monitor" -PropertyName Enabled `
-PropertyValue 0 -ApplyVersion "15.0.775.38"
Add-GlobalMonitoringOverride -Identity `
"Publicfolders\PublicFolderLocalEWSLogonProbe" `
-ItemType "Probe" -PropertyName Enabled `
-PropertyValue 0 -ApplyVersion "15.0.775.38"
Note that the cmdlet appears three times with the same value for the -Identity parameter: once to
disable the responder, once to disable the monitor, and once to disable the probe for the specified
object.
Dealing with Occasional Oddities in Managed Availability
Managed Availability is a complex subsystem, and you might find that it occasionally behaves in ways
you don't expect. For example, when you set the state of the FrontendTransport and HubTransport
components to Inactive or Draining, you might notice that you're still seeing event log IDs 7011 and
7012, indicating that Managed Availability thinks those services are down. Managed Availability will
eventually trigger a responder that restarts the services for you, or you can manually restart them to
restore the correct behavior. It's also sometimes the case that other operations confuse Managed
Availability so that it isn't aware of, or doesn't report, the correct state of the items it monitors. For
example, installing Exchange 2013 CU3 would sometimes make monitored services incorrectly appear
to be unhealthy. These problems are usually easy to fix by restarting the affected service or using Set-ServerComponentState. You also have the ability to create overrides for cases when you have no other
easy way to fix the problem. Problems like these are pretty rare, and they don't detract from the utility
of Managed Availability over the long term.
The Future of Managed Availability
Managed Availability has a huge amount of potential because of the nature of its design. The people
who wrote the code that makes up Exchange also wrote the Managed Availability probes, monitors, and
responders that monitor it, so as the Exchange code evolves and changes, Managed Availability can
keep pace. The idea of having a self-monitoring, self-healing Exchange server is an attractive one,
although in its current implementation it's limited to watching individual servers. The existing UI for
Managed Availability relies on the Exchange Management Shell (EMS), and that might not change,
although hopefully we'll see better integration between the monitoring tools available in the Exchange
Admin Center (EAC) and the health reports generated by Managed Availability. As Office 365 expands
in scale, Microsoft is likely to continue investing in Managed Availability's ability to correlate health
states across larger parts of the Exchange infrastructure, and those improvements will move over to on-premises implementations, too.
From <http://windowsitpro.com/exchange-server-2013/exchange-server-2013-managed-availability>
New Managed Availability Documentation Available
Ensuring that users have a good email experience has always been the primary objective for messaging system
administrators. To help ensure the availability and reliability of your messaging system, all aspects of the system must
be actively monitored, and any detected issues must be resolved quickly.
In previous versions of Exchange, monitoring critical system components often involved using an external application
such as Microsoft System Center 2012 Operations Manager to collect data, and to provide recovery action for problems
detected as a result of analyzing the collected data. Exchange 2010 and previous versions included health manifests
and correlation engines in the form of management packs. These components enabled Operations Manager to make
a determination as to whether a particular component was healthy or unhealthy. In addition, Operations Manager also
used the diagnostic cmdlet infrastructure built into Exchange 2010 to run synthetic transactions against various aspects
of the system.
Exchange 2013 takes a new approach to monitoring and preserving the end user experience natively using a feature
called Managed Availability that provides built-in monitoring and recovery actions.
Overview
Managed availability, also known as Active Monitoring or Local Active Monitoring, is the integration of built-in
monitoring and recovery actions with the Exchange high availability platform. It’s designed to detect and recover from
problems as soon as they occur and are discovered by the system. Unlike previous external monitoring solutions and
techniques for Exchange, managed availability doesn’t try to identify or communicate the root cause of an issue. It’s
instead focused on recovery aspects that address three key areas of the user experience:
 Availability: Can users access the service?
 Latency: How is the experience for users?
 Errors: Are users able to accomplish what they want?
Managed availability is an internal process that runs on every Exchange 2013 server. It polls and analyzes hundreds of
health metrics every second. If something is found to be wrong, most of the time it will be fixed automatically. But there
will always be issues that managed availability won’t be able to fix on its own. In those cases, managed availability will
escalate the issue to an administrator by means of event logging.
For more information about this new feature, see the newly published topic Managed Availability.
Health Sets
From a reporting perspective, managed availability has two views of health, one internal and one external. The internal
view uses health sets. Each component in Exchange 2013 (for example, Outlook Web App, Exchange ActiveSync, the
Information Store service, content indexing, transport services, etc.) is monitored by managed availability using probes,
monitors, and responders. A group of probes, monitors, and responders for a given component is called a health set; the
health set determines whether that component is healthy. The current state
of a health set (e.g., whether it is healthy or unhealthy) is determined by using the state of the health set’s monitors. If
all of a health set’s monitors are healthy, then the health set is in a healthy state. If any monitor is not in a healthy state,
then the health set state will be determined by its least healthy monitor.
For detailed steps to view server health or health sets state, see the newly published topic Manage Health Sets and
Server Health. For information on troubleshooting health sets, see this topic.
Health Groups
The external view of managed availability is composed of health groups. Health groups are exposed to System Center
Operations Manager 2007 R2 and System Center Operations Manager 2012.
There are four primary health groups:
 Customer Touch Points: Components that affect real-time user interactions, such as protocols, or the Information Store
 Service Components: Components without direct, real-time user interactions, such as the Microsoft Exchange Mailbox Replication service, or the offline address book generation process (OABGen)
 Server Components: The physical resources of the server, such as disk space, memory, and networking
 Dependency Availability: The server’s ability to access necessary dependencies, such as Active Directory, DNS, etc.
When the Exchange 2013 Management Pack is installed, System Center Operations Manager (SCOM) acts as a health
portal for viewing information related to the Exchange environment. The SCOM dashboard includes three views of
Exchange server health:
1. Active Alerts: Escalation Responders write events to the Windows event log that are consumed by the monitor within SCOM. These appear as alerts in the Active Alerts view.
2. Organization Health: A rollup summary of the overall health of the Exchange organization is displayed in this view. These rollups include displaying health for individual database availability groups, and health within specific Active Directory sites.
3. Server Health: Related health sets are combined into health groups and summarized in this view.
Overrides
Overrides provide an administrator with the ability to configure some aspects of the managed availability probes,
monitors, and responders. Overrides can be used to fine tune some of the thresholds used by managed availability.
They can also be used to enable emergency actions for unexpected events that may require configuration settings that
are different from the out-of-box defaults.
Overrides can be created and applied to a single server (this is known as a server override), or they can be applied to a
group of servers (this is known as a global override). Server override configuration data is stored in the Windows registry
on the server on which the override is applied. Global override configuration data is stored in Active Directory.
Overrides can be configured to last indefinitely, or they can be configured for a specific duration. In addition, global
overrides can be configured to apply to all servers, or only servers running a specific version of Exchange.
For detailed steps to view or configure server or global overrides, see Configure Managed Availability Overrides.
When you configure an override, it will not take effect immediately. The Microsoft Exchange Health Manager service
checks for updated configuration data every 10 minutes. In addition, global overrides will be dependent on Active
Directory replication latency.
Below are some examples of adding and removing global and server overrides:
Example 1 – Make Information Store maintenance assistant alerts non-urgent for 60 days:
Add-GlobalMonitoringOverride -Identity Store\MaintenanceAssistantEscalate -ItemType
Responder -PropertyName NotificationServiceClass -PropertyValue 1 -Duration
60.00:00:00
Example 2 – Change the maintenance assistant monitor to look for 32 hours of failures for 30 days:
Add-GlobalMonitoringOverride -Identity Store\DirectoryServiceAndStoreMaintenanceAssistantMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds -PropertyValue 115200 -Duration 30.00:00:00
Example 3 – Remove the maintenance assistant monitor override added in Example 2:
Remove-GlobalMonitoringOverride -Identity Store\DirectoryServiceAndStoreMaintenanceAssistantMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds
Example 4 – Remove the Information Store maintenance assistant alerts non-urgent override added in
Example 1:
Remove-GlobalMonitoringOverride -Identity Store\MaintenanceAssistantEscalate -ItemType
Responder -PropertyName NotificationServiceClass
Example 5 – Apply the database repeatedly mounting threshold override (change to 60 minutes) for a period
of 60 days:
Add-GlobalMonitoringOverride -Identity Store\DatabaseRepeatedMountsMonitor -ItemType
Monitor -PropertyName MonitoringIntervalSeconds -PropertyValue 3600 -Duration
60.00:00:00
Example 6 – Remove the database repeatedly mounting threshold override added in Example 5:
Remove-GlobalMonitoringOverride -Identity Store\DatabaseRepeatedMountsMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds
Example 7 – Change the database dismounted alert from HA to Store for a period of 7 days:
Add-GlobalMonitoringOverride -Identity Store\DatabaseAvailabilityEscalate -ItemType Responder -PropertyName ExtensionAttributes.Microsoft.Mapi.MapiExceptionMdbOffline -PropertyValue Store -Duration 7.00:00:00
Example 8 – Disable VersionBucketsAllocated monitor for a period of 60 days:
Add-GlobalMonitoringOverride -Identity Store\VersionBucketsAllocatedMonitor -ItemType
Monitor -PropertyName Enabled -PropertyValue 0 -Duration 60.00:00:00
Example 9 – Update logs threshold in DatabaseSize monitor for a period of 60 days:
Add-GlobalMonitoringOverride -Identity MailboxSpace\DatabaseSizeMonitor -ItemType Monitor -PropertyName ExtensionAttributes.DatabaseLogsThreshold -PropertyValue 100GB -Duration 60.00:00:00
Example 10 – Applying a server override to disable quarantine monitor across all database copies for a period
of 7 days:
(get-mailboxDatabase <DB Name>).servers | %{Add-ServerMonitoringOverride -Server $_.name -Identity "Store\MailboxQuarantinedMonitor\<DB Name>" -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration:7.00:00:00 -Confirm:$false;}
Management Tasks and Cmdlets
There are three primary operational tasks that administrators will typically perform with respect to managed availability:
 Extracting or viewing system health
 Viewing health sets, and details about probes, monitors, and responders
 Managing overrides
The two primary management tools for managed availability are the Windows Event Log and the Shell. Managed
availability logs a large amount of information in the Exchange ActiveMonitoring and ManagedAvailability crimson
channel event logs, such as:
 Probe, monitor, and responder definitions, which are logged in the respective *Definition event logs.
 Probe, monitor, and responder results, which are logged in the respective *Results event logs.
 Details about responder recovery actions, including when the recovery action is started and when it is considered complete (whether successful or not), which are logged in the RecoveryActionResults event log.
There are 12 cmdlets used for managed availability, which are described in the following list.
Get-ServerHealth: Used to get raw server health information, such as health sets and their current state (healthy or unhealthy), health set monitors, server components, target resources for probes, and timestamps related to probe or monitor start or stop times, and state transition times.
Get-HealthReport: Used to get a summary health view that includes health sets and their current state.
Get-MonitoringItemIdentity: Used to view the probes, monitors, and responders associated with a specific health set.
Get-MonitoringItemHelp: Used to view descriptions about some of the properties of probes, monitors, and responders.
Add-ServerMonitoringOverride: Used to create a local, server-specific override of a probe, monitor, or responder.
Get-ServerMonitoringOverride: Used to view a list of local overrides on the specified server.
Remove-ServerMonitoringOverride: Used to remove a local override from a specific server.
Add-GlobalMonitoringOverride: Used to create a global override for a group of servers.
Get-GlobalMonitoringOverride: Used to view a list of global overrides configured in the organization.
Remove-GlobalMonitoringOverride: Used to remove a global override.
Set-ServerComponentState: Used to configure the state of one or more server components.
Get-ServerComponentState: Used to view the state of one or more server components.
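Of the 12, Get-HealthReport is the quickest way to get a first impression of a server. The following is a minimal sketch; EX01 is a placeholder server name, and on some Exchange 2013 builds the server parameter is -Server rather than -Identity.
# Summary view of every health set on one server (EX01 is a placeholder)
Get-HealthReport -Identity EX01 | Format-Table HealthSet, State, AlertValue, LastTransitionTime -AutoSize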
From <https://blogs.technet.microsoft.com/scottschnoll/2013/11/25/new-managed-availability-documentation-available/>
Responding to Managed Availability
I’ve written a few blog posts now that get into the deep technical details of Managed Availability. I hope
you’ve liked them, and I’m not about to stop! However, I’ve gotten a lot of feedback that we also need
some simpler overview articles. Fortunately, we’ve just completed documentation on TechNet with an
overview of Managed Availability. This was written to address how the feature may be managed day-to-day.
Even that documentation doesn’t address how you respond when Managed Availability cannot resolve
a problem on its own. This is the most common interaction with Managed Availability, but we
haven't described specifically how to do so.
When Managed Availability is unable to recover the health of a server, it logs an event. Exchange Server
has a long history of logging warning, error, and critical events into various channels when things go
wrong. However, there are two things about Managed Availability events that make them more generally
useful than our other error events:
 They all go to the same place on a server, without any clutter
 They will only be logged when the standard recovery actions fail to restore the health of the component
When one of these events is logged on any server in our datacenters, a member of the product group team responsible for that health set gets an immediate phone call.
No one likes to wake up at 2 AM to investigate and fix a problem with a server. This keeps us motivated
to only have Managed Availability alerts for problems that really matter, and also to eliminate the cause
of the alert by fixing underlying code bugs or automating the recovery. At the same time, there is nothing
worse than finding out about incidents from customer calls to support. Every time that happens we have
painful meetings about how we should have detected the condition first and woken someone up. These
two conflicting forces strongly motivate the entire engineering team to keep these events accurate and
useful.
The GUI
Along with a phone call, the on-call engineer receives an email with some information about the failure.
The contents of this email are pulled from the event’s description.
The path in Event Viewer for these events is Microsoft-Exchange-ManagedAvailability/Monitoring. Error
event 4 means that a health set has failed and gives the details of the monitor that has detected the
failure. Information event 1 means that all monitors of a health set have become healthy.
The Exchange 2013 Management Pack for System Center Operations Manager nicely shows only the
health sets that are currently failed instead of the Event Viewer’s method of displaying all health sets
that have ever failed. SCOM will also roll health sets up into four primary health groups, presented in three dashboard views.
The Shell
This wouldn’t be EHLO without some in-depth PowerShell scripts. The event viewer is nice and SCOM is
great, but not everyone has SCOM. It would be pretty sweet to get the same behavior as SCOM to show
only the health sets on a server that are currently failed.
Note: these logs serve a slightly different purpose than Get-HealthReport. Get-HealthReport shows the
current health state of all of a server's monitors. On the other hand, events are only logged in this channel
once all the recovery actions for that monitor have been exhausted without fixing the problem. Also
know that these events detail the failure. If you're only going to take action based on one health metric,
the events in this log are the better choice. Get-HealthReport is still the best tool to show you the up-to-the-minute user experience.
We have a sample script that can help you with this; it is commented in a way that you can see what we
were trying to accomplish. You can get the Get-ManagedAvailabilityAlerts.ps1 script here.
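If you only want a quick look without downloading the script, a bare-bones query against the channel described above is enough; the script goes further and shows only the health sets whose most recent event is still the error. A sketch:
# Escalation events (error 4) and recovery events (informational 1) from the last 24 hours
$log = 'Microsoft-Exchange-ManagedAvailability/Monitoring'
Get-WinEvent -FilterHashtable @{ LogName = $log; Id = 4; StartTime = (Get-Date).AddDays(-1) } -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message
Get-WinEvent -FilterHashtable @{ LogName = $log; Id = 1; StartTime = (Get-Date).AddDays(-1) } -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message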
Either this method or Event Viewer will work pretty well for a handful of servers. If you have tens or
hundreds of servers, we really recommend investing in SCOM or another robust and scalable event-collection system.
My other posts have dug deeply into troubleshooting difficult problems and into the immense amount of
information Managed Availability provides about a server's health. We rarely need to use these
troubleshooting methods when running our datacenters. However, the only thing you need to resolve
Exchange problems the way we do in Office 365 is a little time in Event Viewer or a scheduled script.
Abram Jackson
Program Manager, Exchange Server
From <https://blogs.technet.microsoft.com/exchange/2013/12/16/responding-to-managed-availability/>
Managed Availability
Managed availability, also known as Active Monitoring or Local Active Monitoring, is the integration of built-in monitoring and
recovery actions with the Exchange high availability platform. It's designed to detect and recover from problems as soon as
they occur and are discovered by the system. Unlike previous external monitoring solutions and techniques for Exchange,
managed availability doesn't try to identify or communicate the root cause of an issue. It's instead focused on recovery
aspects that address three key areas of the user experience:
 Availability Can users access the service?
 Latency How is the experience for users?
 Errors Are users able to accomplish what they want?
Managed availability provides a native health monitoring and recovery solution. It moves away from monitoring individual
separate slices of the system to monitoring the end-to-end user experience, and protecting the end user's experience through
recovery-oriented actions.
Managed availability is an internal process that runs on every Exchange 2016 server. It polls and analyzes hundreds of health
metrics every second. If something is found to be wrong, most of the time it will be fixed automatically. But there will always
be issues that managed availability won’t be able to fix on its own. In those cases, managed availability will escalate the issue
to an administrator by means of event logging.
Managed availability is implemented in the form of two services:
 Exchange Health Manager Service (MSExchangeHMHost.exe) This is a controller process used to manage
worker processes. It's used to build, execute, and start and stop the worker process, as needed. It's also used to
recover the worker process in case that process fails, to prevent the worker process from being a single point of
failure.
 Exchange Health Manager Worker process (MSExchangeHMWorker.exe) This is the worker process
responsible for performing run-time tasks within the managed availability framework.
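To quickly confirm that both pieces are alive on a server, something like the following is enough. The MSExchangeHM service name is an assumption; the two process names come from the bullets above.
# Health Manager controller service (service name MSExchangeHM is an assumption)
Get-Service MSExchangeHM
# Controller and worker processes described above
Get-Process MSExchangeHMHost, MSExchangeHMWorker | Select-Object Name, Id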
Managed availability uses persistent storage to perform its functions:
 XML files in the \bin\Monitoring\config folder are used to store configuration settings for some of the probe and
monitor work items.
 Active Directory is used to store global overrides.
 The Windows registry is used to store run-time data, such as bookmarks, and local (server-specific) overrides.
 The Windows crimson channel event log infrastructure is used to store the work item results.
 Health mailboxes are used for probe activity. Multiple health mailboxes will be created on each mailbox database
that exists on the server.
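The health mailboxes in the last bullet can be listed directly; the -Monitoring switch limits Get-Mailbox to them.
# Health mailboxes used for probe activity
Get-Mailbox -Monitoring | Format-Table Name, Database -AutoSize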
Managed Availability Components
As illustrated in the following drawing, managed availability includes three main asynchronous components that are
constantly doing work.
Managed Availability Components
Probes
The first component is called a Probe. Probes are responsible for taking measurements on the server and collecting data.
There are three primary categories of probes: recurrent probes, notifications, and checks. Recurrent probes are synthetic
transactions performed by the system to test the end-to-end user experience. Checks are the infrastructure that perform the
collection of performance data, including user traffic. Checks also measure the collected data against thresholds that are set
to determine spikes in user failures, which enable the checks infrastructure to become aware when users are experiencing
issues. Finally, the notification logic enables the system to take action immediately, based on a critical event, and without
having to wait for the results of the data collected by a probe. These are typically exceptions or conditions that can be
detected and recognized without a large sample set.
Recurrent probes run every few minutes and evaluate some aspect of service health. These probes might transmit an email
via Exchange ActiveSync to a monitoring mailbox, they might connect to an RPC endpoint, or they might verify Client Access-to-Mailbox connectivity.
All probes are defined on Health Manager service startup in the Microsoft.Exchange.ActiveMonitoring\ProbeDefinition
crimson channel. Each probe definition has many properties, but the most relevant properties are:
 Name The name of the probe, which begins with a SampleMask of the probe’s monitor.
 TypeName The code object type of the probe that contains the probe’s logic.
 ServiceName The name of the health set that contains this probe.
 TargetResource The object the probe is validating. This is appended to the name of the probe when it is executed
to become a probe result ResultName.
 RecurrenceIntervalSeconds How often the probe executes.
 TimeoutSeconds How long the probe will wait before failing.
There are hundreds of recurrent probes. Many of these probes are per-database, so as the number of databases increases,
so does the number of probes. Most probes are defined in code and are therefore not directly discoverable.
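Because the definitions are written to the ProbeDefinition channel at startup, you can still enumerate them there. The sketch below assumes the Microsoft-Exchange-ActiveMonitoring/ProbeDefinition log name and the event XML layout (event > userData > eventXml) these definition events normally use.
# List probe definitions with the properties described above
Get-WinEvent -LogName 'Microsoft-Exchange-ActiveMonitoring/ProbeDefinition' |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Select-Object Name, ServiceName, TargetResource, RecurrenceIntervalSeconds, TimeoutSeconds |
    Sort-Object ServiceName, Name |
    Format-Table -AutoSize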
The basics of a recurrent probe are as follows: start every RecurrenceIntervalSeconds and check (or probe) some aspect of
health. If the component is healthy, the probe passes and writes an informational event to the
Microsoft.Exchange.ActiveMonitoring\ProbeResult channel with a ResultType of 3. If the check fails or times out, the probe
fails and writes an error event to the same channel. A ResultType of 4 means the check failed and a ResultType of 1 means
that it timed out. Many probes will re-run if they time out, up to the value of the MaxRetryAttempts property.
Note:
The ProbeResult crimson channel can get very busy with hundreds of probes running every few minutes and logging an
event, so there can be a real impact on the performance of your Exchange server if you try expensive queries against
the event logs in a production environment.
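When you do need to look at probe results, keep the query narrow: filter on a time window, cap the number of events, and only then filter on the result type. A sketch, assuming the Microsoft-Exchange-ActiveMonitoring/ProbeResult log name and the ResultName/ResultType/Error fields in the event XML:
# Probe failures (ResultType 4) from the last hour, capped at 200 events
Get-WinEvent -FilterHashtable @{
        LogName   = 'Microsoft-Exchange-ActiveMonitoring/ProbeResult'
        StartTime = (Get-Date).AddHours(-1)
    } -MaxEvents 200 |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Where-Object { $_.ResultType -eq 4 } |
    Select-Object ResultName, ResultType, Error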
Notifications are probes that are not run by the health manager framework, but by some other service on the server. These
services perform their own monitoring, and then feed their data into the Managed Availability framework by directly writing
probe results. You won’t see these probes in the ProbeDefinition channel, as this channel only describes probes that will be
run by the Managed Availability framework. For example, the ServerOneCopyMonitor Monitor is triggered by probe results
written by the MSExchangeDAGMgmt service. This service performs its own monitoring, determines whether there is a
problem, and logs a probe result. Most notification probes have the capability to log both a red event that turns the monitor
unhealthy and a green event that makes the monitor healthy again.
Checks are probes that only log events when a performance counter passes above or below a defined threshold. They are
really a special case of notification probes, as there is a service monitoring the performance counters on the server and
logging events to the ProbeResult channel when the configured threshold is met.
To find the counter and threshold that is considered unhealthy, you can look at the monitor for this check. Monitors of the
type Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueAboveThresholdMonitor or
Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueBelowThresholdMonitor mean that the probe they watch is a check probe.
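A sketch of that lookup, assuming the Microsoft-Exchange-ActiveMonitoring/MonitorDefinition log name and that the definition events expose SampleMask and MonitoringThreshold fields:
# Monitors that watch check probes, with the threshold they apply
Get-WinEvent -LogName 'Microsoft-Exchange-ActiveMonitoring/MonitorDefinition' |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Where-Object { $_.TypeName -like '*SampleValue*ThresholdMonitor' } |
    Select-Object Name, SampleMask, MonitoringThreshold |
    Format-Table -AutoSize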
Monitors
The results of the measurements collected by probes flow into the second component, the Monitor. The monitor contains all
of the business logic used by the system on the data collected. Similar to a pattern recognition engine, the monitor looks for
the various different patterns on all the collected measurements, and then it decides whether something is considered
healthy.
Monitors query the data to determine if action needs to be taken based on a predefined rule set. Depending on the rule or
the nature of the issue, a monitor can either initiate a responder or escalate the issue to a human via an event log entry. In
addition, monitors define how much time after a failure that a responder is executed, as well as the workflow of the recovery
action. Monitors have various states. From a system state perspective, monitors have two states:
 Healthy The monitor is operating properly and all collected metrics are within normal operating parameters.
 Unhealthy The monitor isn't healthy and has either initiated recovery through a responder or notified an
administrator through escalation.
From an administrative perspective, monitors have additional states that appear in the Shell:
 Degraded When a monitor is in an unhealthy state from 0 through 60 seconds, it's considered Degraded. If a
monitor is unhealthy for more than 60 seconds, it is considered Unhealthy.
 Disabled The monitor has been explicitly disabled by an administrator.
 Unavailable The Exchange Health service periodically queries each monitor for its state. If it doesn't get a response
to the query, the monitor state becomes Unavailable.
 Repairing An administrator sets the Repairing state to indicate to the system that corrective action is in process by
a human, which allows the system and humans to differentiate between other failures that may occur at the same
time corrective action is being taken (such as a database copy reseed operation).
Every monitor has a SampleMask property in its definition. As the monitor executes, it looks for events in the ProbeResult
channel that have a ResultName that matches the monitor’s SampleMask. These events could be from recurrent probes,
notifications, or checks. If the monitor’s thresholds are achieved, it becomes Unhealthy. From the monitor’s perspective, all
three probe types are the same as they each log to the ProbeResult channel.
It is worth noting that a single probe failure does not necessarily indicate that something is wrong with the server. It is the
design of monitors to correctly identify when there is a real problem that needs fixing. This is why many monitors have
thresholds of multiple probe failures before becoming Unhealthy. Even then, many of these problems can be fixed
automatically by responders, so the best place to look for problems that require manual intervention is in the
Microsoft.Exchange.ManagedAvailability\Monitoring crimson channel. This will include the most recent probe error.
Responders
Finally, there are Responders, which are responsible for recovery and escalation actions. As their name implies, responders
execute some sort of response to an alert that was generated by a monitor. When something is unhealthy, the first action is
to attempt to recover that component. This could include multi-stage recovery actions; for example, the first attempt may be
to restart the application pool, the second may be to restart the service, the third attempt may be to restart the server, and
the subsequent attempt may be to take the server offline so that it no longer accepts traffic. If the recovery actions are
unsuccessful, the system escalates the issue to a human through event log notifications.
Responders take a variety of recovery actions, such as resetting an application worker pool or restarting a server. There are
several types of responders:
 Restart Responder Terminates and restarts a service.
 Reset AppPool Responder Stops and restarts an application pool in Internet Information Services (IIS).
 Failover Responder Initiates a database or server failover.
 Bugcheck Responder Initiates a bugcheck of the server, thereby causing a server reboot.
 Offline Responder Takes a protocol on a server out of service (rejects client requests).
 Online Responder Places a protocol on a server back into production (accepts client requests).
 Escalate Responder Escalates the issue to an administrator via event logging.
In addition to the above listed responders, some components also have specialized responders that are unique to their
component.
All responders include throttling behavior, which provide a built-in sequencing mechanism for controlling responder actions.
The throttling behavior is designed to ensure that the system isn’t compromised or made worse as a result of responder
recovery actions. All responders are throttled in some fashion. When throttling occurs, the responder recovery action may be
skipped or delayed, depending on the responder action. For example, when the Bugcheck Responder is throttled, its action
is skipped, and not delayed.
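The recovery actions that responders take are recorded in the RecoveryActionResults log mentioned earlier. A quick look, assuming the full channel name is Microsoft-Exchange-ManagedAvailability/RecoveryActionResults:
# The 20 most recent responder recovery actions on this server
Get-WinEvent -LogName 'Microsoft-Exchange-ManagedAvailability/RecoveryActionResults' -MaxEvents 20 |
    Select-Object TimeCreated, Id, Message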
Health Sets
From a reporting perspective, managed availability has two views of health, one internal and one external.
The internal view uses health sets. Each component in Exchange 2016 (for example, Outlook on the web, Exchange ActiveSync,
the Information Store service, content indexing, transport services, etc.) is monitored by managed availability using probes,
monitors, and responders. A group of probes, monitors and responders for a given component is called a health set. A health
set is a group of probes, monitors, and responders that determine if that component is healthy. The current state of a health
set (e.g., whether it is healthy or unhealthy) is determined by using the state of the health set’s monitors. If all of a health
set’s monitors are healthy, then the health set is in a healthy state. If any monitor is not in a healthy state, then the health set
state will be determined by its least healthy monitor.
For detailed steps to view server health or health set state, see Manage health sets and server health.
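Since a health set takes the state of its least healthy monitor, the practical follow-up to an unhealthy health set is to list the monitors dragging it down. A sketch (EX01 and the OWA health set are placeholders):
# Monitors in one health set that are currently not healthy
Get-ServerHealth -Identity EX01 -HealthSet OWA |
    Where-Object { $_.AlertValue -ne 'Healthy' } |
    Format-Table Name, TargetResource, AlertValue -AutoSize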
Health Groups
The external view of managed availability is composed of health groups. Health groups are exposed to System Center
Operations Manager 2012 R2.
There are four primary health groups:
 Customer Touch Points Components that affect real-time user interactions, such as protocols, or the Information
Store.
 Service Components Components without direct, real-time user interactions, such as the Microsoft Exchange
Mailbox Replication service, or the offline address book generation process (OABGen).
 Server Components The physical resources of the server, such as disk space, memory and networking.
 Dependency Availability The server’s ability to access necessary dependencies, such as Active Directory, DNS, etc.
When the Exchange Management Pack is installed, System Center Operations Manager (SCOM) acts as a health portal for
viewing information related to the Exchange environment. The SCOM dashboard includes three views of Exchange server
health:
 Active Alerts Escalation Responders write events to the Windows event log that are consumed by the monitor within
SCOM. These appear as alerts in the Active Alerts view.
 Organization Health A roll up summary of the overall health of the Exchange organization health is displayed in this
view. These rollups include displaying health for individual database availability groups, and health within specific
Active Directory sites.
 Server Health Related health sets are combined into health groups and summarized in this view.
Overrides
Overrides provide an administrator with the ability to configure some aspects of the managed availability probes, monitors,
and responders. Overrides can be used to fine tune some of the thresholds used by managed availability. They can also be
used to enable emergency actions for unexpected events that may require configuration settings that are different from the
out-of-box defaults.
Overrides can be created and applied to a single server (this is known as a server override), or they can be applied to a group
of servers (this is known as a global override). Server override configuration data is stored in the Windows registry on the
server on which the override is applied. Global override configuration data is stored in Active Directory.
Overrides can be configured to last indefinitely, or they can be configured for a specific duration. In addition, global overrides
can be configured to apply to all servers, or only servers running a specific version of Exchange.
When you configure an override, it will not take effect immediately. The Microsoft Exchange Health Manager service checks
for updated configuration data every 10 minutes. In addition, global overrides will be dependent on Active Directory
replication latency.
For detailed steps to view or configure server or global overrides, see Configure managed availability overrides.
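After creating an override, you can confirm it was recorded with the matching Get- cmdlets, keeping in mind that the Health Manager service only picks the change up on its next 10-minute configuration pass. A sketch (EX01 is a placeholder):
# Overrides stored in Active Directory (apply to groups of servers)
Get-GlobalMonitoringOverride | Format-Table ItemType, PropertyName, PropertyValue -AutoSize
# Overrides stored in the local registry of a single server
Get-ServerMonitoringOverride -Server EX01 | Format-Table ItemType, PropertyName, PropertyValue -AutoSize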
Management Tasks and Cmdlets
There are three primary operational tasks that administrators will typically perform with respect to managed availability:
 Extracting or viewing system health
 Viewing health sets, and details about probes, monitors and responders
 Managing overrides
The two primary management tools for managed availability are the Windows Event Log and the Shell. Managed availability
logs a large amount of information in the Exchange ActiveMonitoring and ManagedAvailability crimson channel event logs,
such as:
 Probe, monitor, and responder definitions, which are logged in the respective *Definition event logs.
 Probe, monitor, and responder results, which are logged in the respective *Results event logs.
 Details about responder recovery actions, including when the recovery action is started and when it is considered complete
(whether successful or not), which are logged in the RecoveryActionResults event log.
There are 12 cmdlets used for managed availability, which are described in the following list.
Get-ServerHealth: Used to get raw server health information, such as health sets and their current state (healthy or unhealthy), health set monitors, server components, target resources for probes, and timestamps related to probe or monitor start or stop times, and state transition times.
Get-HealthReport: Used to get a summary health view that includes health sets and their current state.
Get-MonitoringItemIdentity: Used to view the probes, monitors, and responders associated with a specific health set.
Get-MonitoringItemHelp: Used to view descriptions about some of the properties of probes, monitors, and responders.
Add-ServerMonitoringOverride: Used to create a local, server-specific override of a probe, monitor, or responder.
Get-ServerMonitoringOverride: Used to view a list of local overrides on the specified server.
Remove-ServerMonitoringOverride: Used to remove a local override from a specific server.
Add-GlobalMonitoringOverride: Used to create a global override for a group of servers.
Get-GlobalMonitoringOverride: Used to view a list of global overrides configured in the organization.
Remove-GlobalMonitoringOverride: Used to remove a global override.
Set-ServerComponentState: Used to configure the state of one or more server components.
Get-ServerComponentState: Used to view the state of one or more server components.
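Of the less obvious cmdlets in the list, Get-MonitoringItemIdentity is handy for seeing exactly which probes, monitors, and responders make up a health set. A sketch (the ActiveSync health set and server EX01 are placeholders):
# Enumerate the probes, monitors, and responders that belong to one health set
Get-MonitoringItemIdentity -Identity ActiveSync -Server EX01 |
    Format-Table Name, ItemType, TargetResource -AutoSize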
Make sense of the new Managed Availability feature
The new platform built into Exchange 2013 may ease your monitoring burden. Our Exchange MVP breaks down how
Managed Availability works and how it's designed to do its job.
Managed Availability is one of Exchange 2013's interesting new features. It is essentially Exchange's built-in monitoring and
remediation platform.
It's the latter part -- remediation -- that continues to confuse Exchange admins. Because Managed Availability is tightly
integrated into Exchange 2013, it will take mitigating actions whenever it discovers an issue. Actions depend on the issue,
but rebooting the server -- referred to as a "bugcheck" -- is one example.
Before diving into the nuts and bolts of Managed Availability, let's have a look at how the feature works and how it's
designed to do its job.
How Managed Availability maintains a server's health
Managed Availability uses a multi-layered approach when it comes to evaluating and maintaining a server's health. At
the foundation, a series of tests or polls (counters, for example) are executed; these tests are referred to as probes. Each
of the probes yields a result, which is in turn inspected by the second layer in Managed Availability: monitors.
As the name implies, monitors are the logic that interprets the results that the different probes yield. How a monitor
interprets the result of a probe is programmatically determined; it's based on Microsoft's experience with the product and
likely to be influenced by what the company sees is happening in Office 365.
Depending on the results of the probes, monitors will either do nothing or kick off so-called responders. Responders are
responsible for executing specific actions to try and remediate the potential issue one of the probes discovered. An example
of a responder action is restarting a service.
When the service restarts, a probe might be able to receive different -- and better -- results the next time it runs. The
monitor might then determine the service is healthy again. However, sometimes the results are still negative, which could
occur when the service stops again or because it's not performing well.
In this case of negative results, many scenarios are possible. If the monitor determines the service is still unhealthy, it might
use a different responder. This responder might then take a different action (for instance, restarting the server). The actions
taken and the order of the actions depend on the definition of the monitor and what the component is. For some
components, multiple responders will be tried before the issue escalates, so you don't have to worry that a failure in a
low-impact service will immediately cause a server reboot. Additionally, responders are throttled, so certain actions can only be taken once or
twice a day.
If none of the actions solves the issue, Managed Availability escalates it by raising an alert in the event logs, notifying an
Exchange admin that an issue exists and should be examined. This event can then be picked up by
whatever monitoring option you have, which in turn can alert the admin. The escalation of an issue is simply another responder
that is programmed to create the alert (Figure 1).
Figure 1
Health Manager Service and the worker process
Managed Availability is composed of two different services or processes, much like the new Exchange 2013 store service.
The Health Manager Service is the parent process that controls the Health Manager Worker, the child process. The worker
process is responsible for executing the different tasks Managed Availability has to perform.
The process hierarchy is clearly exposed when using Process Explorer (Figure 2), for example.
Figure 2
The Health Manager Service isn't only responsible for starting or stopping the worker process -- it also ensures the worker
process works correctly. If it finds that the process hangs (or that it didn't start), it will restart the process as needed.
In a multi-site environment, the Health Manager Service itself is also being watched -- not by a process on the server
itself, but by the Health Manager Service on another Exchange server.
A closer look at how Managed Availability stores
information
To learn Managed Availability's inner workings, Exchange admins must understand where it stores information and how to
access it.
There are a number of new features in Exchange 2013, but one feature admins should know about is Managed
Availability....
It's built into Exchange as a monitoring and remediation platform -- and it's the remediation part that can be confusing.
Once you've learned what Managed Availability is and how it looks on Exchange Server, you can understand where
Managed Availability stores information and how to access it.
Managed Availability mainly stores information for configuration and logging in two different places. The better part of its
configuration information is stored in a number of xml files in the <installdrive>:\Program
Files\Microsoft\Exchange\v15\Bin\Monitoring\Config folder.
Whenever the Health Manager Service starts up, it reads these files and uses the settings defined in them. While it's possible to alter
these files, I don't recommend it. Not only is it dangerous -- an error might prevent the service from starting up -- but most
of the settings are undocumented or poorly documented. These files should be a last resort when you need to make a
change.
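If you only want to see what's there without touching anything, listing the folder is harmless. A small sketch, assuming the ExchangeInstallPath environment variable that Exchange setup normally defines:
# Read-only look at the monitoring configuration files
$configPath = Join-Path $env:ExchangeInstallPath 'Bin\Monitoring\Config'
Get-ChildItem -Path $configPath -Filter *.xml |
    Sort-Object LastWriteTime -Descending |
    Select-Object Name, Length, LastWriteTime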
Active Directory & server registry
In addition to the local configuration files, Managed Availability also uses Active Directory and the
server's local registry to store additional configuration information. More specifically, overrides are
either stored in Active Directory -- so they're available to all servers -- or in the server's local
registry, if they only apply to the local server.
Event logs
The event logs' Crimson Channel stores information for logging. As mentioned, Managed Availability
logs every action it takes. At startup, the probe, monitor, and responder definitions are stored in the
corresponding event logs under Microsoft/Exchange/ActiveMonitoring (Figure 1).
Figure 1
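To discover the full set of channels on a server, you can list them by name pattern; this assumes the Microsoft-Exchange-ActiveMonitoring/* and Microsoft-Exchange-ManagedAvailability/* naming used elsewhere in this document.
# Crimson channels used by Managed Availability
Get-WinEvent -ListLog '*Exchange*ActiveMonitoring*', '*Exchange*ManagedAvailability*'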
Managed Availability structures the information it stores into HealthSets, Probes, Monitors and Overrides. Here's a look at each.
HealthSets. Managed Availability uses a set of hundreds of probes, monitors and responders to determine a server's health.
To keep track of all of them, each set that relates to a specific component is grouped into a so-called HealthSet. Running
the Get-ServerHealth cmdlet on an Exchange Server reveals these HealthSets (Figure 2).
Figure 2
For instance, there is a HealthSet named "ActiveSync." This HealthSet groups all of the probes, monitors and responders
responsible for monitoring and mitigating the server's ActiveSync component. To view what monitors are part of this
HealthSet, you can also use the Get-ServerHealth cmdlet and use the -HealthSet parameter to narrow down the
results (Figure 3), as such:
Get-ServerHealth <servername> -HealthSet ActiveSync
Figure 3
Probes. Probes query or test a specific component on the server. There are a number of probe types, ranging from simple
ones, which will fetch the value of a specific performance counter, to more complex ones, which will carry out a battery of
tests like mimicking a user's behavior. These tests are also referred to as "synthetic transactions." Probes just fetch the
information and carry out tests; they don't evaluate the results or values that are returned.
To see what probes run on a specific server, you can have a look at the ProbeDefinition event log in the Crimson Channel.
That's where the Health Manager Service writes the probes that will run on the system when it starts. The easiest way to
get the information is through the GUI, or you can use PowerShell (Figure 4).
Figure 4
The relevant portion of the information is written in XML format, but is a little more readable in Friendly View. Typically,
there are two types of information you would get from probes: what probes are running on the system and what resources
(e.g., Mailbox Database) they run on.
Having that information is the first step to understanding what actions Managed Availability might have taken to remediate
a problem it found.
Monitors. The next step to decipher Managed Availability is looking at the different monitors that exist on the system.
Similar to probes, these monitors are stored in the Crimson Channel's Monitor Definition Event log.
Monitors behave in different ways and have different stages. Every time a monitor fails, it will move to the next stage,
which will call upon a different responder to solve the issue. If the issue isn't solved after a set number of failed attempts,
it escalates to an admin. The TimeOutInSeconds values in a monitor's StateTransitionXml property define when the monitor shifts into a new stage (Figure 5).
By using the Get-WinEvent cmdlet on the MonitorDefinition Event Log, you can see the different stages that are
configured for a specific monitor; this article describes the process. The GUI is also an option, but PowerShell is more
efficient. Note that there are different stages that exist for the AutodiscoverProxyTest monitor (Figure 5).
Figure 5
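A sketch of that Get-WinEvent approach, assuming the Microsoft-Exchange-ActiveMonitoring/MonitorDefinition log name; the exact field name for the transition data may differ slightly between builds, so inspect the raw event XML if the last column comes back empty.
# Stages configured for a specific monitor (the name filter is a placeholder)
Get-WinEvent -LogName 'Microsoft-Exchange-ActiveMonitoring/MonitorDefinition' |
    ForEach-Object { ([xml]$_.ToXml()).event.userData.eventXml } |
    Where-Object { $_.Name -like '*AutodiscoverProxy*' } |
    Select-Object Name, ServiceName, StateTransitionXml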
A monitor can have multiple states, most of which aren't exposed to the admin. Typically, an admin would only see a
monitor being Healthy or Unhealthy. The other states (Degraded, Unhealthy, Unrecoverable) are hidden and only visible
through PowerShell.
Overrides. Sometimes you don't want a specific monitor to run as configured -- because it causes more trouble than it helps,
or because one of the configured defaults doesn't meet your requirements. Using monitoring overrides, you can
reconfigure a monitor to use other threshold values. For example, a popular monitoring override is used to change the
threshold Managed Availability uses to determine if there's enough free disk space left on a database log disk.
You can use the following command to reset the value to 10 GB from the default value and configure it to last for 90 days.
A duration-based override can't last indefinitely, so keep track of when an override expires so you can reconfigure
it afterward.
Add-GlobalMonitoringOverride -Identity MailboxSpace\DatabaseSizeMonitor -ItemType Monitor -PropertyName ExtensionAttributes.DatabaseLogsThreshold -PropertyValue 10GB -Duration 90.00:00:00
This command will add a global monitoring override that applies to all servers in the environment. That's also why global
overrides are stored in Active Directory. If you want the override to apply only to a single server, you should use the Add-ServerMonitoringOverride cmdlet instead.