When Microsoft introduced Managed Availability in Exchange Server 2013, its appearance sowed quite a bit of confusion in the Exchange world. The idea behind it—an automated system of health monitoring that would watch critical components of your Exchange infrastructure and automatically take corrective action to fix problems as they occurred—sounded great, but at its launch, Managed Availability was poorly documented and largely misunderstood. Now that Exchange 2013 has been out in the field for a while, both Microsoft and its customers are getting more operational experience with Managed Availability. Understanding how it works and why it works that way will help you understand how Managed Availability will affect your operating procedures and how to manage it to get the desired outcomes. Note that you'll sometimes see references to Active Monitoring (AM) and Local Active Monitoring (LAM) in the Managed Availability world. They're functional descriptions of the feature, not real names, but the acronyms haven't been completely removed from the code base, event log messages, and so on.

Defining the Data Center Downward

Microsoft, IBM, and many other enterprise-focused companies have long sought to build systems—in the form of hardware, operating systems, and applications—that are resilient against failure or damage. The goal of these efforts has been to bring mainframe-quality uptime to enterprise applications without requiring the overhead and infrastructure required by these traditional systems.

We've all reaped the benefits. The redundant server hardware that's now almost a commodity used to be found only in extremely demanding, budget-insensitive applications such as spaceflight, telephone switching, and industrial control systems. Likewise, applications such as Microsoft SQL Server, Oracle's database applications, and Microsoft Exchange have steadily gained more resiliency-focused features, including clustering, transactional database logging, and a variety of application-specific protection methods (e.g., Safety Net and shadow redundancy in Exchange). In general, these features focus on detecting certain types of failures and automatically taking action to resolve them, such as activating a database copy on another server. However, the next logical step in building resiliency into Exchange required a departure from the previous means of doing so. Exchange needed more visibility into more components of the system, as well as an expanded set of actions that it can take.

Advanced service monitoring relies on three complementary tasks. It has to monitor the state of every interesting component, decide whether the data returned from monitoring indicates some type of problem, then act to resolve the problem. This monitor-decide-act process has long been the province of human administrators. You notice that something is wrong (perhaps as the result of a user report or your own monitoring), you figure out what the problem is, then take one or more actions to fix the problem. However, having humans responsible for that process doesn't scale well to very large environments, such as the Exchange Online portion of Microsoft Office 365. Plus, it's tiresome for the unlucky administrator who gets stuck having to fix problems on weekends and holidays. To address these shortcomings and give Exchange more capability to self-diagnose and self-repair, Microsoft delivered Managed Availability as part of Exchange 2013.

Understanding Managed Availability's Logical Design

Managed Availability is designed around three logical components: probe, monitor, and responder. The probe runs tests against different aspects of Exchange. These tests can be performance based (e.g., how long it takes for a logon transaction in Outlook Web App—OWA—to work), health based (e.g., whether a particular service is currently running), or exception based (e.g., a monitored component generated a bug check or another unusual event that indicates an immediate problem).

In Microsoft's words, the monitor "contains all of the business logic used by the system based on what is considered healthy on the data collected." This is a fancy way of saying that the monitor is responsible for interpreting data gathered by the probe to determine whether any action is required. Because the monitor's behavior is specified by the same developers who wrote the code for each monitored component, the monitor has intimate knowledge of how to tell whether a particular component is healthy.

Managed Availability uses the terms "healthy" and "unhealthy" in pretty much the same sense as people do. If a component is healthy, its performance and function are normal. An unhealthy component is one that the monitor has decided isn't working properly. In addition to the basic healthy and unhealthy states used by the monitor, there are other states that might appear when you check the state of a server. (I'll discuss those later.)

The responder takes action based on what the monitor finds. These actions can range from restarting a service to running a bug check on a server (thus forcing it to reboot) to logging an event in the event log. Although logging an event might not seem like a very forceful step for the responder to take, the idea is that the logged event will be picked up by a monitoring system such as Microsoft System Center Operations Manager or Dell OpenManage, thus alerting a human, who can then take some kind of action.

Understanding Managed Availability's Physical Design

Managed Availability is implemented by a pair of services that you'll find on every Exchange 2013 server: MSExchangeHMWorker.exe and MSExchangeHMHost.exe. MSExchangeHMWorker.exe is the worker process that does the actual monitoring. MSExchangeHMHost.exe (shown as the Exchange Health Manager Service in the Services snap-in) manages the worker process. The pairing of a controller service with one or more worker processes is used throughout other parts of Exchange, such as the Unified Messaging (UM) feature. You don't necessarily need to monitor the state of the worker processes on your servers, but you should monitor MSExchangeHMHost: if it isn't running, no health monitoring will be performed on that server. Microsoft doesn't support disabling the Managed Availability subsystem by stopping the MSExchangeHMHost process. If Managed Availability is doing something you don't like, you should tame it with the techniques I'll show you instead of turning it off completely.
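A quick sanity check along these lines confirms that the health manager is up. This is a sketch that matches on the service's display name rather than assuming its short service name, which may differ from the executable name:

```powershell
# Confirm the Exchange Health Manager service (the MSExchangeHMHost.exe
# controller described above) is running on this server. Matching on the
# display name avoids assuming the registered service name.
Get-Service -DisplayName "*Health Manager*" |
  Format-Table -AutoSize Name, Status
```

If the service shows as Stopped, no probes are running and no responders will fire on that server until it's started again.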

The configuration settings for Managed Availability live in multiple places, which can be a little confusing until you get used to it. The local settings for an individual server are stored in the registry at HKLM\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides. When you override a monitor on an individual server (which I'll discuss later), the override settings are stored in that server's registry. Other settings, notably global overrides, are stored in Active Directory (AD) in the Monitoring Settings container (CN=Overrides,CN=Monitoring Settings,CN=FM,CN=Microsoft Exchange,CN=Services,CN=Configuration…). All the standard caveats about the need for healthy AD replication thus apply to the proper functioning of Managed Availability.
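If you're curious what server-local overrides exist on a given machine, you can walk the registry path just mentioned. A minimal sketch, run on the Exchange server itself:

```powershell
# List any server-local Managed Availability overrides stored under the
# registry path described above. On a server with no overrides, the
# subkeys will simply be empty.
$path = 'HKLM:\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides'
Get-ChildItem -Path $path -Recurse |
  ForEach-Object { $_.Name }
```

This is read-only spelunking; actually creating overrides should be done with the cmdlets covered later, not by editing the registry directly.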

Running Your First Health Check

The simplest way to start understanding Managed Availability is to use the Get-HealthReport cmdlet. You can use it to learn about the states of all the health sets on the specified server. A health set is a group of probes, monitors, and responders for a component. For example, if you want to check the health sets for the server named WBSEXMR02, you'd run the command:

Get-HealthReport -Identity WBSEXMR02

When you run this command, you'll see output similar to that shown in Figure 1. The deluge of information will probably tell you more than you wanted to know about what Managed Availability thinks of the target server.

Figure 1: Obtaining the States of All the Health Sets on the Current Server

An Exchange component such as the Exchange ActiveSync (EAS) subsystem might have multiple health sets associated with it, and each health set might contain multiple probes, monitors, and responders that assess different aspects of items in the health set. One great example is OutlookRpcSelfTestProbe, which has one probe instance for each mailbox database on the server. All of those probes roll up into the Outlook.Protocol health set. Each reported health set includes a state (indicating whether it's online, offline, or has no state associated with it), an alert value (indicating whether it's considered healthy or not), and the last time the health state changed.
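In practice, the full report is often too much to scan, and what you usually want is just the health sets that aren't healthy. A sketch of that filter, using the same sample server as above and assuming the alert value is exposed through the AlertValue property:

```powershell
# Show only the health sets that Managed Availability doesn't
# currently consider healthy, rather than the full report.
Get-HealthReport -Identity WBSEXMR02 |
  Where-Object { $_.AlertValue -ne 'Healthy' } |
  Format-Table -AutoSize HealthSet, State, AlertValue
```

On a happy server this returns nothing, which is exactly what you want to see.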

Understanding the Health Set States

It's useful to know the states that a health set can be in. Obviously, "healthy" is the preferred state. When you see this state, it means that Managed Availability hasn't spotted anything wrong with the components monitored in that health set. When a health set is shown as "unhealthy," it means that one or more items monitored by that health set are broken in some way. As far as Managed Availability is concerned, these are the two most important states. If a health set is healthy, it will be left alone. If it's unhealthy, the responders will be engaged in the order specified by the developers until either the health set becomes healthy again or the system runs out of responses and escalates the issue by notifying an administrator about the problem.

There are four additional states that you might see when examining health reports:

  • Degraded. The "degraded" state means that the monitored item has been unhealthy for less than 60 seconds. If it's not fixed by the 61st second, its status will change to "unhealthy."
  • Disabled. The "disabled" state appears when you manually disable a monitor.
  • Unavailable. The "unavailable" state appears when a monitor doesn't respond to queries from the health service. This is seldom a good thing and warrants investigation as soon as you see it.
  • Repairing. The "repairing" state only appears when you set it as the state for a health set. It tells Managed Availability that you're aware of, and are fixing, problems with that particular component. For example, a mailbox database that you're reseeding would be labeled as unhealthy, so you'd set its status to "repairing" while you're working on it so that Managed Availability is aware that the failure is being addressed.

To indicate that you're manually fixing a problem, you use the Set-ServerMonitor cmdlet. For example, if you need to make repairs on a server named HSVEX01, you'd run the command:

Set-ServerMonitor -Server HSVEX01 -Name Maintenance `
  -Repairing $true

When you're done with the repairs, you'd run Set-ServerMonitor again, setting the -Repairing parameter to $false.
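The corresponding cleanup command looks like this:

```powershell
# When the repair work is finished, clear the Repairing flag so that
# Managed Availability resumes normal handling of the health set.
Set-ServerMonitor -Server HSVEX01 -Name Maintenance -Repairing $false
```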

Checking and Setting States

Components are logical objects that might contain multiple services or other objects. For example, the FrontendTransport component is made up of multiple subcomponents, which you can see using the Get-MonitoringItemIdentity cmdlet:

Get-MonitoringItemIdentity -Identity FrontendTransport

However, you can't manage those subcomponents as individual items. Instead, you need to use the Get-ServerComponentState and Set-ServerComponentState cmdlets on the component as a whole.

With the Get-ServerComponentState cmdlet, you can get a list of the components that Exchange recognizes and those components' states. You use the -Identity parameter to specify the server. For example, the following command lists the components and their states for the server named WBSEXMR01:

Get-ServerComponentState WBSEXMR01

Figure 2 shows the results. If you want to see the state of a specific component, you can add the -Component parameter followed by the component's name.

Figure 2: Obtaining the States of All the Components on the Specified Server
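For example, to check just the HubTransport component on that server rather than the full list, you'd run a command like this:

```powershell
# Query the state of a single component instead of every component
# on the server.
Get-ServerComponentState -Identity WBSEXMR01 -Component HubTransport
```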

You can use the Set-ServerComponentState cmdlet to manually change the states of individual components on a server. There are many reasons why you might want to do this, including preparing servers for maintenance or an upgrade, or disabling a component that you don't want running on a particular server.

To use Set-ServerComponentState, you must include four parameters:

  • The -Component parameter. You use this parameter to specify the component whose state you're changing. You can specify the name of a subsystem or component (e.g., UMCallRouter, HubTransport) or the special value ServerWideOffline, which indicates that you want to change the state of all the components on the specified server.
  • The -Identity parameter. You use this parameter to specify the name of the server on which you want to change the component state.
  • The -Requester parameter. You use this parameter to specify why you're changing the state. Typically, you'll be specifying the value of Maintenance. The other possible values are HealthAPI, Sidelined, Functional, and Deployment.
  • The -State parameter. You use this parameter to specify the desired state. It can be set to Active or Inactive for all components. Some components support a third state, Draining. If you set a component's state to Draining, it indicates that the component should finish processing existing connections, but it shouldn't accept or make new connections. For example, putting the HubTransport component into Draining state allows it to finish handling existing SMTP conversations, but it won't be able to participate in new conversations.

Although you can run Set-ServerComponentState any time, the most common reason to do so is when you want to tell Managed Availability that you are starting or stopping planned maintenance on a server. Doing so reduces the risk that Managed Availability will change the state of a component while you're in the middle of working on it.

Preparing for Maintenance

In Exchange 2010, you typically used the StartDagServerMaintenance.ps1 script to indicate that you were going to do maintenance on a database availability group (DAG) member server. The process for performing maintenance on an Exchange 2013 DAG member is a little different. Here's how you'd put a server named DWHDAG01 into maintenance mode:

  1. Drain the transport queues by running the command:
Set-ServerComponentState DWHDAG01 `
  -Component HubTransport -State Draining `
  -Requester Maintenance
  2. If your server is being used as a UM server, drain the UM calls by running the command:
Set-ServerComponentState DWHDAG01 `
  -Component UMCallRouter -State Draining `
  -Requester Maintenance
  3. Put the server in maintenance mode by running the command:
Set-ServerComponentState DWHDAG01 `
  -Component ServerWideOffline -State Inactive `
  -Requester Maintenance

When you're done, you'd run Set-ServerComponentState on the same components but in reverse order (ServerWideOffline, UMCallRouter, then HubTransport), putting them into the Active state.
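Putting that together, taking DWHDAG01 back out of maintenance mode looks like this:

```powershell
# Reverse of the maintenance-mode sequence: bring the server's
# components back online, then reactivate UM call routing and
# transport so the server resumes normal work.
Set-ServerComponentState DWHDAG01 -Component ServerWideOffline `
  -State Active -Requester Maintenance
Set-ServerComponentState DWHDAG01 -Component UMCallRouter `
  -State Active -Requester Maintenance
Set-ServerComponentState DWHDAG01 -Component HubTransport `
  -State Active -Requester Maintenance
```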

Turning Off a Service or Component

Microsoft tests Exchange as a complete set of services, so it doesn't necessarily support turning off individual services, but sometimes you might need to do so anyway. Perhaps you want to troubleshoot some aspect of your server's behavior, or you want to turn off services that you know you won't be using. (The UM services sometimes meet this fate.) If you use the standard Windows service management tools to stop an Exchange service, Managed Availability will see that as a failure and try to turn the service back on, working through its responders as designed. The responders for a component might cause the server to reboot or run a bug check, which could cause problems.

To avoid this situation, you could disable the service using Service Control Manager (SCM), but then Managed Availability will become unhappy and report that the server's health is poor. The best option is to turn off the managed component using the Set-ServerComponentState cmdlet. For example, if you want to turn off the RecoveryActionsEnabled component on the DWHDAG01 server, you'd run the command:

Set-ServerComponentState -Component RecoveryActionsEnabled `
  -Identity DWHDAG01 -State Inactive -Requester Functional

Using Overrides

Managed Availability implements a sort of get-out-of-jail-free card in the form of overrides. An override allows you to change the thresholds used by the monitor for determining whether a particular component is healthy or change the action taken by the responder when a component becomes unhealthy. Typically, you won't have to do this. However, there might be cases when it's necessary. For example, when Microsoft shipped Cumulative Update 3 (CU3) for Exchange 2013, it added a new probe for public folder access through Exchange Web Services that would cause the public folder subsystem to be marked as unhealthy if you didn't have any public folders in Exchange 2013. The fix, as described in the Microsoft Support article "PublicFolders health set is "Unhealthy" after you install Exchange Server 2013 Cumulative Update 3," is to add an override to tell Managed Availability to stop caring about that particular probe.

There are actually two types of overrides: server overrides, which apply to a single server, and global overrides, which apply to all servers in the Exchange organization. You apply overrides using the Add-ServerMonitoringOverride or Add-GlobalMonitoringOverride cmdlet. Here are some examples from the Microsoft Support article:

Add-GlobalMonitoringOverride -Identity `
  "Publicfolders\PublicFolderLocalEWSLogonEscalate" `
  -ItemType "Responder" -PropertyName Enabled `
  -PropertyValue 0 -ApplyVersion "15.0.775.38"
Add-GlobalMonitoringOverride -Identity `
  "Publicfolders\PublicFolderLocalEWSLogonMonitor" `
  -ItemType "Monitor" -PropertyName Enabled `
  -PropertyValue 0 -ApplyVersion "15.0.775.38"
Add-GlobalMonitoringOverride -Identity `
  "Publicfolders\PublicFolderLocalEWSLogonProbe" `
  -ItemType "Probe" -PropertyName Enabled `
  -PropertyValue 0 -ApplyVersion "15.0.775.38"

Note that the cmdlet appears three times with the same value for the -Identity parameter: once to disable the responder, once to disable the monitor, and once to disable the probe for the specified object.
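After applying overrides, it's worth confirming they were recorded. A sketch using Get-GlobalMonitoringOverride, the query counterpart of the Add- cmdlet used above (the exact output columns may vary by cumulative update):

```powershell
# List the global monitoring overrides currently stored in AD so you
# can verify that all three public folder overrides were recorded.
Get-GlobalMonitoringOverride |
  Format-Table -AutoSize Identity, ItemType, PropertyName, PropertyValue
```

Server-local overrides have a matching Get-ServerMonitoringOverride cmdlet that takes the server name as its identity.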

Dealing with Occasional Oddities in Managed Availability

Managed Availability is a complex subsystem, and you might find that it occasionally behaves in ways you don't expect. For example, when you set the state of the FrontendTransport and HubTransport components to Inactive or Draining, you might notice that you're still seeing event log IDs 7011 and 7012, indicating that Managed Availability thinks those services are down. Managed Availability will eventually trigger a responder that restarts the services for you, or you can manually restart them to restore the correct behavior. It's also sometimes the case that other operations confuse Managed Availability so that it isn't aware of, or doesn't report, the correct state of the items it monitors. For example, installing Exchange 2013 CU3 would sometimes make monitored services incorrectly appear to be unhealthy. These problems are usually easy to fix by restarting the affected service or using Set-ServerComponentState. You also have the ability to create overrides for cases when you have no other easy way to fix the problem. Problems like these are pretty rare, and they don't detract from the utility of Managed Availability over the long term.
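When you do need to restart the transport services by hand, something along these lines works. MSExchangeTransport and MSExchangeFrontEndTransport are assumed here to be the Windows service names for the back-end and front-end transport roles; verify the names on your own server first:

```powershell
# Restart the transport services after Managed Availability loses track
# of their state. Confirm the service names on your server with
# Get-Service *transport* before running this.
Restart-Service -Name MSExchangeTransport
Restart-Service -Name MSExchangeFrontEndTransport
```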

The Future of Managed Availability

Managed Availability has a huge amount of potential because of the nature of its design. The people who wrote the code that makes up Exchange also wrote the Managed Availability probes, monitors, and responders that monitor it, so as the Exchange code evolves and changes, Managed Availability can keep pace. The idea of having a self-monitoring, self-healing Exchange server is an attractive one, although in its current implementation it's limited to watching individual servers. The existing UI for Managed Availability relies on the Exchange Management Shell (EMS), and that might not change, although hopefully we'll see better integration between the monitoring tools available in the Exchange Admin Center (EAC) and the health reports generated by Managed Availability. As Office 365 expands in scale, Microsoft is likely to continue investing in Managed Availability's ability to correlate health states across larger parts of the Exchange infrastructure, and those improvements will move over to on-premises implementations, too.