Health Monitor 2.1 oversees your Web clusters

Imagine a fault-tolerant and scalable Web-based application environment that's simple to administer. Picture yourself relaxing at your desk while your servers manage themselves and report any problems that occur. Microsoft Application Center 2000 facilitates the easy creation of Web clusters and the deployment of applications and COM+ components. The software also provides load-balancing capabilities, balancing loads at the server or component level. And Application Center's inclusion of advanced monitoring capabilities through Microsoft Health Monitor 2.1 provides peace of mind. To start monitoring your clusters and automating responses to events and collected data, you need a detailed analysis of Application Center's monitoring capabilities.

Introducing Health Monitor
Health Monitor collects data and fires configured actions in response to that data. To create this monitoring and reporting environment, you need to configure four primary components: an action, a threshold, a data collector, and a data group. Together, these components make up a monitor. The data collector defines what information the software collects, and how often. The threshold determines the point at which Health Monitor reacts to that data, and the action is the reaction that takes place. To permit easy management and a reporting hierarchy, the software contains these components within data groups. Let's look closely at these components and how they interact to create a monitor.
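To make the relationship between the four components concrete, here is a minimal Python sketch. It is only a model of the concepts: Health Monitor is configured through its console, and every class, field, and function name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical model of a monitor; none of these names come from Health Monitor.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}

@dataclass
class Threshold:
    limit: float   # value at which the monitor reacts
    state: str     # state to enter: "warning" or "critical"

@dataclass
class DataCollector:
    name: str
    interval_seconds: int              # how often data is collected
    read_value: Callable[[], float]    # what data is collected
    thresholds: List[Threshold] = field(default_factory=list)
    actions: List[Callable[[str, str], None]] = field(default_factory=list)

@dataclass
class DataGroup:
    name: str                          # container used for management and reporting
    collectors: List[DataCollector] = field(default_factory=list)

def poll(collector: DataCollector) -> str:
    """Collect one sample, pick the most severe matching state, fire actions."""
    value = collector.read_value()
    state = "ok"
    for threshold in collector.thresholds:
        if value >= threshold.limit and SEVERITY[threshold.state] > SEVERITY[state]:
            state = threshold.state
    if state != "ok":
        for action in collector.actions:
            action(collector.name, state)
    return state
```

In this sketch, an action is simply a callable that receives the collector name and the new state; a real action would send mail, write a log entry, or run a script.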

Actions. An action must be available before you can apply it in a monitor, so you first need to configure an action. You can choose from five preconfigured actions: Email an administrator, Log to offline.log, Log to websitefailures.log, Take server offline, and Take server online. If none of these actions suit your needs, you can use the following templates to create a custom action:

  • Command-line action: You specify a filename and any command-line options.
  • Email action: You specify an SMTP server, message, and recipient.
  • Text-log action: You specify the name of the .log file and the logged message.
  • Windows event-log action: You specify the event type and text of the event that the Windows event log shows.
  • Script action: You specify the script's type and name.
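As a rough illustration of what two of these templates do, the following Python sketch mimics a text-log action and a command-line action. The function names are hypothetical, and the code stands in for the behavior rather than reproducing how Health Monitor implements it.

```python
import subprocess
from datetime import datetime

def text_log_action(log_path, message):
    """Append a timestamped message to a .log file (hypothetical helper)."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{datetime.now().isoformat()} {message}\n")

def command_line_action(filename, options):
    """Run a program with command-line options and return its exit code."""
    completed = subprocess.run([filename, *options], check=False)
    return completed.returncode
```

A monitor would call one of these functions when a threshold fires, passing the values you would otherwise enter in the action's dialog box.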

Because these custom actions can run command-line instructions and scripts, a broad range of possibilities is available. An action also contains a configurable schedule that lets you determine when the action is available. Therefore, the software can generate specific actions depending on when a condition meets a certain threshold or can prevent an action from occurring at an inappropriate time. You can configure actions at the data collector level, the data group level, and the server level.
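The schedule check can be pictured as a simple time-window test. The sketch below is an assumption about how such a check might work, not a description of Health Monitor's internals; the function name and the half-open window are illustrative choices.

```python
from datetime import time

def action_available(now, start, end):
    """Return True if `now` falls inside the action's schedule window.

    Windows that cross midnight (for example, 22:00 to 06:00) are handled too.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end
```

For example, `action_available(time(23, 30), time(22, 0), time(6, 0))` is true, so an action scheduled overnight would be allowed to fire at 11:30 P.M.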

Thresholds. You configure a threshold in conjunction with a data collector. The threshold defines the point in the data collection at which a condition triggers (aka fires) an action. For example, the data collector might be collecting information about processor usage. Suppose you've set a threshold at 80 percent utilization. When processor usage reaches 80 percent, the monitor enters a warning or critical state, depending on the threshold's configuration. You can configure an action for either state.

You can specify how many times the threshold must be reached within a specific period of time before the monitor's state changes and the action is fired. This configurable parameter permits spikes in activity that don't represent a problem. Consider the processor example: You probably wouldn't be concerned about 80 percent processor-usage spikes during application deployment or other intensive tasks. However, you would be concerned if processor usage is continually running higher than 80 percent or spiking repeatedly during typical operation.
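One way to picture this parameter is as a sliding window that counts how often the limit was reached. The following Python sketch assumes that model; the class name, parameter names, and timestamps in seconds are all hypothetical.

```python
from collections import deque

class OccurrenceThreshold:
    """Change state only after `required_hits` breaches within `window_seconds`."""

    def __init__(self, limit, required_hits, window_seconds):
        self.limit = limit
        self.required_hits = required_hits
        self.window_seconds = window_seconds
        self._hits = deque()   # timestamps of samples that reached the limit

    def sample(self, timestamp, value):
        """Record one sample; return True when the state should change."""
        if value >= self.limit:
            self._hits.append(timestamp)
        # Discard breaches that have aged out of the window.
        while self._hits and timestamp - self._hits[0] > self.window_seconds:
            self._hits.popleft()
        return len(self._hits) >= self.required_hits
```

With `OccurrenceThreshold(80, 3, 60)`, a single spike to 95 percent is ignored, but three readings at or above 80 percent within one minute change the monitor's state.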

The software also lets you create multiple thresholds for different levels of processor activity. Therefore, you can separately configure warning behavior and critical behavior.

Data collectors. As you would expect, the data collector collects the data that the monitor uses. A data collector can be a performance monitor, a service monitor, a process monitor, a Windows event-log monitor, a COM+ application, an HTTP monitor, a TCP/IP port monitor, an Internet Control Message Protocol (ICMP) monitor, a Windows Management Instrumentation (WMI) instance, a WMI event query, or a WMI query. Each monitor's configurable parameters depend on the type of data collected. To achieve the functionality you want, you must accurately collect the correct data.

You can assign one or more configured actions and thresholds to a data collector. If a condition meets a threshold, the monitor enters a warning or critical state (depending on the threshold configuration) and the relevant action occurs. When you configure a data collector, you can define the point at which the monitor returns to a state of equilibrium—either when the condition no longer matches the configured threshold or when an administrator manually resets it.
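The two reset behaviors can be sketched as a small state holder. This is a simplified model under the assumption that a monitor has exactly one alert state at a time; the class and method names are hypothetical.

```python
class MonitorState:
    """Tracks a monitor's state with automatic or manual reset."""

    def __init__(self, manual_reset=False):
        self.manual_reset = manual_reset
        self.state = "ok"

    def update(self, threshold_met, alert_state="critical"):
        """Apply one evaluation of the threshold and return the resulting state."""
        if threshold_met:
            self.state = alert_state
        elif not self.manual_reset:
            # Automatic reset: the condition no longer matches the threshold.
            self.state = "ok"
        return self.state

    def reset(self):
        """Manual reset, as an administrator would perform it."""
        self.state = "ok"
```

With `manual_reset=True`, the monitor stays in its alert state after the condition clears, which keeps a transient failure visible until someone acknowledges it.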

Data groups. A data group is a container that holds configured monitors. It lets you separate your monitors for administrative purposes and lets you establish a reporting and monitoring hierarchy. When a monitor in a particular group enters a warning or critical state, the data group's state also changes, but you can assign different actions to respond to this change. For example, suppose a monitor changes state and experiences the Take server offline action. The monitor's change of state within its data group also changes the group's state, thereby firing the Email an administrator action at the data-group level. The action at the data-group level occurs regardless of which monitor changed state.

Health Monitor lets you nest data groups so that each parent inherits state changes from its child. The monitored server inherits state changes from the top-level data group, so you can configure an action that will take place if any data group changes state.
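State inheritance can be modeled as a tree in which each node reports the most severe state among itself and its children. The Python sketch below assumes that model; the `Node` class and its callback are hypothetical, and the group names in the usage note are only examples.

```python
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}

class Node:
    """A monitor, data group, or server in the hierarchy (hypothetical model)."""

    def __init__(self, name, parent=None, on_change=None):
        self.name = name
        self.parent = parent
        self.on_change = on_change   # action fired when this node's state changes
        self.children = []
        self.own_state = "ok"
        self.state = "ok"
        if parent is not None:
            parent.children.append(self)

    def set_state(self, state):
        self.own_state = state
        self._refresh()

    def _refresh(self):
        # A node's state is the worst of its own state and its children's states.
        candidates = [self.own_state] + [child.state for child in self.children]
        new_state = max(candidates, key=SEVERITY.__getitem__)
        if new_state != self.state:
            self.state = new_state
            if self.on_change:
                self.on_change(self.name, new_state)
            if self.parent is not None:
                self.parent._refresh()
```

If you build `server -> Synchronized Monitors -> Web Site Monitors -> monitor` and set the monitor to critical, every ancestor up to the server becomes critical, and an `on_change` callback attached to the server fires once.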

Application Center uses Health Monitor to create two default data groups: Synchronized Monitors and Non-Synchronized Monitors. The contents of the Synchronized Monitors data group synchronize across all of a cluster's servers. Because a monitor must reside on a server to be functional, this synchronization can save much duplication of effort. Synchronization within a cluster occurs from the controller to every other cluster member, so you should create the Synchronized Monitors data group on a cluster controller. Alternatively, a monitor might apply to only one server. You should create such a monitor in the Non-Synchronized Monitors data group on the server, and that monitor will be functional on only that server.

When Application Center installs Health Monitor, it populates the Synchronized Monitors container with default monitors for basic monitoring of the cluster members (e.g., processor and memory utilization, HTTP response times) and information about the cluster service. Health Monitor also creates sample monitors for monitoring Application Center and additional products (e.g., Microsoft SQL Server, Microsoft BizTalk Server 2000, Microsoft Commerce Server 2000) and for monitoring such items as COM+ applications and Web sites.

Figure 1 shows the Health Monitor interface. You can see the hierarchical relationship between the thresholds, data collector, data groups, and server: The thresholds in the top-right pane relate to the Cluster Service data collector. This data collector is in the Application Center Monitors data group, which is on the homeap server.

You now understand the basic steps of creating a monitor. Let's take a look at how Application Center can help you manage your clusters.

Default Monitors
Application Center configures several default monitors, which contain default data collectors, in the Synchronized Monitors container, although not all the data collectors are enabled by default. In Figure 1, you can see the default Synchronized Monitors. The gray circle icon with the downward-pointing white arrow, which you see over some of the data collectors in the left pane, denotes a disabled data collector. To enable a data collector, right-click the icon and choose All Tasks. On the resulting menu, click Disabled to clear the check box. Let's look closely at one of these default monitors.

Notice that the Synchronization Session Errors data collector contains four thresholds. The first threshold checks the value of the error code that WMI reports. If the value is anything other than 0, a failure has occurred in WMI and Health Monitor considers this error critical. This threshold is standard for any data collector that uses WMI.

Figure 2 shows the properties of the Synchronization Session Errors data collector. Because the data collector uses WMI, its properties show the WMI namespace and class. The collected data applies to Application Center replication sessions and resides in the Application Center namespace. The Properties window shows all the properties that the data collector can monitor, and a selected check box indicates a monitored (i.e., collected) property. This data collector is monitoring four properties (i.e., EventId, ReplicationJobID, Status Message, and Type), one for each threshold, but you can set multiple thresholds for one collected property.

The WMI Query Language (WQL) event query, in the lower half of the dialog box, isolates pertinent data. This query can apply to an intrinsic event, which lets you identify a generic creation, deletion, modification, or operation event with an instance, class, or namespace. If you want to create a more specific query, you can create an extrinsic event, which lets you define an event specific to your needs. In Figure 2, the monitor is configured to respond to the generation of specific events rather than any event, so the monitor uses an extrinsic query. As you can see, the data collector is specifically interested in event IDs 5037 (successful completion of synchronization) and 5043 (successful integration of synchronization changes), and the synchronization's Type value is 1, which indicates a failure. Each threshold in the data collector responds to one of these events. The event IDs reset the state to OK, and the Type changes the state to critical. The configured action at the data-collector level is Email an administrator. No default actions are configured at the data-group level for any monitors in the Application Center Monitors data group.

The Application Center Log Monitors data group contains collectors that retrieve information about Application Center's logging functionality. The Microsoft Data Engine (MSDE) 8.0 stores the logged information, and these data collectors and thresholds ensure that the logging services start and log correctly and monitor the size of the database.
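The event query described above for the Synchronization Session Errors data collector can be sketched as a WQL string built in Python. The property names EventId and Type and the values 5037, 5043, and 1 come from the description above; the event class name is a placeholder, because the real class name appears only in Figure 2, and both helper functions are hypothetical.

```python
def build_event_query(event_class, event_ids, failure_type):
    """Build a WQL event query matching the given event IDs or failure type."""
    clauses = [f"EventId = {event_id}" for event_id in event_ids]
    clauses.append(f"Type = {failure_type}")
    return f"SELECT * FROM {event_class} WHERE " + " OR ".join(clauses)

def state_for_event(event_id, event_type):
    """Map one event to a monitor state, or None if no threshold applies."""
    if event_type == 1:              # Type 1 indicates a failure
        return "critical"
    if event_id in (5037, 5043):     # successful synchronization events
        return "ok"
    return None

# "ExampleReplicationEvent" is a placeholder, not the real Application Center class.
query = build_event_query("ExampleReplicationEvent", [5037, 5043], 1)
```

The second function mirrors the thresholds: the two success events return the monitor to OK, and a Type value of 1 moves it to critical.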

The remaining data collectors in the Application Center Log Monitors data group perform the following functions:

  • check for the successful starting of the cluster service
  • check for failure of Health Monitor or Request Forwarding (i.e., the ability of Application Center clusters to return a client to a previously configured cluster node, if necessary)
  • check whether a server is reporting event ID 4016, which indicates that the server is offline because of the drain process (which permits connected clients to remain online for a period of time to finish a task but refuses new client connections)

The software automatically creates three more data groups: Online/Offline Monitors, System Monitors, and Web Site Monitors.

Online/Offline Monitors. The Online/Offline Monitors data group contains one data collector and one threshold that uses WMI to determine whether the Web service has started. If the Web service hasn't started, the threshold puts the monitor into a critical state. The action for the data collector is to email an administrator. The Online/Offline Monitors data group has additional actions of its own: It takes the server offline when the group enters a critical state and brings the server back online when the state returns to OK. Therefore, any data collector placed in this group with a threshold configured to change the state of the monitor to critical will automatically take the affected server offline.

System Monitors. The System Monitors data group contains three data collectors: LogicalDisk, Memory, and Processor. The LogicalDisk data collector isn't properly configured unless you've enabled logical disk counters. (By default, Windows 2000 enables physical disk counters but not logical disk counters.) To enable logical disk counters, type

diskperf -yv

at a command prompt and reboot the server. The threshold monitors your hard disk's free space and puts the monitor into a warning state if free space dips below 10MB (by default). However, no actions are automatically configured for this warning state. In the event of a critical state, the action at the data-collector level is to email an administrator. The critical event occurs in the event of a WMI failure. No actions are configured at the data-group level.
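For comparison, a free-space check similar in spirit to the LogicalDisk threshold can be sketched in Python with the standard library. This is not how Health Monitor collects the data (it reads performance counters), and the function names are hypothetical; only the 10MB default comes from the text above.

```python
import shutil

TEN_MB = 10 * 1024 * 1024   # default warning level described above

def disk_state(free_bytes, warning_below=TEN_MB):
    """Return "warning" when free space dips below the limit, else "ok"."""
    return "warning" if free_bytes < warning_below else "ok"

def check_disk(path):
    """Collect free space for the volume holding `path` and evaluate it."""
    return disk_state(shutil.disk_usage(path).free)
```

On a current server, a 10MB limit is very low; the `warning_below` parameter lets you pass a more realistic value.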

The Memory data collector uses the Performance Monitor Pages/Second counter. The threshold for this collection is set at 500 pages per second. If the rate exceeds 500 pages per second, the monitor enters a warning state. A critical state results from an error in the data collection; in that event, the action at the data-collector level is to email an administrator.

The Processor data collector uses processor-usage performance counters. The threshold is set at 90 percent, at which point the monitor enters a warning state. Again, a critical state results from an error in the data collection, and the action at the data-collector level is to email an administrator.

Web Site Monitors. The Web Site Monitors group contains one data collector, which uses the HTTP monitor to query URLs and check response times. You can configure security information into the collector if it's necessary to access the site. Three thresholds are associated with this collection. One of the thresholds checks for response times greater than 30 seconds, another checks for the return of an error code higher than 400 (indicating a failure), and the third checks for a failure in the data collection. All these thresholds change the monitor to a critical state. A change to a critical state fires two actions at the data-collector level: Email an administrator and Log to websitefailures.log. No actions are configured at the data-group level.
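The three thresholds can be expressed as one small classification function, shown here in Python with a standard-library HTTP request as the collector. The function names are hypothetical, and the sketch assumes that status code 400 itself counts as a failure, whereas the description above says "higher than 400".

```python
import time
import urllib.error
import urllib.request

def classify(status_code, elapsed_seconds, max_seconds=30.0):
    """Return "critical" when any of the three thresholds is met, else "ok"."""
    if status_code is None:              # the data collection itself failed
        return "critical"
    if status_code >= 400:               # assumption: 400 itself is a failure
        return "critical"
    if elapsed_seconds > max_seconds:    # response slower than 30 seconds
        return "critical"
    return "ok"

def check_url(url, timeout=60.0):
    """Request a URL, time the response, and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as error:
        status = error.code              # server answered with an error status
    except OSError:
        status = None                    # connection failure or timeout
    return classify(status, time.monotonic() - start)
```

Keeping `classify` separate from the network call makes the threshold logic easy to test without a live Web site.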

Create a Custom Monitor
If you performed a custom Application Center installation, Health Monitor can create sample Application Center monitors as well as monitors for other applications and services, as Figure 3 shows. These monitors are helpful if you choose to install a specific application in your Application Center environment. The System Monitors and Web Site Monitors provide functionality in addition to the functionality that the default monitors provide. All the sample monitors are disabled by default.

If the default and sample monitors fail to provide the functionality you require, you can create a custom monitor. To create a custom monitor, follow these steps:

  1. To create a new action, right-click Actions under the server icon in the left-hand pane, then select New. In the resulting dialog box, choose the type of action you need, then supply its configuration (e.g., command-line parameters, email recipient, script name, text file name, event type to be fired).
  2. Decide whether you want the monitor to reside on one server or to synchronize across all the servers in the cluster. Right-click Synchronized Monitors or Non-Synchronized Monitors, and choose New, Data Group from the menu.
  3. Create the data collector by right-clicking the data group and selecting New, Data Collector. The resulting menu lists all possible data-collector types. Select the appropriate data-collector type from this list.
  4. Create the threshold to define the point in data collection at which the monitor changes state, as well as the state it will change to. Right-click the data collector and choose New, Threshold from the menu. Figure 4 shows the resulting Threshold Properties dialog box, in which you configure the threshold. You can access this dialog box later by right-clicking a configured threshold and choosing Properties.
  5. Choose the actions that need to fire when the state changes and at what level (i.e., data collector, data group, or server). A parent inherits changes in state; therefore, a change in the threshold will create a change in the data collector, which will create a change in the data groups, which will create a change in the server. You can create actions at each of these levels.

Delve Deeply
Web sites and Web-based applications are becoming standard components of corporate networks, as well as essential pieces of the Microsoft .NET strategy. Application Center provides impressive creation, management, and deployment capabilities for this environment, but to delve deeply into your Web presence, you need Health Monitor's comprehensive monitoring capabilities. Only then will you be truly confident in the availability and performance of your Web environment.