Use the server health model to monitor your servers’ performance
Monitoring your network involves a lot more than just keeping tabs on the health of your servers. It’s also important to determine whether your websites, applications, network infrastructure, and servers are functioning 24 x 7. Using a single dashboard for network monitoring makes the task easier.
Before you start monitoring, it’s useful to know what you’re looking for in a healthy server. Server performance is typically assessed using four Key Performance Indicators (KPIs): Processor, Memory, Disk, and Network. We can create a health model for a server that incorporates these components, as well as other factors. For example, is a server healthy when it isn’t running? If it’s configured incorrectly? If its security is compromised? The more conditions we add to our definition of a healthy server, the more useful our health model will be in assessing our servers’ health. A server’s health model is sort of like a painting of what a server should look like—we start with a rough sketch of a server, adding details that help the sketch evolve into a full-color painting of the server.
Using the health model approach lets us provide monitoring not only for servers but also for custom applications, websites, network devices, and many other important aspects of a business. In Microsoft System Center Operations Manager 2007 R2, the server health model focuses on four main areas: availability, configuration, performance, and security. Several KPIs directly determine how well a server is performing—including Processor, Memory, Disk, and Network. The Windows Server Operating System Management Pack for Operations Manager 2007 includes the server health model. One of the methods for displaying Operations Manager’s health model is the Health Explorer interface, which Figure 1 shows. This health model is extremely detailed; for the purposes of this article, let’s focus on how Operations Manager integrates the various KPIs.
Typically, a processor bottleneck is defined as more than 80 percent processor utilization for a sustained period of time. Unfortunately, this type of bottleneck occurs relatively frequently and can generate a significant number of alerts that aren't actionable. The Operations Manager monitor (Total CPU Utilization Percentage) takes processor monitoring a step further by alerting only when multiple conditions occur simultaneously. Health states for this monitor are either healthy or critical based on the following conditions:
- Critical state occurs when processor utilization (Processor(_Total)\% Processor Time) is higher than 95 percent for 6 minutes (after three samples on a 2-minute schedule) and when processor queue length (System\Processor Queue Length) is greater than 15 for 4 minutes (after two samples on a 2-minute schedule).
Note that for all monitors discussed in this article, when the monitored value drops back below the levels defined for the critical or warning state, the monitor resets itself to a healthy condition.
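The multi-condition, consecutive-sample behavior described above can be sketched in a few lines of code. This is an illustrative model only, not OpsMgr's actual implementation: each condition trips only after N consecutive samples exceed its threshold, the monitor goes critical only when both conditions hold, and it resets to healthy as soon as either condition clears.

```python
class ConsecutiveThreshold:
    """Trips after `samples` consecutive readings above `threshold`."""
    def __init__(self, threshold, samples):
        self.threshold = threshold
        self.samples = samples
        self.count = 0

    def update(self, value):
        # Consecutive counting: any in-range sample resets the streak.
        self.count = self.count + 1 if value > self.threshold else 0
        return self.count >= self.samples


class CpuMonitor:
    """Critical only when BOTH conditions hold; otherwise healthy."""
    def __init__(self):
        # Defaults described in the article: >95% CPU for three samples
        # on a 2-minute schedule AND queue length >15 for two samples.
        self.cpu = ConsecutiveThreshold(95.0, 3)
        self.queue = ConsecutiveThreshold(15, 2)

    def update(self, cpu_pct, queue_len):
        cpu_hot = self.cpu.update(cpu_pct)
        queue_hot = self.queue.update(queue_len)
        return "critical" if cpu_hot and queue_hot else "healthy"
```

A brief CPU spike therefore never alerts on its own; only a sustained spike combined with a sustained queue backlog does, which is exactly why this design filters out nonactionable noise.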
This approach minimizes the amount of noise (i.e., nonactionable alerts) by providing an alert when the condition is likely to actually represent a bottleneck on a server versus a temporary spike in processor utilization. Although this approach provides a good starting point for most servers, not all servers are created alike. Some servers consistently experience higher processor workloads (e.g., servers running SQL Server or Exchange Server).
Virtualized servers also often experience higher than average processor interrupt levels that require tuning within Operations Manager. This doesn't necessarily indicate that the virtualized guest OS has additional overhead; rather, this particular counter might not be as relevant (or might run at a higher than average value) in a virtualized guest OS.
Operations Manager lets you use overrides to tune alerts to detect different thresholds for different systems or groups of systems. An override changes the default behavior of a rule or monitor for the systems to which the override is applied. For example, suppose you have a computer group that contains all virtual servers. (For information about detecting both VMware and Hyper-V servers, see "Virtual Machine Discovery MP for Operations Manager 2007.") You can target an override at that group to change the thresholds to either a higher or lower level. For processor counters, it's common to create an override that lowers the thresholds for systems on which processor bottlenecks are likely to occur, or that raises them to accommodate a higher than average processor interrupt level.
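The override resolution logic can be sketched as follows. The group names, servers, and data model here are hypothetical illustrations, not the OpsMgr SDK: a threshold override targeted at a group applies to every member of that group, and a matching override replaces the monitor's default value.

```python
DEFAULT_CPU_THRESHOLD = 95.0  # percent; the monitor's default

# Hypothetical computer groups and their members (for illustration only)
groups = {
    "virtual-servers": {"SQL01-VM", "WEB02-VM"},
    "sql-servers": {"SQL01-VM", "SQL03"},
}

# Overrides as (target group, parameter, value); later entries win,
# standing in for override precedence
overrides = [
    ("virtual-servers", "cpu_threshold", 98.0),  # tolerate interrupt load
    ("sql-servers", "cpu_threshold", 90.0),      # catch bottlenecks earlier
]

def effective_threshold(server, parameter, default):
    """Return the last matching override for the server, else the default."""
    value = default
    for group, param, override_value in overrides:
        if param == parameter and server in groups.get(group, set()):
            value = override_value
    return value
```

For example, `effective_threshold("WEB02-VM", "cpu_threshold", DEFAULT_CPU_THRESHOLD)` resolves to the virtual-server override, while a server in no group keeps the default.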
Operations Manager doesn’t limit the health model for the processor to the Total CPU Utilization monitor. Instead, this monitor is supplemented with the Total Percentage Interrupt Time and the Total DPC Time Percentage counters. These counters can also indicate performance bottlenecks on the processor for a server.
The Total Percentage Interrupt Time monitor indicates the total percentage of interrupt time, which seems obvious, but that's the point: logical names are used throughout the health model so that you can easily determine which conditions a monitor evaluates. This monitor's healthy and critical states are defined as follows:
- Critical state occurs when the Total Percentage Interrupt Time monitor shows greater than 10 percent for 10 minutes (after five samples on a 2-minute schedule).
The Total DPC Time Percentage counter determines how much time the processor spends servicing deferred procedure calls (DPCs), work that interrupt handlers queue to run at a lower priority than standard interrupts. This monitor's healthy and critical states are defined as follows:
- Critical state occurs when the Total DPC Time Percentage monitor shows greater than 95 percent for 10 minutes (after five samples on a 2-minute schedule).
Operations Manager retains performance information in the operations database (called OperationsManager). This performance information can be used to create graphs for one or more systems. For example, Figure 2 shows processor utilization for several servers. This view is available in the Operations console. The Operations console reads directly from the OperationsManager database and can show data for up to the default retention period, which is 7 days.
Operations Manager also retains performance information in the data warehouse that can be used to provide trending of performance counters over time. Hourly and daily aggregated information is stored in the data warehouse for 400 days by default. Using the data warehouse lets us create reports that show performance information over a longer period of time than is available in the OperationsManager database.
Using the processor performance view or the processor performance report lets us establish a baseline for what the processor utilization looks like for a server or group of servers. We can then use this baseline to override the processor alerts to notify us when we’ve passed beyond what’s considered normal behavior for a server’s processor.
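Turning a baseline into an override value can be done with a simple calculation. The mean-plus-two-standard-deviations rule below is a common capacity-planning heuristic, not an OpsMgr feature; the samples would come from a performance view or data warehouse report such as those described above.

```python
from statistics import mean, stdev

def suggest_threshold(samples, cap=95.0):
    """Suggest an alert threshold just above observed normal behavior.

    samples: historical counter values (e.g., hourly CPU percentages).
    cap: never suggest a threshold above this ceiling.
    """
    # Two standard deviations above the mean covers most routine
    # variation, so crossings are likely to be genuinely abnormal.
    baseline = mean(samples) + 2 * stdev(samples)
    return min(round(baseline, 1), cap)
```

A server that normally runs between 40 and 60 percent would get a suggested threshold in the mid-60s, while a consistently busy server's suggestion is capped rather than pushed above a sensible ceiling.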
Operations Manager can perform diagnostic tasks, which gather information about what occurred when an object's health state changed, and recovery tasks, which attempt to repair the health state. For the processor, a built-in diagnostic, List Top CPU Consuming Processes, runs automatically when a processor changes from a healthy to a critical state. This information can be useful in determining why a server is experiencing a performance bottleneck. Figure 3 shows an example of this automated diagnostic.
Operations Manager provides a solid health model for what a processor’s health should look like. This model is customizable on a per-object basis and is designed to be actionable. By combining both the processor health model and the ability to report trends in performance counters over a period of time, Operations Manager covers the Processor KPI extremely well.
Memory bottlenecks are generally thought to exist when more than 80 percent of memory is committed on the server. The easiest solution is to simply add memory—but this solution is often not viable and sometimes not really necessary. Operations Manager tracks the percentage of committed and available memory and provides an alert when committed memory exceeds 80 percent (by default).
The Percentage of Committed Memory in Use monitor changes the server’s health state based on the percentage of memory committed on the system. This monitor’s healthy and critical states are defined as follows:
- Critical state occurs when committed memory is greater than 80 percent for 6 minutes (after three samples on a 2-minute schedule).
Operations Manager also monitors the amount of memory still available on a server. The Available Megabytes of Memory monitor changes the server’s health state based on the number of available megabytes of memory on the system. This monitor’s healthy and critical states are defined as follows:
- Critical state occurs when the available megabytes of memory falls below 2.5MB for 6 minutes (after three samples on a 2-minute schedule). By default, this threshold is crossed only if a system is truly critical on memory.

Figure 4 shows performance monitoring for a server that's almost critical on memory but isn't yet close to the 2.5MB default threshold. To make better use of this monitor, create an override that increases the threshold from 2.5MB to a larger value based on the amount of memory installed on the server. According to the TechNet article "System Level Bottlenecks," a consistent value of less than 20 to 25 percent of installed RAM indicates insufficient memory.
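Applying the 20 to 25 percent guidance is a quick calculation. This is an illustrative sketch, not an OpsMgr feature: it turns installed RAM into a suggested override value for the Available Megabytes of Memory monitor.

```python
def available_mb_threshold(installed_mb, fraction=0.20):
    """Suggested override for the Available Megabytes of Memory monitor.

    Per the "System Level Bottlenecks" guidance, consistently less than
    20-25 percent of installed RAM available indicates memory pressure,
    so the threshold should scale with RAM rather than stay at 2.5MB.
    """
    return int(installed_mb * fraction)
```

For an 8GB server, `available_mb_threshold(8192)` suggests an override of roughly 1,638MB, orders of magnitude above the 2.5MB default.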
The amount of paging is another aspect of memory monitoring that Operations Manager tracks as part of the Memory KPI. In the Windows Server 2008 Operating System Management Pack, this rule is called Memory Pages per Second 2008. Because it’s a rule rather than a monitor, this rule doesn’t affect the health model. However, Operations Manager does gather paging information to provide trending and potential bottlenecks for the OS.
The healthy state for these values varies depending on the types of applications that are installed on the server. For example, applications such as SQL Server and Exchange expand to use nearly all available memory on a system. In most environments, the administrator creates an override to set the memory threshold to between 95 and 99 percent for SQL Server and Exchange systems. To accomplish this task, you can use the SQL Server Management Pack's SQL Computers or SQL 2008 Computers groups and the Exchange Server Management Pack's Exchange 2007 Computer Group setting. For SQL Server systems, you can implement a policy to restrict how much memory SQL Server can use; thus, thresholds can be based on an organization's SQL Server memory policy. In general, a threshold of 99 percent indicates a problem because it implies that the OS is most likely being starved of memory by application memory requirements.
Just like the processor performance counters, the percentage of committed memory counter is available both in the operations database and in the data warehouse and can be used to provide trending information and to identify a baseline for normal memory utilization on a server. Figure 5 shows the percentage of committed memory for a server over a period of time. Like the processor health model, the memory monitoring functionality within Operations Manager can be easily customized based on the requirements of an object or group of objects and provides another important piece of the overall health model for a server.
Rather than being associated with too much reading from and writing to the disk, disk bottlenecks are most often associated with how little free disk space remains on the drive. (Of course, Operations Manager measures performance for both of these metrics.)
To measure free disk space, Operations Manager tracks free megabytes and percentage of free space on all the drives on the servers it monitors. Operations Manager uses the Logical Disk Free Space monitor to determine the health of the disks it monitors. Like the processor monitor, two values determine when a drive is critical on free space: percentage of free space and megabytes of free space. The default values vary for drives depending on their function for the system. System drives (volumes that contain hardware-specific files needed to start Windows) have different thresholds than nonsystem drives because in general nonsystem drives are larger in size than system drives.
System drives are defined as healthy or in a warning or error state based on the following conditions:
- Warning state occurs when the percentage of free space is less than 10 percent and the actual free space is less than 200MB.
- Error state occurs when the percentage of free space is less than 5 percent and the actual free space is less than 100MB.
Nonsystem drives are defined as healthy or in a warning or error state based on the following conditions:
- Warning state occurs when the percentage of free space is less than 10 percent and the actual free space is less than 2,000MB.
- Error state occurs when the percentage of free space is less than 5 percent and the actual free space is less than 1,000MB.
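The dual-condition logic above, with its different defaults for system and nonsystem drives, can be sketched as follows. This is an illustrative model of the Logical Disk Free Space monitor's behavior as described in this article, not OpsMgr code: a state changes only when both the percentage condition and the absolute free-space condition are met.

```python
THRESHOLDS = {
    # drive type: (warn_pct, warn_mb, error_pct, error_mb)
    "system": (10, 200, 5, 100),
    "nonsystem": (10, 2000, 5, 1000),
}

def disk_state(free_mb, total_mb, drive_type="nonsystem"):
    """Return 'healthy', 'warning', or 'error' for a logical disk."""
    warn_pct, warn_mb, err_pct, err_mb = THRESHOLDS[drive_type]
    free_pct = 100.0 * free_mb / total_mb
    # Both the percentage AND the absolute condition must hold,
    # so a huge drive at 4 percent free isn't flagged if it still
    # has plenty of absolute space.
    if free_pct < err_pct and free_mb < err_mb:
        return "error"
    if free_pct < warn_pct and free_mb < warn_mb:
        return "warning"
    return "healthy"
```

Requiring both conditions prevents false alarms at either extreme: a small drive can dip below 10 percent while still having workable absolute space relative to its thresholds, and a large drive below 10 percent free may still hold many gigabytes.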
Operations Manager also tracks how much the OS is reading and writing to the disk, as well as current disk queue length. In addition, health monitoring is available for disk transfers (Average Disk Seconds Per Transfer) and fragmentation level of a drive. Monitoring of disk reads (Average Disk Seconds Per Read) and disk writes (Average Disk Seconds Per Write) is disabled by default but can be enabled through an override.
Disk utilization is determined by the Average Disk Seconds Per Transfer monitor. This monitor’s healthy and critical states are defined as follows:
- Critical state occurs when the average disk seconds per transfer is greater than 50 for 5 minutes (after five samples on a 1-minute schedule).
Fragmentation health is determined by the Logical Disk Fragmentation Level monitor. This monitor’s healthy and warning states are defined as follows:
- Warning state occurs when the percentage of file fragmentation is greater than 10 percent. (By default, this monitor checks health state once a week, at 3:00 a.m. on Saturday.)
The Logical Disk Fragmentation Level monitor also includes a recovery task called Logical Disk Defragmentation, which is disabled by default. This task can automatically run a defragmentation if the drive exceeds the threshold defined for the monitor. (For more information about this monitor, see “OpsMgr ReSearch This KB – Logical Disk Fragmentation Level is High”)
Operations Manager also checks the availability of logical disks on a system every 5 minutes through the Logical Disk Availability monitor. This monitor provides an alert if a drive disappears or becomes inaccessible to a server that Operations Manager monitors. In most cases, this monitor functions well as designed. However, I ran into a situation in which a volume was mounted and dismounted on a scheduled basis on a server; in that case, I had to create an override to disable the monitor for that drive.
The Disk KPI is fully covered by the Operations Manager health model: free space, availability, the amount of data being transferred, and even the fragmentation level of the drive are all monitored and factored into health state.
Operations Manager tracks performance information for network adapters through the following three counters:
- Bytes Received/sec
- Bytes Sent/sec
- Bytes Total/sec
These counters are tracked by default. You can display them through the Operations Manager console or reports. Figure 6 shows the network adapter performance counters. These performance counters don’t have associated monitors, so they don’t affect the network adapter health model.
Operations Manager provides a monitor called Network Adapter Connection Health that can change the health state of a network adapter if it’s removed from a server. This monitor is disabled by default (probably to minimize nonactionable alerts generated by servers in an environment with multiple disconnected network adapters); it can be enabled to provide availability information as part of the Operations Manager health model for servers.
Operations Manager includes a reporting function based on SQL Server Reporting Services (SSRS). Various Operations Manager management packs provide built-in reports that use the SSRS functionality. Figure 7 shows a custom report that was created in only a few minutes using processes discussed in "Creating Useful Custom Reports in OpsMgr: Gathering Custom Performance Counters" (blogs.catapultsystems.com/cfuller/archive/2010/07/21/creating-useful-custom-reports-in-opsmgr-gathering-custom-performance-counters.aspx). This report shows a single server with three counters: percent processor time, percent committed bytes, and percent logical disk free space on the C drive.
With the release of Operations Manager 2007 R2, Operations Manager can monitor UNIX and Linux systems through an open-source agent deployed to those systems. Figure 8 shows how Operations Manager integrates the components from a UNIX system into the health model, much as it does for a Windows server. Note the Processor, Memory, and Disk KPIs.
For more information about Operations Manager 2007 R2’s cross-platform integration, see System Center Operations Manager (OpsMgr) 2007 R2 Unleashed (Sams, 2010). For information about cross-platform processor monitoring support, see “Understanding CPU Performance Counters on Cross Platform Monitors”.
The Big Picture
Operations Manager’s health model lets us take basic concepts that have historically determined servers’ health, such as “Is the processor running at more than 80 percent?” and expand them into a more comprehensive and customizable model to better evaluate servers’ health. The health model maps out how well servers are performing from an OS level, which is a cornerstone for server health. A server’s OS health model, combined with an application’s health model and other models, creates a much better picture of the server and therefore the environment as a whole.
Operations Manager uses health models that range from the lowest level of an object, such as a disk or processor, to far more complicated structures, such as a distributed application like Exchange or Active Directory (AD). A good analogy is to think of health models as building blocks; Operations Manager uses these blocks to build larger structures, such as a custom-built geographically dispersed application.
Distributed applications in Operations Manager use health models to perform the same tasks as for a server but on a larger scale. These distributed applications can then be incorporated into a dashboard solution that lets you simultaneously monitor your websites, applications, network infrastructure, and servers.