Use the ESM console to monitor your servers

In "Managing Exchange 2000, Part 1," February 2001, I reviewed how Microsoft Exchange 2000 Server has embraced the Microsoft Management Console (MMC) framework, and I described the interaction between Exchange 2000 and Windows 2000's Active Directory (AD). I also examined the division of server-management tasks, which Exchange 2000 performs through the Exchange System Manager (ESM) console, and user-related management tasks, which Exchange 2000 performs through extensions to the standard MMC Active Directory Users and Computers console.

All this knowledge lays a solid foundation for operating Exchange 2000 servers and introduces you to your next task: keeping all your Exchange Server machines running. In this installment, I explain how to monitor servers and what type of status information you can expect to retrieve from the components that represent a healthy server. I also introduce some of the Windows Management Instrumentation (WMI) providers that enable access to Exchange 2000 data.

Back to Basics
Exchange Server has always provided a mechanism to monitor servers. Exchange Server 5.5 implements a primitive form of the Ping command in server and link monitors to determine whether servers are currently reachable on the network; this pinging also informs you whether the set of services (e.g., the Information Store—IS, Message Transfer Agent—MTA) that constitute a fully running Exchange Server system are active. Conceptually, this procedure hasn't changed in Exchange 2000—you can still monitor and retrieve status information from servers. The difference lies in how you go about monitoring servers, the data you can retrieve from servers, and the interfaces that Exchange 2000 uses. (Exchange 2000 supports a full set of documented interfaces that you can exploit to better integrate Exchange Server into an overall systems-management strategy. This point hasn't been lost on third-party developers, who are upgrading their products to take advantage of the new interfaces.)

The ESM console's Monitoring and Status node replaces the server and link monitors that previous Exchange Server versions support. This node lets you set up email or script notifications for events that have occurred on monitored servers, as well as display the status of the servers and connectors within a routing group.

Figure 1, page 106, shows the ESM console, which is the basic working environment you use to manage and monitor Exchange 2000 servers. You can select the Notifications node to expose the set of notifications that operate from a server; you can select the Status node to reveal information about the servers and connectors in the server's routing group.

Monitoring Servers
Exchange Server 5.5 uses server monitors to periodically confirm that services are running as expected on a target server. Exchange 2000 implements the same basic principles but now refers to server monitors as notifications. You establish a set of conditions that you want to monitor on a server and assign a server to monitor those conditions. Exchange 2000 uses remote procedure calls (RPCs) to retrieve information from a server, so you can monitor a server only when the network connection between the two servers supports RPCs. This requirement is a design consideration when you decide which server to use as a monitoring base.

Exchange 2000 supports both email notifications and script notifications. An email notification results in Exchange 2000 sending an email message to a predefined set of email addresses. A script notification invokes an executable or Windows Script Host (WSH) script to perform required operations. Clearly, a script notification, which can run several programs and other operations in sequence should an error condition arise, is much more powerful than an email notification. As the Monitored Items column in Figure 2, page 106, shows, the QEMEA-ES1 server is monitoring itself (i.e., This server) and two other servers (i.e., QEMEA-DC1 and QEMEA-ES0). If a specified condition, such as disk space dropping under a predefined threshold, occurs on any monitored server, the monitor will send an email message. (A systems failure can stop Exchange Server dead, so never rely completely on the arrival of an automated notification message.)

You can define monitoring parameters for a server in two ways. First, you can expand the Administrative Groups folder, select a server from the list, open the Properties sheet, and go to the Monitoring tab. Alternatively, you can open the ESM's Status node, select a server from the right pane, and open the server's Properties sheet. Either method opens a Properties dialog box similar to the one that Figure 3 shows. (The disks that you can monitor for the Free space threshold condition differ from server to server. This condition is important because services such as the IS or MTA will stop when a disk that they use runs out of free space.) Table 1, page 108, lists the conditions that you can monitor.
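A free-space condition of this kind comes down to comparing a measured value against one or more limits. The Python sketch below illustrates that comparison only. The function names, the Warning and Critical labels, and the threshold values are assumptions made for the example, and the code isn't a description of how Exchange 2000 implements its monitors.

```python
import shutil


def free_space_state(free_mb, warning_mb, critical_mb):
    """Classify free disk space against two limits (labels are hypothetical)."""
    if critical_mb > warning_mb:
        raise ValueError("critical threshold must not exceed warning threshold")
    if free_mb <= critical_mb:
        return "Critical"
    if free_mb <= warning_mb:
        return "Warning"
    return "OK"


def free_mb_on(path):
    """Return the free space on the volume that holds `path`, in whole megabytes."""
    return shutil.disk_usage(path).free // (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical limits: warn below 500MB free, go critical below 100MB.
    print(free_space_state(free_mb_on("."), warning_mb=500, critical_mb=100))
```

Keeping the classification separate from the measurement means you could reuse the same comparison for other monitored values, such as queue growth or CPU utilization.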

The Disable all monitoring of this server check box permits or prevents server monitoring. The check box is cleared by default, so servers automatically publish monitoring data through WMI. Don't select this check box unless you have a good reason for not monitoring the server.

After you establish monitoring conditions, Exchange 2000 will fire notifications if the server meets the specified threshold condition. Figure 4, page 109, shows how the ESM defines parameters for an email notification. The Servers and connectors to monitor drop-down list includes options such as This server, All servers, Any server in the Routing Group, and Any connector in the Routing Group. The Customize button lets you select a specific set of servers from a dialog box that lists all the known Exchange Server machines in the organization.

The ESM validates the To and Cc fields against AD. You can send alerts to users, contacts, or groups. If you want to send the alert to a special address, such as a Short Message Service (SMS) or Wireless Application Protocol (WAP)-enabled cell phone, you need to create a contact and specify the email address to generate the SMS or WAP message. (Typically, you can send messages through SMTP, so this requirement isn't a big problem.)

You can edit the subject and content of the message as long as you're careful not to change the predefined fields that Exchange Server will insert values into. TargetInstance references the name of the server being monitored and, as Figure 4 shows, has several properties that Exchange Server uses for reporting. These properties include QueuesStateString, which Exchange Server uses to insert the current status of both the SMTP Routing Engine and the X.400 MTA message queues.
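To see why the predefined fields must survive your edits, consider how a template of this kind gets filled in. The Python sketch below is an illustration only: the %TargetInstance.Property% placeholder syntax, the render function, and the sample values are assumptions, so check the text that the ESM actually displays for the exact field format.

```python
import re

# Assumed placeholder form: %TargetInstance.<PropertyName>%
PLACEHOLDER = re.compile(r"%TargetInstance\.(\w+)%")


def render(template, properties):
    """Replace each known placeholder with its value; leave unknown ones untouched."""
    def substitute(match):
        name = match.group(1)
        # Fall back to the original placeholder text when no value is available.
        return str(properties.get(name, match.group(0)))

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    # Hypothetical template text and property values.
    template = "Queues on monitored server: %TargetInstance.QueuesStateString%"
    print(render(template, {"QueuesStateString": "Warning"}))
```

A placeholder that has been mistyped during editing no longer matches any property, so it would pass through to the message as literal text instead of a status value.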

Figure 5, page 109, shows an example of the message that Exchange 2000 will send to the nominated email addresses if it detects one or more exceeded thresholds. The details are limited but certainly enough to alert you to take action, if only to gather additional information about the problem. In this case, the message shows that the monitored services are running but that a backlog of messages has built up on one or more queues and that the report thresholds for Drives (i.e., disk space), Memory, and CPU (i.e., percentage of CPU utilization) are in an Unknown state. At first glance, the primary problems seem to be with the system's resources; exceeding the CPU threshold or a lack of memory might be the root cause of Exchange Server's inability to clear the message queues. After I connected to and examined the server, everything seemed to be in order and the queue had cleared—which proves that you can expect an occasional false alarm. Better that the alarm sounds than you experience a system failure.

Figure 6 shows an example of the message you'll see if Exchange Server detects an unavailable connector. The importance of a good naming convention for connectors is evident. A large organization can operate many connectors of all types, and if you can't determine what type of connector is having a problem and what purpose the connector serves, the person who receives the notification won't be able to take fast and effective action. In this example, I know that the problem is a Routing Group Connector (RGC) connecting the Hub routing group to the France routing group. RGCs are unidirectional, so the problem most likely lies with the originating routing group, and the first step in problem resolution is to examine that group's bridgehead servers.

Viewing Server Status
Think of Status node connector information as roughly equivalent to a snapshot of the routing group's Link State Table, which the SMTP Routing Engine maintains in memory and which typically isn't visible to an administrator. (You can use the Winroute utility, in the \support\utils\i386 directory of the Exchange 2000 CD-ROM, to gain a more comprehensive view of the Link State Table.) Exchange 2000 uses a mechanism called Link State Routing as the basis for routing decisions for messages. This mechanism replaces the more static view of available routes that the Exchange Server 5.5 Gateway Address Routing Table (GWART) implements. Link State Routing uses a simplified form of Dijkstra's algorithm to ensure that messages follow the optimal path to their destination. The SMTP Routing Engine bases the decisions it makes to find that path on the data in the Link State Table, which Exchange 2000 updates dynamically as the underlying network changes or as you add new connections. For example, if a network link becomes inoperative and prevents an SMTP connector from sending messages to a specific SMTP domain, the routing group that discovers the failure sends link state messages reporting the failure to all the other routing groups in the Exchange Server organization. Then, the routing master in each group generates new routing data.
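The route recalculation that follows a link failure can be pictured with a textbook version of Dijkstra's algorithm. The Python sketch below is a generic illustration, not the Routing Engine's own simplified implementation. The Germany routing group, the connector costs, and the cheapest_route function are hypothetical, and each connector is modeled as a one-way link.

```python
import heapq


def cheapest_route(links, source, target):
    """Find the lowest-cost path over the connectors that are currently up.

    `links` maps (from_group, to_group) to a cost for each one-way connector.
    Returns (total_cost, [groups on the path]) or None when no route exists.
    """
    graph = {}
    for (src, dst), cost in links.items():
        graph.setdefault(src, []).append((dst, cost))

    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, group, path = heapq.heappop(queue)
        if group == target:
            return cost, path
        if group in visited:
            continue
        visited.add(group)
        for neighbor, link_cost in graph.get(group, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


if __name__ == "__main__":
    # Hypothetical link state data: every available connector and its cost.
    links = {
        ("Hub", "France"): 10,
        ("France", "Hub"): 10,
        ("Hub", "Germany"): 10,
        ("Germany", "France"): 5,
    }
    print(cheapest_route(links, "Hub", "France"))

    # A link state update reports the Hub-to-France connector as down.
    del links[("Hub", "France")]
    print(cheapest_route(links, "Hub", "France"))
```

Removing the failed connector from the table and rerunning the calculation is the essence of the update: the direct route disappears, and the next-cheapest path through the remaining connectors takes its place.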

The ESM retrieves status information from a specific server that acts as your gateway to the monitoring environment. Microsoft designed Exchange 2000 to work in a distributed, networked environment in which servers might not always be connected, so the ESM's view is only 100 percent accurate for the server that's providing the monitoring data. The ESM fetches configuration data about the Exchange Server organization from AD on a specific Global Catalog (GC) server. Therefore, you might see different views of your organization when you connect to different GCs—a good indication that AD replication isn't working as it should or at all—and that you need to pay attention to how replication data flows between domain controllers (DCs) and GCs.

When you connect to a server, its name appears at the top of the ESM's Status pane (as Figure 1 shows). The ESM shows all the connectors available to the administrative group to which that server belongs, as well as the member servers of that administrative group. However, some common exceptions (which initially confused me) exist. For example, when an Exchange 2000 server connects to the rest of the organization through a connector that an Exchange Server 5.5 machine hosts, ESM doesn't show that connector. This situation occurs in mixed-mode organizations. However, as you phase out Exchange Server 5.5 machines, the Exchange 2000 servers take over responsibility for hosting connectors and the problem goes away. Also, you might see connectors that don't belong to the administrative group when you use those connectors for public-folder referrals (the new term for what Exchange Server 5.5 called "public folder affinity"). A referral simply means that Exchange Server is routing a request to access public-folder content across a connector to a server that holds a replica that contains the desired information. A routing group can inherit the ability to use connectors from an intermediate connection, which leads to the unexpected appearance of these connectors. Although the referral mechanism is a little more complicated than Exchange Server 5.5's mechanism, the Winroute tool provides an excellent insight into the process.

If you have Exchange Administrator or Exchange View-Only Administrator permissions on another server, you can right-click the Status node and select the Connect to option to connect to that server. Although the ESM refreshes the status information at frequent intervals, the data is largely static. To manually refresh the information, press F5 or select Refresh from the Monitoring node's context-sensitive menu. This action forces the ESM to query the Link State Table and the connectors' configuration information, which also resides in memory.

Status Problems
Let's look at a situation in which a problem might have occurred. In the Status pane in Figure 1, the QEMEA-DC1 server displays a status of Unknown. This status doesn't mean that the server is down but rather that the ESM failed to establish a connection with QEMEA-DC1 during the ESM's most recent attempt to get a message to the server. The network connection might permit a ping to the target server, and all the Exchange Server services might be running, but the ESM simply wasn't able to transmit a message. The explanation might be as simple as a temporarily saturated network, in which case the server's status will shortly return to Available.

A server status of Unreachable, however, announces a more severe problem: The ESM hasn't been able to contact the server at all and can't verify whether Exchange Server is running. Figure 7 shows that QEMEA-DC1 is Unreachable and that the problem has affected the other servers' status. Servers QEMEA-ES0 and QEMEA-ES1 have reportedly reached a critical queue length, and an RGC is unavailable because QEMEA-ES0 is the local bridgehead. Interestingly, the peer RGC is still available because the France routing group can still send messages to a server in the local routing group.

When you see a server in an Unknown or Unreachable state but you know that a network link is available and that the server appears to be running normally, the source of the problem might be one of the following factors.

Problems with the System Attendant service. The System Attendant service is the Exchange 2000 component responsible for administrative activities. None of the other services can start without the System Attendant, and all other services provide information to the System Attendant. The service also takes care of background processing such as executing Lightweight Directory Access Protocol (LDAP) queries to build address lists and synchronizing with the Microsoft IIS metabase. Exchange Server can't run when the System Attendant is inoperative.

Problems with the Microsoft Exchange Routing Engine service. When the Routing Engine isn't running, status information about connectors and queues is unavailable. Failure of this service also stops any messages from flowing in or out of the server, so the service is crucial to a messaging system.

Problems with the WMI service. WMI, which is new to Win2K, is a standard mechanism for applications and system services to report status information through providers (i.e., software components that collect application data and report it in a well-defined manner). Monitoring applications can then collect and report data from multiple sources. For example, the ExchangeRoutingTableProvider collects information about connector status from the Routing Engine's Link State Table, whereas the ExchangeClusterProvider collects the server status. The ESM also uses the Exchange Server WMI providers elsewhere. For example, the ESM gathers information about queued messages from the ExchangeQueue provider.

By default, all Win2K servers support Win2K Server Terminal Services, so the easiest way to confirm that these services are running is to connect to the target server and start the Services console under Programs, Administrative Tools, Services.

Problems with the monitoring parameters. Monitoring parameters might be causing the ESM to flag too many error conditions. To get more information about the server that's reporting a problem, select and right-click the server from the Status pane and select Properties. The resulting dialog box reveals the server's monitoring parameters, which Exchange 2000 stores as part of the Exchange Server configuration information in AD.

I've also discovered that clustered Exchange 2000 servers seem to regularly generate spurious notifications alleging that many of the monitored thresholds have been exceeded. (Not many Exchange 2000 clusters are yet in production, and I'm sure Microsoft will track down the underlying bugs that cause these messages.)

Still the Same
Exchange 2000 improves Exchange Server's previous monitoring capabilities and incorporates support for new components such as the SMTP Routing Engine. But the most exciting news isn't obvious when you work with the ESM. I briefly touched on the purpose of WMI providers. A set of well-documented interfaces accompanies these providers, and the combination paves the way for third-party monitoring tools to access and use Exchange Server data, and for you to write simple routines to collect that data and use it as you see fit. In the third article in this series, I'll examine WMI providers, programmable Collaboration Data Objects for Exchange Management (CDOEXM), and how you can manipulate these interfaces to retrieve custom information.