Improve your management of MSCS clusters

To avoid server downtime, many IT shops are turning to Microsoft Cluster Services (MSCS) to meet high-availability needs. MSCS's purpose is simple: Maximize the availability and scalability of your mission-critical applications through fault-tolerant groupings of servers. (For more information about MSCS, see Richard Lee, "MSCS Update," June 1998.) If the initial deployment goes well, you might decide to add clusters across your enterprise. As your clustered servers multiply, you'll begin to reach the limitations of Cluster Administrator, the tool that Microsoft provides to manage and monitor MSCS clusters.

VERITAS Software's ClusterX 3.0.1 for MSCS lets you manage multiple clusters more easily than Cluster Administrator does. ClusterX also offers a centralized UI, management utilities, reporting capabilities, and functionality that extends MSCS, such as group balancing and load balancing.

ClusterX Architecture
ClusterX's client-server architecture includes a GUI client that communicates with server-side agents residing on each of your MSCS cluster nodes. The ClusterX client provides the UI; VERITAS' licensing lets you install as many clients as you need to administer your enterprise. The ClusterX client runs on any flavor of Windows 2000 or Windows NT 4.0 (Service Pack 5—SP5—or later), and you can install the client directly to an MSCS node. MSCS nodes and ClusterX clients use remote procedure calls (RPCs) to communicate over named pipes; this process performs well on a LAN but is noticeably slower on low-speed WAN links. In addition, using the ClusterX client to access MSCS nodes behind a firewall requires you to open the necessary ports to support RPCs, a process that doesn't fit all security models.

After you set up and start the ClusterX client, you'll notice that the UI is more flexible and provides more information than Cluster Administrator. One of the client's strongest features is an interface that lets you simultaneously view multiple clusters, even clusters in different domains. You can also run the client as a Microsoft Management Console (MMC) 1.1 or later snap-in.

The ClusterX node service provides ClusterX's server-side functionality. You deploy the node service from the ClusterX client to clustered nodes throughout your enterprise. After the node service is installed, it sends information that ClusterX clients use for notification, diagnostics, reporting, and logging. The node service can also extend MSCS cluster functionality by letting you set policies for group balancing and load balancing. The node service uses SNMP, which allows integration with management frameworks such as Hewlett-Packard's (HP's) OpenView ManageX, Tivoli Systems' TME, and Computer Associates' (CA's) Unicenter TNG.

Testing ClusterX
I tested ClusterX in a small network consisting of two 2-node MSCS clusters. One cluster ran on two HP NetServer LT 6000r servers, each with six Intel Pentium III Xeon 550MHz processors and 1GB of RAM. These servers attached through an HP Fibre Channel controller to a NetServer rack storage unit that housed twelve 9GB SCSI drives and acted as the shared storage for both cluster nodes. The other cluster ran on two Dell PowerEdge 1300 servers, each with one 450MHz processor and 128MB of RAM. The Dell servers used Adaptec AHA-2940 SCSI host adapters to attach to two external 9GB SCSI drives. All servers ran Win2K Advanced Server.

ClusterX came on one CD-ROM and provided documentation in PDF format. The 325-page User Guide and the 65-page Getting Started Guide were well written and provided adequate detail for me to set up ClusterX.

I used a setup wizard to install the ClusterX client on a Digital PC 3000 desktop PC running Win2K Server with a 300MHz Pentium II processor and 128MB of RAM. The client's base hardware requirements are a 233MHz Pentium processor, at least 20MB of hard disk space, and 64MB of RAM. Your system will need more memory for managing additional clusters; during testing with only two clusters, the client needed 22MB of memory.

After I installed the client, it scanned the network for clusters and prompted me to install the node service on the cluster nodes it discovered. The client performed well but experienced long pauses while the screens refreshed or it polled clusters for new information. (VERITAS says that a 90- to 150-second pause during startup is typical.) By default, ClusterX prompts you to install the node service in locations where the node service isn't installed, but you can configure the client to silence that prompt. I installed the node service on the four nodes in my test network, and I installed the MSCS SNMP extension agent as well. To complete node-service installation, I provided a domain account and password with administrative privileges and logon-as-service rights. To monitor installation progress, I clicked the ClusterX client's Command Execution tab and watched as each node-service installation command completed.

ClusterX Client GUI
The ClusterX client's GUI, which Figure 1 shows, consists of two panes. The left pane (i.e., Cluster View) displays a hierarchical list of available domains, clusters, nodes, and groups. The right pane (i.e., Results View) has nine tabbed views and displays information according to which tab you've selected and the object you've highlighted in Cluster View. In Results View, you can stack multiple tabbed views. This stacking feature let me see Cluster Status and Audit Log (i.e., cluster service-related application event-log entries) at the same time.

Cluster View's intelligent hierarchy of cluster elements improves on the views that Cluster Administrator provides. Cluster View shows you active cluster groups, their resources, their status, and the node to which groups and resources belong. For example, as Figure 1 shows, ClusterX uses red down arrows in Cluster View and Results View to signal that the SQL7 group is the TIERRA cluster's failed component and that certain resources in that group are online pending, creating the failed condition. If you right-click the troublesome resource and choose Failure Analysis from the Advanced Commands menu, a wizard will try to diagnose and repair the problem. The wizard executes a logical sequence of checks and verifies dependencies, registry entries, and other critical resource properties. When the wizard finishes, it prompts you to repair problems that it found. I tested Failure Analysis on a clustered share that was dependent on a failed physical disk resource. I let the wizard resolve the problem, and it succeeded in bringing the failed physical disk dependency online.

Results View's tabs provide a variety of useful information. The Cluster Status tab shows pertinent cluster information, such as cluster status, CPU load, uptime statistics, and the ClusterX node-service version installed on each node.

The Hardware Status and Application Status tabs use graphical images and icons to represent clustered hardware and applications; these tabs mark failed components in red. Hardware Status images are oversized: Even with only two clusters, I had to scroll down to see both. When you need to view many clusters simultaneously, the oversized images are cumbersome. Application Status presents an aggregate view of all clustered applications in your enterprise and their status.

The Configuration tab displays the configuration of clusters and the components that make up clusters. When you select an object in Cluster View, the Configuration tab lists the components that make up that object. For example, if you select a group object in Cluster View and click the Configuration tab, you will see a list of resources that belong to that group.

The Dependencies tab, which Figure 2 shows, displays a helpful view of your clustered applications. ClusterX simplifies resource-dependency views by presenting dependencies in a graphical tree. Clicking any object highlights its dependency paths; in Figure 2, I selected Microsoft Exchange Internet Mail Service to highlight paths to the resources on which it depends. This feature is helpful for troubleshooting because you can quickly trace dependency paths to isolate failed resources. You can also drag objects within the view to establish and reassign their dependencies.
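
To make the idea of tracing dependency paths concrete, here is a minimal Python sketch. It is not ClusterX code and doesn't call any ClusterX or MSCS API. The resource names and the dependency edges in DEPENDS_ON are hypothetical and only loosely resemble a clustered Exchange group. The sketch lists every path from a selected resource down to the resources it depends on, then reports which failed resources have no failed dependencies of their own, because those are the likely places to start troubleshooting.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each resource lists the resources it depends on.
# Names and edges are illustrative only, not read from a real cluster.
DEPENDS_ON: Dict[str, List[str]] = {
    "Internet Mail Service": ["Information Store"],
    "Information Store": ["System Attendant"],
    "System Attendant": ["Network Name", "Physical Disk"],
    "Network Name": ["IP Address"],
    "IP Address": [],
    "Physical Disk": [],
}


def dependency_paths(resource: str, depends_on: Dict[str, List[str]]) -> List[List[str]]:
    """Return every path from `resource` down to a resource with no dependencies."""
    children = depends_on.get(resource, [])
    if not children:
        return [[resource]]
    paths = []
    for child in children:
        for tail in dependency_paths(child, depends_on):
            paths.append([resource] + tail)
    return paths


def transitive_dependencies(resource: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return everything `resource` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(depends_on.get(resource, []))
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(depends_on.get(name, []))
    return seen


def root_failures(resource: str, depends_on: Dict[str, List[str]], failed: Set[str]) -> Set[str]:
    """Return failed resources at or below `resource` that have no failed dependencies."""
    candidates = (transitive_dependencies(resource, depends_on) | {resource}) & failed
    return {
        name for name in candidates
        if not (transitive_dependencies(name, depends_on) & failed)
    }


if __name__ == "__main__":
    for path in dependency_paths("Internet Mail Service", DEPENDS_ON):
        print(" -> ".join(path))
    failed = {"Internet Mail Service", "Information Store", "System Attendant", "Physical Disk"}
    print(root_failures("Internet Mail Service", DEPENDS_ON, failed))
```

In this made-up example, four resources are marked as failed, but only the physical disk has no failed dependency beneath it, which mirrors the kind of isolation that the graphical tree lets you do by eye.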

The Command Execution tab posts commands that you execute from the ClusterX client. The tab displays a history of executed commands and commands that are awaiting execution. You can use the tab to view the progress and ultimate success or failure of active commands, as well as error messages they cause. I began to rely on the Command Execution view early in my ClusterX test because the GUI didn't always provide feedback on the results of my commands. For example, I tried to delete a resource from within the GUI, but the resource remained in place. The GUI offered no information about why my command didn't work. When I checked the Command Execution view, I found that the delete command had produced an error message stating that the deletion couldn't occur because the resource was online.

The Audit Log tab is a consolidated list of cluster service-specific application event-log entries that ClusterX collects from all agent nodes. You can filter log entries according to clusters, nodes, and events.

The Report tab offers you 16 preconfigured reports, such as uptime statistics, clustered applications, group and resource information, allocated IP addresses, and shared disk usage (shared disk usage requires that you activate Performance Monitor disk counters on your nodes). The Report tab also includes diagnostic reports, such as the cluster consistency check, which tries to identify configuration inconsistencies in clusters. Each report generated HTML-based output that I could save as a file. Unfortunately, ClusterX can't schedule reports and automatically save the output to a directory.

Group Balancing and Load Balancing
The Utility tab lets you access some ClusterX features that extend MSCS's functionality. ClusterX utilities help with cluster configuration backup scheduling, restores, and clustered printer creation. Additional utilities let you perform group balancing and load balancing on two-node clusters. Group balancing lets you define primary and secondary cluster groups for priority-based load balancing. Primary groups get highest priority for resources. To make all system resources available for the primary group during a failover, secondary groups go offline until the cluster is back online. To set up group balancing, you drag cluster groups into primary or secondary group containers for each cluster node. After you assign a cluster group to a node, ClusterX downloads to that node a script that executes the prioritized failover scheme that you configured. Figure 3 shows the Utility tab and details of the Group Balancing Configuration.
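
The prioritized failover scheme can be sketched in a few lines of Python. This is an illustration of the policy as described above, not the script that ClusterX downloads to a node. The group names, node names, and the plan_failover function are hypothetical. The sketch assumes a two-node cluster in which one node has failed: primary groups end up online on the surviving node, and secondary groups are taken offline so that they don't compete for resources.

```python
from typing import Dict, Tuple

PRIMARY = "primary"
SECONDARY = "secondary"


def plan_failover(groups: Dict[str, Tuple[str, str]], survivor: str) -> Dict[str, str]:
    """Plan what happens to each group when only `survivor` is still running.

    `groups` maps a group name to (current owner node, priority).
    The returned plan maps each group name to one of:
      "keep"    - primary group already on the survivor, stays online
      "move"    - primary group on the failed node, comes online on the survivor
      "offline" - secondary group, taken offline until both nodes are back
    """
    plan: Dict[str, str] = {}
    for name, (owner, priority) in groups.items():
        if priority == PRIMARY:
            plan[name] = "keep" if owner == survivor else "move"
        else:
            plan[name] = "offline"
    return plan


if __name__ == "__main__":
    # Hypothetical two-node cluster: NODE1 has failed, NODE2 survives.
    groups = {
        "SQL": ("NODE1", PRIMARY),
        "Exchange": ("NODE2", PRIMARY),
        "Print Spooler": ("NODE1", SECONDARY),
        "File Share": ("NODE2", SECONDARY),
    }
    for name, action in plan_failover(groups, survivor="NODE2").items():
        print(f"{name}: {action}")
```

Whether secondary groups on the surviving node also go offline is an assumption in this sketch; the behavior in a real deployment depends on how you fill the primary and secondary containers for each node.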

Cluster load balancing's concept is simple: You set high and low CPU thresholds for each node in a cluster. If one node breaches the high threshold and the other node falls below the low threshold, the cluster group that you specify when you configure group balancing will move from the busy node to the free node. You can also specify how often ClusterX checks whether nodes are operating within the CPU thresholds. The configuration and deployment of ClusterX's load-balancing component took me a couple of minutes to accomplish. After installing the component, I generated a heavy load on one cluster node. After the node service finished the CPU sampling cycle (I specified four samples at 60-second intervals), the node service moved my specified cluster group to the other node.
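
The threshold rule is easy to express in code. The following Python sketch is hypothetical and isn't how the node service is implemented. In particular, the review doesn't say how ClusterX combines the CPU samples it collects, so the sketch assumes a simple average over the sampling cycle. The function name decide_move and the node labels are made up for illustration.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def decide_move(
    samples_a: Sequence[float],
    samples_b: Sequence[float],
    high: float,
    low: float,
) -> Optional[Tuple[str, str]]:
    """Decide whether the configured group should move between nodes A and B.

    Each samples_* sequence holds CPU-utilization percentages from one
    sampling cycle (for example, four samples taken 60 seconds apart).
    Assumption: the samples are averaged before they are compared with
    the thresholds.

    Returns (busy_node, free_node) if a move is warranted, otherwise None.
    """
    avg_a = mean(samples_a)
    avg_b = mean(samples_b)
    if avg_a > high and avg_b < low:
        return ("A", "B")
    if avg_b > high and avg_a < low:
        return ("B", "A")
    # Both conditions must hold at once; a busy node alone isn't enough.
    return None


if __name__ == "__main__":
    # Node A is under heavy load and node B is nearly idle.
    print(decide_move([95, 90, 92, 97], [10, 5, 8, 12], high=80, low=20))
    # Node A is busy, but node B isn't idle enough to take the group.
    print(decide_move([95, 90, 92, 97], [50, 45, 55, 60], high=80, low=20))
```

Requiring both conditions at once keeps the group from bouncing to a node that is already moderately loaded, which is the point of setting a low threshold as well as a high one.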

Occasional Glitches
My experience with ClusterX was positive, although I found occasional glitches in the product. The ClusterX GUI was often slow to refresh screens. Every time I launched ClusterX, I waited an average of 2 minutes for the GUI to gather and display cluster information, so I wonder how the GUI performs in large environments that have dozens of clusters. The ClusterX node service didn't start reliably on one of my clusters, and incorrect error messages occasionally appeared when I created a resource. However, the reliability of many other ClusterX features overshadowed these problems.

ClusterX's value to you will depend on your environment. The native Cluster Administrator has limitations, but it's functional for smaller environments and it's free. Thus, organizations with fewer than five MSCS clusters might not justify spending several thousand dollars to purchase ClusterX. However, larger organizations could easily justify ClusterX's cost because the product can reduce the administrative overhead of managing and monitoring multiple clusters. In addition, ClusterX's advanced reporting capabilities and load-balancing features can help organizations meet the obligations of their service level agreements (SLAs). ClusterX's attractiveness depends on whether your current method for administering MSCS clusters is manageable or out of control.

ClusterX 3.0.1 for MSCS
Contact: VERITAS Software * 407-531-7501 or 800-327-2232
Price: $5000 base package includes support for multiple consoles and two cluster nodes; support for additional nodes is available
Decision Summary:
Pros: Provides a unified interface for administering multiple clusters; adds functionality, such as load balancing and group balancing, to Microsoft Cluster Services; creates HTML-based reports for diagnostic, trending, and status information; displays easy-to-read graphical representation of resource dependencies
Cons: GUI can be sluggish and slow to refresh; Hardware Status view is impractical to use for viewing several clusters