You can increase NT's reliability

Microsoft Cluster Server (MSCS) can deliver excellent Windows NT system availability for a reasonable cost. From my experience deploying several MSCS clusters that are achieving near-perfect reliability (i.e., running with only about 0.001 percent of unplanned downtime), I know that you can use MSCS to make NT fault tolerant. However, managing a dependable MSCS cluster (i.e., a cluster that offers services at all required times) involves more specialized preparation, implementation, and maintenance than managing two standalone servers.

MSCS is a component of NT Server, Enterprise Edition (NTS/E), and Microsoft also includes MSCS in Windows 2000 Advanced Server and Win2K Datacenter Server. Win2K MSCS is nearly identical to the NTS/E version except that Win2K MSCS can support more than two servers in a cluster (e.g., Datacenter will support a 4-node cluster). Most of the information in this article applies equally to Win2K and NT clusters, although concerns that I don't address here might arise from Win2K clustering. If you're considering an MSCS cluster implementation in the near future, I recommend that you use NTS/E.

Preparation
MSCS clusters can provide high availability, although not to the degree that very specialized—and very expensive—high-end non-NT systems offer. After all, you're still dealing with NT's security vulnerabilities, hotfix reboots, and volume hardware. Therefore, you must evaluate your requirements and environment before you move ahead with MSCS, then decide whether the product will offer sufficient functionality to meet your needs.

Evaluate your business needs. If your business demands a system that is available 24 × 7, or if your business or profile makes you a target for attack, an NT-based solution isn't for you. If you can occasionally take down your system for brief maintenance, and if your business can tolerate a few minutes of unplanned outage each year, an MSCS cluster might be appropriate.
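
To judge whether "a few minutes of unplanned outage each year" fits your requirements, it helps to convert a downtime percentage into minutes. The following Python sketch does that arithmetic; the function names are illustrative only and aren't part of any Microsoft tool. By this calculation, the 0.001 percent figure cited earlier works out to a little more than 5 minutes per year.

```python
# Convert an unplanned-downtime percentage into minutes per year.
# Assumes a 365-day year; all names here are illustrative.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(unplanned_downtime_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given percentage."""
    return MINUTES_PER_YEAR * unplanned_downtime_percent / 100.0


def availability_percent(unplanned_downtime_percent: float) -> float:
    """Return the matching availability figure (e.g., 0.001 -> 99.999)."""
    return 100.0 - unplanned_downtime_percent


if __name__ == "__main__":
    for pct in (0.1, 0.01, 0.001):
        print(
            f"{pct}% downtime = {availability_percent(pct):.3f}% availability, "
            f"about {downtime_minutes_per_year(pct):.1f} minutes per year"
        )
```

Compare the result with the outage your business can tolerate before you commit to an NT-based cluster.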

Non-business-critical systems can also be good candidates for clustering. Each server in an MSCS cluster can run different applications and provide failover for the other servers. The security of having a managed hot standby for each server can justify the incremental cost of configuring two otherwise standalone servers as a cluster.

Evaluate your applications. Some applications are less suitable for a clustered environment than others. Suppose you install an application on both cluster servers. The application runs on one server and is dormant on the second server. If the active server fails, the dormant version on the second server will ideally start automatically. But some applications (e.g., databases that don't implement automatic integrity recovery) can't recover automatically and so aren't optimal choices for a clustered environment.

If you decide to install on a cluster an application that can't recover automatically, you'll need to intervene to recover the application if it fails. Document the required steps, and thoroughly test the recovery process. Alternatively, if you can program the process as a script, you can configure MSCS to execute the script before attempting to bring the application online. Again, comprehensive testing is essential. If the script fails, you'll need a tested manual process to bring up the application.

Some proprietary applications write information to the Registry as they execute. In a clustered environment, the MSCS service replicates this information in the failover server's Registry. However, a systems administrator must exactly describe those Registry keys in the application's cluster resource definition. This task requires detailed application knowledge or documentation, so I recommend against clustering applications that write to the Registry while they're running.

The best candidates for clustering are applications that maintain configuration and state information on the shared-disk storage: Examples include file and printer shares, Microsoft IIS, Microsoft SQL Server, and Oracle databases. If you use Oracle databases, I strongly recommend that you also install Oracle's Fail Safe product, which creates an Oracle Database cluster-resource type and provides useful tools to integrate Oracle databases into an MSCS environment.

Evaluate your hardware. Although MSCS lets you cluster dissimilar servers, I recommend that you use identical servers in a cluster whenever possible. Doing so lets you identically configure and manage clustered servers, simplifying administration and increasing the likelihood of successful failover.

To leverage resources, you can run different applications on each server. However, be sure each server has sufficient capacity to run all applications if one server fails; otherwise, you'll need to accept an increase in response time and a reduction in user population while all the applications temporarily run on one server during failover.
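
A simple way to check this capacity requirement is to add up the load of every application and compare the total with what the surviving server can carry. The Python sketch below assumes that you can express capacity and load in one normalized unit (e.g., a percentage of CPU or memory); the class, the function, and the figures in the example are hypothetical.

```python
# Failover capacity check for a two-server cluster.
# Capacity and load use the same arbitrary unit; all values are hypothetical.
from dataclasses import dataclass
from typing import Dict


@dataclass
class Server:
    name: str
    capacity: float              # what the server can carry in total
    app_loads: Dict[str, float]  # application name -> typical load


def failover_shortfall(survivor: Server, failed: Server) -> float:
    """Return how far the survivor is over capacity when it runs everything.

    A result of 0.0 means the survivor can carry all applications.
    """
    combined = sum(survivor.app_loads.values()) + sum(failed.app_loads.values())
    return max(0.0, combined - survivor.capacity)


if __name__ == "__main__":
    server_a = Server("A", 100.0, {"database": 45.0, "file shares": 10.0})
    server_b = Server("B", 80.0, {"web": 30.0, "print": 5.0})
    for survivor, failed in ((server_a, server_b), (server_b, server_a)):
        gap = failover_shortfall(survivor, failed)
        status = "OK" if gap == 0 else f"over capacity by {gap}"
        print(f"If {failed.name} fails, {survivor.name} is {status}")
```

A nonzero shortfall on either server is the case in which you must accept slower response or fewer users during failover.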

The choices for cluster storage architecture are SCSI or fibre channel. SCSI is economical and established; fibre channel is expensive and relatively new but promises better performance and reliability than SCSI. Microsoft has also mentioned (e.g., in TechNet presentations) that fibre channel will be the primary focus for the company's future clustering solutions. (For a detailed comparison of SCSI and fibre channel, see Dean Porter, "Fibre Channel, SCSI, and You," September 1997.) I recommend that you use fibre channel if it's within your budget.

Implementing a cluster doesn't mean that you can neglect server and storage resilience measures. Several factors, such as hardware-component resilience, determine your system's overall availability. You need to make each system constituent as reliable as possible; don't depend on cluster software to come to the rescue during server failures. Invest in the relatively inexpensive redundancy features (e.g., power, fans, network cards) that most modern servers include, and protect your local server storage against disk failure with mirroring (i.e., use an internal RAID controller or NT mirroring).

Common shared-disk cluster storage creates a single point of failure: If the cluster storage becomes inaccessible, so does your system. Implement disk controllers as redundant pairs that work together. Provide redundant power and cooling for the storage unit. Protect disks, ideally by mirroring.

Implementation
Remember, you aren't implementing just two servers and a storage unit; you're implementing a cluster. You need specific knowledge and skills to ensure successful performance. I recommend that you read as much authoritative documentation as possible. (For a list of useful documentation, see "MSCS Resources.") Don't rely on the NTS/E printed manual, which is out of date in several key areas. Research the subject thoroughly, not only to gain cluster-specific knowledge but also to find out how a cluster environment will affect your existing processes. For example, you might be using Rdisk or Regback as part of your overall security strategy. The cluster Registry hive, Clusdb, resides in the \winnt\cluster directory; neither Rdisk nor Regback will automatically copy this hive. Unless you use Regback to manually copy the hive, your Emergency Repair Disk (ERD) or manual repair directory will be incomplete.

Implementing MSCS. The implementation process comprises several stages. Bear in mind that your aim is to configure all cluster elements as perfectly as possible, leaving only circumstances beyond your control as threats to your system's availability. Complete and test each stage, and resolve any problems before you progress to the next stage—don't wait until after you complete the full installation process to resolve problems. I suggest that you progress through installation following these steps (refer to detailed documentation for step-by-step installation instructions).

  1. Install all the hardware (e.g., servers, controllers, disks).
  2. Install NTS/E on each server, and upgrade to Service Pack 3 (SP3). SP3 comes with NTS/E, so use this service pack during these initial steps. You can apply a higher service pack later in the process, if you want.
  3. For recovery purposes, build a second basic OS installation (i.e., an installation without software other than programs that you need to run your network card, tape drive, and cluster storage access) on each server. Try to put this emergency recovery installation on a different disk from the server's primary installation. Install SP3.
  4. Install any additional device drivers that you need to access cluster common shared-disk storage.
  5. Use an external access method (e.g., serial port) to configure the cluster storage controllers. Configure one device to be the cluster quorum disk. (For information about quorum disks, see Mark Russinovich, NT Internals, "Inside Microsoft Cluster Server," February 1998.)
  6. Install MSCS on one server; keep the second server at the OS selection menu. Reboot the first server, and confirm that MSCS Cluster Administrator connects to the cluster service and displays the cluster details.
  7. Install MSCS on the second server while the first server is fully booted. Reboot the second server, and confirm that Cluster Administrator now shows both servers in the cluster.
  8. Confirm that the Cluster Group and Quorum Disk Group can move successfully between servers during both manual initiation and server shutdown.
  9. If you want to use a service pack higher than SP3, apply it (and any hotfixes) now.
  10. If you want to configure more cluster storage devices, do so now. Follow the method that Microsoft's "MS Cluster Server Administrator's Guide" describes. You can find this guide on TechNet (http://www.microsoft.com/technet) or on the Microsoft Product Support Services (PSS) Web site (http://support.microsoft.com). Incorporate the additional devices into the cluster resource groups, and test the devices for successful failover. You now have a functional cluster on which to build applications.

Installing applications on the cluster. Installing an application in a cluster setting requires planning. As I mentioned, avoid applications that update the Registry during operation. Identify the configuration files, status files, and log files that the application reads, updates, and writes. You need to store these files on cluster disks so that the files will move with the application from one server to the other.

Determine the application's resource dependency relationships. For example, you can't bring a database online before its data disks and client-access network name, and the network name depends on the associated virtual server's IP address. You can define dependency relationships between resources in the same resource group only; therefore, all resources that you associate with an application must be in the same group as the application. If multiple applications or multiple application instances (e.g., databases) share a resource, all the applications or instances must be in the same group. This requirement restricts your ability to load-balance different applications between two servers. When you want to run the same application on both servers, you need to define the application's resources on both disks as well. If the application uses any Cluster Group resources (e.g., the cluster network name), you'll need to add the application's resources to the Cluster Group. To install and validate a typical application or instance, follow these steps:

  1. Create a resource group for the application. Define appropriate cluster disks in the group. Define other relevant resources (e.g., network name, virtual server IP address). Confirm that you can bring the group online from both servers.
  2. Bring the group online on one server (i.e., server A). Install the application on server A, and configure the application to use the cluster storage. Confirm that the application functions correctly (i.e., starts, runs, and stops correctly) on server A.
  3. Define the application services or programs as cluster resources. Set services to start manually rather than automatically. Confirm that you can use Cluster Administrator to correctly stop and start the application.
  4. Move the group to the other server (i.e., server B). Install the application on server B and configure the application to use the cluster storage; take care not to overwrite any settings that you defined on the cluster storage when you configured the application on server A. Again, set any services to start manually. Confirm that you can use Cluster Administrator to correctly stop and start the application on server B.
  5. Confirm that the application will fail over from one server to the other in all circumstances (e.g., manual server shutdown and server failure).
  6. Follow the same process for other applications. After each installation, confirm that the existing applications still work correctly.
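
Before you define resources in Cluster Administrator, it can help to write down the dependency relationships and check them on paper. The Python sketch below (Python 3.9 or later, for the standard graphlib module) models the database example described earlier: it flags any dependency that crosses resource groups and derives an order in which the resources could come online. The resource and group names are hypothetical, and the code doesn't talk to MSCS; it only illustrates the planning step.

```python
# Plan resource dependencies for one clustered application.
# Resource and group names are hypothetical; nothing here calls MSCS.
from graphlib import TopologicalSorter

# resource name -> (resource group, resources it depends on)
RESOURCES = {
    "Data Disk": ("Database Group", set()),
    "IP Address": ("Database Group", set()),
    "Network Name": ("Database Group", {"IP Address"}),
    "Database": ("Database Group", {"Data Disk", "Network Name"}),
}


def cross_group_dependencies(resources):
    """Return (resource, dependency) pairs that sit in different groups."""
    problems = []
    for name, (group, deps) in resources.items():
        for dep in deps:
            if resources[dep][0] != group:
                problems.append((name, dep))
    return problems


def online_order(resources):
    """Return an order in which every dependency precedes its dependents."""
    graph = {name: deps for name, (_group, deps) in resources.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print("Cross-group dependencies:", cross_group_dependencies(RESOURCES))
    print("Bring online in this order:", online_order(RESOURCES))
```

If the first check reports any pairs, move the resources into one group before you define the dependencies.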

You can configure utility software, such as backup or system-monitoring software, to recognize each server individually or to recognize the cluster environment as a whole. The correct approach depends on the utility's characteristics and functions. For example, suppose you want to use a simple system-monitoring utility to monitor a clustered service. The utility sends SNMP alerts to a management framework when certain services aren't running. But at any time, the clustered service is running on one server and stopped on the other. If you run the monitoring utility on both servers, the monitor on the server with the stopped service won't be aware that the service is running correctly on the other server and will constantly generate alerts.

One solution might be to include in the management framework logic that recognizes that the two servers are related and that only raises an alert when both servers report that the service isn't running. Another solution might be to integrate the system monitor into the cluster with one application instance running and use the MSCS cluster command-line utility to confirm that the monitored service is online.
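
The first of these solutions comes down to a small correlation rule in the management framework. The Python sketch below shows that rule in isolation, assuming that the framework can collect one running/not-running report per cluster server; the function name and the report format are hypothetical and don't correspond to any particular monitoring product.

```python
# Correlate per-server service reports before raising an alert.
# The report format (server name -> bool) is a hypothetical simplification.
from typing import Dict


def should_alert(service_running_by_server: Dict[str, bool]) -> bool:
    """Alert only when no cluster server reports the service as running.

    In a healthy cluster the service runs on exactly one server, so a
    "stopped" report from the standby server alone is not a fault.
    An empty set of reports also triggers an alert, because nothing
    confirms that the service is up.
    """
    return not any(service_running_by_server.values())


if __name__ == "__main__":
    normal = {"SERVER_A": True, "SERVER_B": False}    # service active on A
    failed_over = {"SERVER_A": False, "SERVER_B": True}
    outage = {"SERVER_A": False, "SERVER_B": False}
    for label, reports in (("normal", normal),
                           ("failed over", failed_over),
                           ("outage", outage)):
        print(f"{label}: alert = {should_alert(reports)}")
```

With this rule, the standby server's stopped service no longer generates a constant stream of alerts, but a genuine outage on both servers still does.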

Maintenance
You've created a stable, highly available environment—now you want to keep it that way. And if something goes wrong, you need to be able to put it right quickly.

If you need to perform hardware or software maintenance on a clustered server, you can move all the server's resource groups to the other server and perform the maintenance while the application services are online. However, the best approach is to perform the work during application downtime, if possible. That way, you can avoid unexpected service loss if something goes wrong. In particular, don't follow the rolling service pack upgrade method that Microsoft describes in recent service pack release notes; wait until you can upgrade both servers while the application services are offline.

MSCS doesn't have any Distributed Lock Manager (DLM) functions; therefore, only one server at a time can access a particular disk. By default, the MSCS software grants cluster-disk access to a server only when Cluster Service determines that the server is the rightful owner of that disk. Therefore, Cluster Service must start successfully on a server and either form or join a cluster before the server can access cluster disks. If a disk problem (e.g., a corrupt quorum log) causes Cluster Service to fail, you can't access the disk to repair the problem. If the quorum log becomes corrupt, Cluster Service writes an error identifying the problem to the NT System log before the service shuts down. To recover, you can use the -noquorumlogging parameter to manually start Cluster Service from the Control Panel Services applet. This parameter lets Cluster Service start without trying to open the quorum log. Delete the quorum log (i.e., \mscs\quolog.log on the quorum disk), then stop and restart Cluster Service. The service will create a new quorum log.

MSCS RESOURCES
BOOKS
Windows NT Microsoft Cluster Server
Author: Richard Lee
Publisher: McGraw-Hill, 1999
Microsoft Articles
"Cluster Server Troubleshooting and Maintenance White Paper"
http://support.microsoft.com/support/kb/articles/q238/6/27.asp
Related Articles in Previous Issues
You can obtain the following articles from Windows 2000 Magazine's Web site at http://www.win2000mag.com/articles.

BRAD COOPER
"Installing Microsoft Cluster Server," October 1998 Web Exclusive, InstantDoc ID 3923
JONATHAN CRAGLE
"Balanced Cluster Service," February 1999, InstantDoc ID 4812
RON MILIONE
"Paving the Way for Microsoft Cluster Server," May 2000 Web Exclusive, InstantDoc ID 8207
JIM PLAS
"Build a High-Availability Web Site with MSCS and IIS 4.0," June 1999, InstantDoc ID 5371
MARK RUSSINOVICH
NT Internals, "Inside Microsoft Cluster Server," February 1998, InstantDoc ID 2943
BARRIE SOSINSKY
"NT Clusters," November 1999, InstantDoc ID 7291

As another complication, you can't run Chkdsk on a cluster disk. Cluster Service locks cluster disks, so Chkdsk can't access them, and cluster disks come online only after Cluster Service starts, so you can't set Chkdsk to successfully run during a reboot. If you want to run Chkdsk on a cluster disk, shut down one server and boot the other server from its emergency recovery installation. Because the recovery installation doesn't include cluster software, it doesn't restrict access to the cluster disks. You can now run Chkdsk on the problem disk, then reboot both servers through the regular cluster installation.

Some software, drivers, and utilities might not be perfectly suited to a clustered environment. Before you modify a cluster, back up the system and refresh the ERD. Make the change, then confirm that all cluster operations still function and resource groups still move between servers. Backup procedures can be complicated because you can connect drives to either cluster server. Develop a simple procedure that will work under typical conditions, then develop specific procedures for specific failure scenarios.

MSCS or Bust?
MSCS can let you have NT and high availability at a reasonable cost, but you need to consider your business needs, applications, and hardware before deciding whether MSCS is the right tool for your environment. The effort you put into cluster installation and management will affect your cluster's success—and reflect the importance your organization places on its services.