Keeping mission-critical applications available for employees, business partners, and customers is a key goal for any IT department. Many IT administrators turn to clustering—physically and programmatically linking two or more systems to ensure high availability and to balance workloads—to achieve this goal. Clustering lets application processing continue on another server if the primary system fails or you need to shut down the primary system for maintenance. Clustering can also improve performance by evenly distributing workloads among several servers. No matter how many systems make up the cluster, it appears as one server to end users.

If your primary goal is to improve the availability of your back-end applications, you can use Microsoft Cluster service, which is included in Windows 2000 Advanced Server (Win2K AS) and Win2K Datacenter Server, or choose from a variety of competing products. (Cluster service is known as Microsoft Cluster Server—MSCS—in Windows NT Server, Enterprise Edition—NTS/E.) All these products move processing of back-end applications, such as enterprise resource planning (ERP), database management, and messaging applications, from the primary cluster server (called a node) to other cluster nodes when the primary node fails. This process is known as failover. After you repair the failed server, the clustering software shifts resources and processing back to the original node, a process known as failback. Making the best clustering-product choice depends on multiple factors, including the risks to your computing environment, your current disaster-recovery plan, the geographic separation of your servers, the location of clients that use the clustered applications, the number of applications and servers to be clustered, and of course, your budget.

Clustering Basics
When deciding on a high-availability strategy, you must first understand that clustering doesn't deliver true fault tolerance. No matter whose clustering product you choose, failure of a cluster node or application results in 5 seconds to 30 seconds of application downtime, depending on the number of transactions written to the transaction log since it was last saved. In addition, depending on the design of the client application, users might have to reconnect to the clustered application when it resumes on the new node. For some environments, these inconveniences are inconsequential. Other environments might need a fault-tolerant solution that can deliver higher levels of availability than clustering products provide. For more information about two such solutions, see Ed Roth, Lab Feature, "Stratus ftServer 3210," July 2002, InstantDoc ID 25335, and John Green, Lab Feature, "Endurance 6200 3.0," July 2001, InstantDoc ID 21140.

A common clustering implementation is to have all clustered nodes run the cluster-protected applications so that servers don't sit idle. This arrangement is referred to as an active/active configuration. Should one node fail or be shut down, the failover process copies the necessary resources to the active target node. Depending on the target node's configuration and utilization, the added workload might degrade performance of existing and failed-over applications. The alternative is to create an active/passive configuration, in which one or more servers sit idle or run nonclustered applications until the primary server fails.
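To make the failover mechanics concrete, here's a toy sketch of an active/active pair in Python. The node names, resource groups, and "least-loaded survivor" policy are illustrative assumptions, not any vendor's actual API or algorithm:

```python
# Toy model of active/active failover: each node runs its own resource
# groups; when a node fails, its groups move to a surviving node.
# Node names and the target-selection policy are hypothetical.

def fail_over(cluster, failed_node):
    """Move the failed node's resource groups to the least-loaded survivor."""
    survivors = {n: g for n, g in cluster.items() if n != failed_node}
    if not survivors:
        raise RuntimeError("no surviving node to fail over to")
    # Pick the survivor running the fewest groups (one simple policy).
    target = min(survivors, key=lambda n: len(survivors[n]))
    cluster[target].extend(cluster[failed_node])
    cluster[failed_node] = []
    return target

cluster = {"node1": ["Exchange"], "node2": ["SQL"]}   # active/active
target = fail_over(cluster, "node1")
print(target, cluster)   # node2 now carries both workloads
```

The sketch also shows why an active/active failover can degrade performance: the target ends up carrying both workloads at once.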

Cluster Service
Win2K AS supports two nodes running Cluster service (as does NTS/E running MSCS); Win2K Datacenter supports as many as four nodes running Cluster service. Win2K AS's two-node limitation might be a problem if you want to cluster two Microsoft Exchange Server systems that would support several thousand employees. Microsoft recommends that you support no more than 1900 Exchange users per cluster node. If you followed this guideline, you'd need to create two active/passive clusters, with an idle node for each Exchange server. Alternatively, Win2K Datacenter's four-node limit would let you create a three-node cluster with one idle system serving as a failover node for both Exchange servers. (To anticipate the failure of both primary Exchange nodes, you'd need to add a fourth passive node to the cluster, of course.) But because Microsoft sells Win2K Datacenter only with server hardware, this would be an expensive clustering solution. Clustering multiple Exchange servers with Cluster service will be easier and less expensive when Windows .NET Enterprise Server (Win.NET Enterprise Server—the follow-on to Win2K AS) is available because it will support eight-node clusters and won't be packaged with server hardware. Obviously, the failover targets need to have sufficient processor, memory, and storage resources to support the additional workload of the failed servers.
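The node counts above follow directly from the 1900-users-per-node guideline. A quick back-of-envelope calculation (assuming 3800 users, a figure chosen only to match the "several thousand employees" scenario in the text):

```python
import math

users = 3800                 # hypothetical user count for this scenario
users_per_node = 1900        # Microsoft's guideline per Exchange cluster node

active_nodes = math.ceil(users / users_per_node)   # Exchange servers needed
# Win2K AS two-node limit: each active/passive pair needs its own idle node.
win2k_as_nodes = active_nodes * 2                  # 4 servers in 2 clusters
# Win2K Datacenter four-node limit: one shared passive node can back both
# active nodes, provided only one fails at a time.
datacenter_nodes = active_nodes + 1                # 3 servers in 1 cluster
print(active_nodes, win2k_as_nodes, datacenter_nodes)  # 2 4 3
```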

Cluster service is a good solution if you plan to run more than one clustered application per node. If Cluster service detects that one of the clustered applications has failed and can't be restarted, Cluster service can fail over just the affected application without disrupting the others if the applications are cluster-aware and are in different resource groups.

Most Cluster service clusters share a common storage array that connects directly to each node or is part of a Storage Area Network (SAN). Shared-storage clusters are simpler to implement than clusters that replicate data between nodes, but the shared array becomes a single point of failure and limits the geographic separation of the cluster nodes. You can geographically separate Win2K AS or Win2K Datacenter's Cluster service nodes if each node connects to a SAN-connected storage array and each array's I/O controller synchronously replicates data and quorum disk information (i.e., cluster-configuration information stored on a special volume). Third-party products such as NSI Software's GeoCluster Advanced Server 4.1 can also perform data replication for Cluster service. Win.NET Enterprise Server and Win.NET Datacenter Server will let you separate nodes in two locations—the OSs will provide a quorum mechanism that synchronizes cluster configuration information among all nodes, but you'll still need a separate data-replication capability. All Cluster service nodes, storage arrays, and I/O controllers must be on the Win2K or Win.NET Cluster Hardware Compatibility List (HCL) if you want Microsoft to support them. Figure 1 shows Cluster service's management console.

Shared Storage
Cluster service is a full-featured solution that's well suited for many scenarios, but if you need to create three- or four-node clusters, Win2K Datacenter is an expensive route to get there. Win.NET Enterprise Server looks promising, but it won't solve your problem today. Cost might also be a problem if your servers aren't already running NTS/E or Win2K AS and cluster-aware versions of your applications, as Cluster service requires. In addition, if you want to locate a node in another geographic area as part of a disaster-recovery plan but you don't have an external storage array and SAN, Cluster service alone won't work for you.

Companies such as VERITAS Software, SteelEye Technology, Legato Systems, Computer Associates (CA), and NSI Software offer clustering products that address these shortcomings. All the clustering products provide an administrative console to manage the cluster from a central location.

If you intend to keep all your cluster nodes situated within your data center but you need support for more nodes than Cluster service supports, take a look at VERITAS Cluster Server 2.0 (pricing starts at $6000 per node). Figure 2, page 30, shows the product's administrative console. Like Cluster service, VERITAS Cluster Server uses a shared storage array, but VERITAS claims it has tested its product with as many as 32 nodes. VERITAS Cluster Server also offers an interesting load-balancing feature called Advanced Workload Management that lets you fail over a clustered application to a node with greater CPU or memory resources according to criteria that you set (e.g., when the number of customer transactions per minute reaches a level that you define). VERITAS says its clustering software works with any server and any version of Win2K or NT 4.0, but be sure that VERITAS has tested your storage array model.

SteelEye Technology's LifeKeeper for Windows 2000 and LifeKeeper for Windows NT (pricing starts at $2500 per node for each product) also use a common storage array. Like VERITAS with VERITAS Cluster Server, SteelEye says it has tested LifeKeeper for Windows 2000 with as many as 32 nodes. The vendor has tested LifeKeeper for Windows NT with clusters as large as 16 nodes. Creating such large clusters would be a configuration nightmare, but it's good to know that you can create a four- or five-node cluster if you need to. Figure 3 shows the LifeKeeper interface.

SteelEye's optional application recovery kits (ranging in price from $750 to $10,000 per node) monitor applications and attempt to restart a failed program before initiating a failover. Kits are available for Exchange, Microsoft SQL Server, Microsoft IIS, IBM DB2, Oracle9i and Oracle8i, and other enterprise applications.

Separate Storage
Instead of sharing storage, some clustering products store data on each node and employ replication to synchronize that data. This approach lets you place one or more nodes at other locations so that mission-critical processing can continue in the event of a local catastrophe. Although this model eliminates the single-point-of-failure problem inherent in shared-storage arrays, it introduces other challenges, particularly in a geographically dispersed cluster. High latency of the network connecting the nodes could degrade application performance for cluster solutions that use synchronous replication to synchronize data and cluster configuration information. Similarly, in asynchronous clustering solutions, insufficient network bandwidth might result in new transactions waiting to be transmitted over the network. If the node that the transactions are on goes down, the queued transactions won't reach the remote node.
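The asynchronous-replication risk described above can be sketched in a few lines. This toy replicator (its structure is an assumption for illustration, not any product's design) acknowledges writes locally and ships them to the remote node only as network capacity allows:

```python
from collections import deque

class AsyncReplicator:
    """Toy asynchronous replicator: writes succeed locally at once and
    are queued for later transmission to the remote node."""

    def __init__(self):
        self.queue = deque()      # transactions awaiting transmission
        self.remote = []          # what the remote node has received

    def write(self, txn):
        # Local write is acknowledged immediately; replication lags behind.
        self.queue.append(txn)

    def drain(self, capacity):
        # The network can ship only `capacity` transactions per interval.
        for _ in range(min(capacity, len(self.queue))):
            self.remote.append(self.queue.popleft())

rep = AsyncReplicator()
for t in range(10):               # 10 local transactions arrive...
    rep.write(t)
rep.drain(capacity=4)             # ...but bandwidth ships only 4
# If the source node dies now, the 6 queued transactions never reach
# the remote node:
print(len(rep.remote), len(rep.queue))   # 4 6
```

Synchronous replication avoids this window by not acknowledging a write until the remote copy exists, at the cost of adding the network round-trip to every transaction.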

SteelEye's LifeKeeper nodes use a common disk array, but the vendor's optional Extended Mirroring product ($1400 per node) lets you create a remote cluster node. Extended Mirroring supports synchronous and asynchronous replication.

Legato's Co-StandbyServer 2000 and Co-StandbyServer NT products (pricing starts at $6500 per node pair for the Win2K version and $5499 per node pair for the NT version) allow for a maximum of two clustered nodes and provide synchronous data mirroring between the nodes. The two products work with any version of their respective OSs. Figure 4 shows the Co-StandbyServer 2000 management console. Legato positions Co-StandbyServer primarily for local clustering.

The company offers RepliStor 5.0 (pricing starts at $2499 per node) to cluster a maximum of two cluster nodes in geographically separate locations. RepliStor uses asynchronous replication to synchronize the nodes. Unlike Cluster service and Co-StandbyServer, which monitor the status of the clustered applications as well as the server and communications link, RepliStor monitors only server and network status, so you'll need third-party systems management software to alert RepliStor to application faults. RepliStor works with any version of Win2K or NT 4.0 and any server on the Win2K or NT 4.0 HCL.

At $1895 per node, CA BrightStor High-Availability Manager 7.0 is a relatively inexpensive clustering solution that uses asynchronous replication for data synchronization. According to CA, only network bandwidth and server resources limit the number of nodes. CA also says that you can use any Windows server as a node. In addition to providing a heartbeat, BrightStor High-Availability Manager lets cluster nodes ping other network servers or routers to verify network connectivity. BrightStor High-Availability Manager doesn't provide application monitoring, but you can use other systems management applications to alert BrightStor High-Availability Manager if an application fails. Figure 5 shows the BrightStor High-Availability Manager administrative interface.
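The value of pinging other network hosts, as BrightStor High-Availability Manager does, is that it lets a node distinguish "my peer died" from "my own link died." Here's a pure-logic sketch of that decision (the function and witness names are hypothetical; a real product would ping actual routers and servers):

```python
def diagnose(peer_alive, witnesses_reachable):
    """Distinguish a failed peer from a local network fault.
    A node that loses its heartbeat but can still reach known hosts
    (routers, other servers) concludes the peer failed; if it can
    reach nothing, its own link is suspect and failover is held off."""
    if peer_alive:
        return "healthy"
    if any(witnesses_reachable.values()):
        return "peer-failed: initiate failover"
    return "local-network-fault: do not fail over"

print(diagnose(False, {"router": True, "dns": True}))
print(diagnose(False, {"router": False, "dns": False}))
```

Without the witness check, a node with an unplugged cable could wrongly declare its healthy peer dead and trigger a needless (and possibly conflicting) failover.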

NSI Software offers two clustering products that use asynchronous data replication. The company's Double-Take for Windows, Server Edition 4.2 ($2495 per node) and Double-Take for Windows, Advanced Server Edition 4.2 ($4495 per node) are an alternative to Cluster service. (Double-Take for Windows, Advanced Server Edition is for NTS/E and Win2K AS.) Figure 6 shows the Double-Take Management Console. NSI's GeoCluster Advanced Server 4.1 product ($4495 per node) runs on top of Cluster service to let you geographically separate Cluster service nodes. According to NSI Software, Double-Take works with any Windows server, and only network bandwidth and server resources limit the number of nodes. GeoCluster supports two nodes and runs on any Cluster service server. Double-Take detects only server and network-connectivity failures, so you need third-party systems management software to alert Double-Take if an application fails.

Pros and Cons
Each of the third-party products I've mentioned addresses particular needs that Cluster service doesn't meet, and all of them run with Win2K Server and NT 4.0 Server and standard versions of clustered applications. But the products don't match Cluster service feature for feature. For example, with some third-party products, you must use other systems management tools to alert the clustering software to an application failure.

If you decide on a clustering product that uses data replication rather than a common disk array, you'll need to do some homework to determine the bandwidth requirements for each clustered application and ensure that your network can meet those requirements. Even if you opt for a product that uses a common storage array, you might also employ a clustering product that supports distant nodes to keep your applications running after a local disaster. If you don't need the failover capability at a distant site, you can still use replication clustering products to create realtime backups at a remote site.
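That homework amounts to simple arithmetic. A back-of-envelope check (all figures here are hypothetical placeholders; substitute your own measured change rates):

```python
# Hypothetical inputs -- measure these for your own applications.
change_rate_mb_per_hour = 500          # data changed per clustered app
apps = 3                               # clustered applications replicating
overhead = 1.2                         # assumed protocol/replication overhead

mb_per_sec = change_rate_mb_per_hour * apps * overhead / 3600
mbit_per_sec = mb_per_sec * 8
print(round(mbit_per_sec, 2))          # 4.0 Mbit/s sustained
# At this rate a T1 (about 1.5Mbps) would fall steadily behind,
# leaving transactions queued on the source node.
```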

Contact the Vendors
Microsoft * 425-882-8080

VERITAS Software * 650-527-8000 or 800-327-2232

SteelEye Technology * 650-318-0108 or 877-319-0108

Legato Systems * 650-210-7000

Computer Associates * 631-342-6000 or 800-225-5224

NSI Software * 317-598-1174 or 888-674-9495