For UNIX and VMS administrators, clustering is old hat. But for someone coming from a PC LAN environment, clustering might be a new concept. Clustering lets multiple servers work together as a unified computing resource to service a group of tasks, provide fault tolerance and continuous availability, or offer dynamic scalability. For networked users, clustering provides nonstop database, email, file, and other system services. In theory, you should be able to blast a hole through one of your clustered servers, and your users should never notice the difference.

In "Clusters for Everyone," Mark Smith gives you an overview of the current Windows NT clustering market. He discusses who the players are, how their clustering solutions fit in the market, and how you can use these clustering solutions in an NT environment. In the product reviews that follow this introduction, the Windows NT Magazine Lab examines seven NT clustering solutions.

The Lab picked clustering solutions that represent major clustering technologies: disk mirroring, file replication, and fault tolerance. Some vendors build their clustering solutions primarily in software. For example, Octopus' Octopus SASO and NSI's Double-Take are primarily software-based solutions. Other vendors combine hardware and software products to create a solution. Cubix's RemoteServ/IS combines Cubix technology with Citrix WinFrame. For NT Cluster-in-a-Box, Data General combines ALR servers, Data General's CLARiiON disk subsystem, and VERITAS' FirstWatch clustering software. IBM resells its Intel-based systems and disk array with either Vinca's StandbyServer for NT or Microsoft's Wolfpack. Amdahl uses its Intel-based servers and LVS disk array with FirstWatch, Wolfpack, or NCR's LifeKeeper software.

All the vendors of the products the Lab reviewed, and most of the vendors we list in "Buyer's Guide for Clustering Solutions," will support the Wolfpack APIs. For a quick overview of the other basic features and capabilities of the products we tested, see "Clustering Solutions Feature Summary." In "Clustering Terms and Technologies," you'll find explanations of common clustering terms used throughout the product reviews.

Getting Technical
Clusters work in many different ways, depending on the technology you choose. Through specific hardware and software, you can set up two-node clusters that eliminate just about every common single point of failure: power supplies, disks, processors, and network connections. If enough components fail on one of the nodes, the other node takes over. All the solutions available for NT offer two-node, shared-nothing clusters: The server nodes in the cluster are self-contained, independent computers (no shared memory bus, no shared disk). Unfortunately, shared-nothing clustering presents difficult problems to overcome, such as how to offer fully fault-tolerant, available systems and how to perform dynamic scaling. Microsoft and clustering product vendors have not yet addressed these problems in either NT or the add-on products. The technology is coming, but you might have to wait a year or two before you can get it. NT 5.0, Wolfpack 2.0, SQL Server 7.0, Exchange Server 6.0, Internet Information Server (IIS) 4.0, Transaction Server 1.0, and so on will each bring us a step closer to a self-contained, scalable BackOffice solution. Oracle Parallel Server on NT may provide a solution sooner.

NT clustering solutions have a common thread: tricking NT into making two separate systems work together as one through services and applications bolted on top of the operating system (NT is not particularly friendly to this functionality). The solutions the Lab reviewed that offer object failover all employ similar methods to get NT systems to work together. A heartbeat between the two nodes (over a direct network crossover connection, a LAN link, a serial link, or the SCSI bus) signals each system about the other's status. If all heartbeats disappear, the remaining live node assumes control over assigned assets (objects).
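To make the heartbeat idea concrete, here is a minimal Python sketch of the decision a surviving node has to make. It is an illustration only, not the code any reviewed product uses: the class name, the link names, and the 5-second timeout are all assumptions chosen for the example. The key point it shows is that the peer is presumed dead only when every heartbeat path has gone silent.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen on each link to the peer node.

    Hypothetical helper for illustration; not part of any product.
    """
    links: tuple
    timeout: float = 5.0  # seconds of silence before a link counts as dead (assumed value)
    last_seen: dict = field(default_factory=dict)

    def __post_init__(self):
        # Assume every link delivered a heartbeat at time 0.
        for link in self.links:
            self.last_seen.setdefault(link, 0.0)

    def record(self, link, now):
        """Note that a heartbeat arrived on `link` at time `now` (in seconds)."""
        self.last_seen[link] = now

    def peer_is_down(self, now):
        """True only when ALL links have been silent longer than the timeout."""
        return all(now - seen > self.timeout for seen in self.last_seen.values())


monitor = HeartbeatMonitor(links=("crossover", "lan", "serial"))
monitor.record("lan", now=8.0)          # the LAN link is still delivering heartbeats
print(monitor.peer_is_down(now=10.0))   # one live link is enough: no takeover
print(monitor.peer_is_down(now=20.0))   # every link silent: take over the peer's objects
```

Requiring silence on every link is what makes multiple heartbeat paths worthwhile: a single unplugged cable should not trigger a takeover while the other node is still running.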

Clustering solutions let you create failover objects (such as disk volumes and applications) with a primary server and a secondary (fallback) server. If the secondary server fails, nothing happens to the services running on the primary server. If the primary server goes down, the clustering software switches the service (SQL Server, files, etc.) to the secondary server. You can set up a few of the solutions the Lab reviewed in either an active/active configuration (both servers run the same application) or an active/standby configuration (only the primary server runs the application). An active/active configuration lets both servers perform meaningful work (useful for load balancing), and if one server fails, users on the other server are not disturbed as it takes over the additional load.
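The primary/secondary rule described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not any vendor's implementation: the FailoverObject class and the node names NODE_A and NODE_B are hypothetical, and a real product would also have to move disks, network addresses, and application state.

```python
from dataclasses import dataclass


@dataclass
class FailoverObject:
    """A service or volume with a preferred (primary) and fallback (secondary) node.

    Hypothetical model for illustration only.
    """
    name: str
    primary: str
    secondary: str

    def owner(self, alive):
        """Return the node that should host this object, or None if both nodes are down."""
        if self.primary in alive:
            return self.primary      # the primary always keeps the object while it is up
        if self.secondary in alive:
            return self.secondary    # otherwise the object fails over to the secondary
        return None


sql = FailoverObject("SQL Server", primary="NODE_A", secondary="NODE_B")

print(sql.owner({"NODE_A", "NODE_B"}))  # both up: stays on the primary
print(sql.owner({"NODE_A"}))            # secondary failed: nothing changes
print(sql.owner({"NODE_B"}))            # primary failed: service moves to the secondary
```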

From a technical standpoint, you need a cluster to protect any system that needs 99 percent availability (e.g., a critical file server, database system, messaging or groupware platform, or Web server). To protect other applications such as financial services, you need to either obtain the proper application kit for the solution you choose or write your own, using a software development kit (SDK).

Most of the clustering solutions the Lab reviewed require experienced systems administrators--you need NT expertise and a good understanding of your user and server applications for smooth operation. Administrative overhead plus the high setup costs of these solutions make them more appropriate for large IS shops. Small shops can use these clustering solutions too, but they need to weigh benefits vs. costs before proceeding.

Are you thinking, "I already have two servers running my applications, so why can't I just cluster them?" You can, but not without reinstalling NT, reinstalling your applications, loading the clustering software, and reloading your data. Most of the vendors support their clustering solution only on certified hardware (verified servers and disk arrays with double-ended SCSI buses), which you might not have. For example, Microsoft won't support Wolfpack on older hardware unless the hardware manufacturers go through the certification process.

Obviously, you need to pick a clustering solution that addresses the problems you want to solve. Do you need application failover for continuous availability, or just file replication for fault tolerance? Is performance important? How many users are you supporting? How much money can you spend? No one product will satisfy all these needs. Some products offer a large feature set but are expensive; other products are basic and inexpensive. Some products will interoperate with other solutions, letting you compound their effectiveness.

Ready for Prime Time?
The clustering solutions the Lab reviewed offer very high availability (99 percent), and they significantly shorten service interruptions to users. Unfortunately, the clustering solutions don't give that extra 1 percent that some UNIX or VMS solutions provide.
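To put that percentage in perspective, 99 percent availability still allows roughly 87.6 hours (about 3.65 days) of downtime a year. The short Python calculation below shows the arithmetic. It assumes a 365-day year of round-the-clock service, and the 99.9 and 99.99 percent rows are included only for comparison; they are not figures the Lab measured.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year a system may be down at the given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# 99.0 is the figure quoted above; the other levels are for comparison only.
for pct in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available allows about {hours:.2f} hours of downtime per year")
```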

Some of the NT clustering solutions' limitations stem from how the vendors must design clustering solutions. The clustering service and software are more a wrapper around application and operating system services than an integrated core component.

Currently, applications for NT do not integrate into the clustering methodology--the BackOffice applications, in particular, are not cluster aware. Clustering solutions for NT still have too many single points of failure (the hardware isn't up to the task and isn't 100 percent fault tolerant), and the clustering software does not support dynamic scaling or n-way clusters. These solutions are not 100 percent reliable and they do not provide 100 percent availability, so failover isn't completely transparent to the user.

In the reviews that follow, you'll see that the Lab experienced mixed results from the NT clustering solutions we tested. We found that some of the clustering solutions are not quite ready for prime time, but some are--if you're willing to invest significant effort to set them up and make them work.