For enterprises with mission-critical applications, clustered systems are the acid test of OS enterprise-worthiness. Although Windows NT cluster systems aren't widespread, several vendors began to demonstrate solutions at conferences this past spring and summer that significantly alter the NT clustering landscape. For information about cluster offerings from Sun Microsystems, Novell, and other vendors, see the sidebar "NT Spawns Cluster Products," page 40. International Data Corporation (IDC) predicts that businesses will cluster 60 percent of all NT servers by 2001.

Clustering in Windows 2000
Microsoft Cluster Server (MSCS), developed under the code name Wolfpack, appeared in NT Server 4.0, Enterprise Edition in early 1998. This 2-node shared-nothing cluster provides failover if one server ceases to operate. Vendors such as Compaq currently offer a money-back guarantee that MSCS will achieve 99.9 percent uptime (i.e., no more than 8.76 hours of downtime) per year. But the 5 minutes a failover reboot requires make MSCS a high-availability solution rather than a mission-critical one. I never did buy Microsoft's argument that the majority of Microsoft customers can live with 8.76 hours of downtime per year. And MSCS left other OS vendors yawning and snickering.

Because of Microsoft's painfully slow progress in clustering, MSCS will ship with only 2-node clustering in Windows 2000 Advanced Server (Win2K AS). But Microsoft's base cluster technology is about to take a significant step forward. In the February 2000 time frame, Microsoft will ship a 4-node cluster based on MSCS with Windows 2000 Datacenter Server (Datacenter). In a 4-node cluster, successive members acquire the quorum resource that defines the cluster (typically a storage device), and the cluster doesn't require a system reboot to reallocate resources when a member system fails. Cluster members can distribute a failed member's workload with load balancing based on TCP/IP redirection. Thus, the 4-node cluster offers considerably more clustering capability than the 2-node cluster.
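To illustrate the mechanics described above, here is a minimal Python sketch of quorum arbitration and resource reallocation in a shared-nothing cluster. All names (Cluster, fail, the group labels) are illustrative, not the MSCS API, and the round-robin redistribution is an assumption standing in for real placement policy.

```python
# Sketch of quorum arbitration in a shared-nothing cluster: when a member
# fails, a surviving member acquires the quorum resource and the failed
# member's resource groups move to survivors without a cluster-wide reboot.

class Cluster:
    def __init__(self, nodes):
        self.nodes = list(nodes)           # surviving member names
        self.quorum_owner = self.nodes[0]  # node holding the quorum resource
        # shared-nothing model: each node owns its own resource groups
        self.groups = {n: [f"group-{n}"] for n in self.nodes}

    def fail(self, node):
        """Remove a failed member; survivors re-arbitrate and absorb its groups."""
        orphaned = self.groups.pop(node)
        self.nodes.remove(node)
        if self.quorum_owner == node:
            self.quorum_owner = self.nodes[0]  # next member acquires quorum
        # redistribute orphaned groups round-robin over survivors (no reboot)
        for i, group in enumerate(orphaned):
            self.groups[self.nodes[i % len(self.nodes)]].append(group)

cluster = Cluster(["node1", "node2", "node3", "node4"])
cluster.fail("node1")
print(cluster.quorum_owner)  # node2
print(cluster.groups)
```

The point of the sketch is that cluster membership and ownership are pure bookkeeping among survivors; no shared lock or restart is required to continue serving the failed node's resources.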

Improved clustering is a major feature of Datacenter. With the expansion of SMP support to 32 processors, Datacenter will provide scalability and fault tolerance that large enterprises will consider carefully. Datacenter will also come with Process Control, a new management tool that Microsoft developed jointly with Sequent Computer Systems, based on the Job Object technology. Process Control lets you create rules that manage application resources in large server clusters.

Kevin Briody and Vohtan Racivorski, managers for Microsoft Clustering, describe three cluster architectures available on the market: server clusters, network load-balancing clusters, and application clusters. For information about cluster architectures, see "Related Articles in Windows NT Magazine," page 42. Server clusters provide failover when one server stops responding to the cluster heartbeat signal. MSCS is this type of cluster. Network load-balancing clusters have a master server that distributes requests to member servers to balance the load. Microsoft's Network Load Balancing (NLB) performs this type of clustering on as many as 32 servers. Microsoft based NLB on Valence Research's Convoy Cluster, which Microsoft acquired.
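The heartbeat mechanism behind server clusters can be sketched in a few lines of Python. The class names and timing values here are illustrative assumptions, not MSCS internals: the idea is simply that a node is declared dead after it misses some number of expected heartbeats.

```python
# Sketch of heartbeat-based failure detection, the mechanism that triggers
# failover in a server cluster. Interval and miss limit are assumed values.

import time

HEARTBEAT_INTERVAL = 1.2   # seconds between heartbeats (illustrative)
MISSED_LIMIT = 3           # missed beats before declaring a node dead

class HeartbeatMonitor:
    def __init__(self, peers):
        self.last_seen = {p: time.monotonic() for p in peers}

    def beat(self, peer):
        """Record a heartbeat received from a peer."""
        self.last_seen[peer] = time.monotonic()

    def dead_peers(self):
        """Return peers silent longer than the allowed window."""
        now = time.monotonic()
        limit = HEARTBEAT_INTERVAL * MISSED_LIMIT
        return [p for p, t in self.last_seen.items() if now - t > limit]

monitor = HeartbeatMonitor(["nodeA", "nodeB"])
monitor.beat("nodeA")
# simulate nodeB having been silent longer than the allowed window
monitor.last_seen["nodeB"] -= 10
print(monitor.dead_peers())  # ['nodeB']
```

In a real cluster, a peer appearing in the dead list is what initiates failover of that peer's resource groups.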

Application clusters, also called component load-balancing clusters, have a router that load balances COM+ components. Application clusters enable high availability in the business-logic tier of a distributed network solution.

Originally, Microsoft intended application clusters to be part of Datacenter. However, on September 13, Microsoft announced that it will spin off application clustering as an individual product called AppServer. For information about Microsoft's application clustering product, see "Microsoft Demos Its Next-Generation Internet Tools," InstantDoc ID 7268.

Microsoft has tested an 8-node application cluster and will probably introduce a 16-node cluster in the near future. For information about cluster technology resources, see the Microsoft Web site at ntserver.

The three cluster data models are shared disk, mirrored disk, and shared nothing. VAX clusters and Oracle Parallel Server use a shared-disk model; VAX has scaled to 16 members, and Oracle Parallel Server has scaled to 8 members. Network Specialists Incorporated (NSI), VERITAS Software, Octopus Technologies, and Vinca use mirrored clusters for disaster recovery; these clusters aren't scalable. (In April, Legato acquired Octopus Technologies through the acquisition of FullTime Software; in July, Legato acquired Vinca.)

Microsoft chose a shared-nothing cluster model over the shared-disk cluster model. A shared-disk cluster requires a proprietary hardware solution to access the disk. Microsoft wanted its cluster API to use an industry standard and run on the widest range of commodity hardware. Thus, Microsoft chose the shared-nothing cluster. Microsoft also contends that accessing the shared-disk resource using a Distributed Lock Manager (DLM) prevents scaling beyond a certain limit.
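A tiny sketch makes the scaling argument concrete. In a shared-nothing design, each resource has exactly one owning node, so a request needs no cluster-wide Distributed Lock Manager round trip; it is simply routed to the owner. The ownership map and function names below are illustrative.

```python
# Sketch of shared-nothing resource routing: one owner per resource,
# so no distributed lock is taken before an operation proceeds.

OWNERS = {"disk0": "node1", "disk1": "node2"}  # illustrative ownership map

def route_request(resource, op):
    """Forward an operation to the single node that owns the resource."""
    owner = OWNERS[resource]
    # No DLM negotiation here: the owner serializes access locally.
    return f"{owner} executes {op} on {resource}"

print(route_request("disk1", "write"))  # node2 executes write on disk1
```

Under a shared-disk model, by contrast, every node can touch every disk, so each operation must first win a distributed lock, and that lock traffic is the scaling limit Microsoft cites.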

Microsoft's clustering improvements in Windows 2000 (Win2K) will boost scalability and reliability. And in contrast to the long road from the 2-node cluster to the 4-node cluster, Microsoft might soon demonstrate products for much larger clusters, provided that Microsoft customers demand these advancements.

Cornhusker Clusters
IBM's Cornhusker project made major advancements in NT clustering. Upon viewing an 8-node Cornhusker cluster at PC Expo in June, I thought that IBM had established a clear differentiator for the Netfinity server line. To demonstrate Cornhusker, IBM shared the keynote speeches at the Windows Hardware Engineering Conference (WinHEC) 99 and TechEd conferences around the world. Cornhusker runs only on Netfinity servers, using proprietary software that IBM wrote to extend the MSCS implementation on NT Server 4.0. Applications must be cluster aware, that is, written specifically to use MSCS. DB2 is cluster aware because IBM wrote its clustering software specifically to support it; Cornhusker ran IBM's DB2 Universal Database as one application image.

Cornhusker, a server cluster with load balancing, uses the Netfinity Availability Extensions for Microsoft Cluster Services. IBM's architecture provides a management layer that performs load balancing using IBM's Phoenix technology. IBM calls this arrangement the X-architecture. Cornhusker is an n+1 cluster, with the extra server acting as a hot spare. When a member server goes down, the hot-spare server comes online to take over the failed server's workload. (Alternatively, you can distribute that workload across the surviving member servers.) In addition, the hot-spare server can remain active at all times. Cornhusker provides considerable flexibility in management and configuration; with Cornhusker, you can mix and match various types of Netfinity servers without regard for power and speed. IBM designed Cornhusker's architecture for easy upgrade to the 4-node (and larger) releases of MSCS. IBM's Longhorn project has Oracle8i running on Cornhusker.
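The n+1 arrangement described above can be sketched in a few lines. The class and server names are illustrative, not IBM's software: the essence is that one idle member stands by to absorb the first failure, after which further failures must be spread across survivors.

```python
# Sketch of an n+1 cluster with a hot spare, in the spirit of the
# Cornhusker arrangement described above. All names are illustrative.

class NPlusOneCluster:
    def __init__(self, active, spare):
        self.workload = {n: f"app-{n}" for n in active}  # active members
        self.spare = spare                               # idle hot spare

    def fail(self, node):
        """Bring the hot spare online to take over the failed node's work."""
        if self.spare is None:
            raise RuntimeError("no spare left; redistribute across survivors")
        self.workload[self.spare] = self.workload.pop(node)
        self.spare = None  # the spare is now an active member

cluster = NPlusOneCluster(["srv1", "srv2", "srv3"], spare="srv4")
cluster.fail("srv2")
print(cluster.workload)
# {'srv1': 'app-srv1', 'srv3': 'app-srv3', 'srv4': 'app-srv2'}
```

The appeal of n+1 over pure redistribution is that surviving members see no added load after the first failure; the tradeoff is one server's worth of idle capacity.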

An IBM advancement that has made larger NT clusters such as Cornhusker possible is the replacement of the 2-node shared SCSI bus, which carried the quorum disk and the heartbeat signal, with a Fibre Channel switch and fibre connections that provide greater connectivity and longer transmission distances. Mark Shelly, product manager of Interconnect Technology at IBM, said that IBM called upon RS/6000 team members to help design Cornhusker and that the team lifted the Netfinity SP fibre switch from the RS/6000 for use on NT. This switch (code-named Redhawk) offers 2.4GBps throughput per port on as many as 14 ports. IBM will sell the switch as a separate product. Dan Roy, product manager for Netfinity Clusters and High Availability Solutions, said Cornhusker will offer Domino and file-and-print services first.