Let's say you decide to build your intranet using Microsoft SQL Server on one Intel-based server and Internet Information Server (IIS) on another. Suppose the IIS server dies and, within a few seconds--no user down time--all the IIS users and processing automatically switch to another server. What if Windows NT Server had that failover capability tightly integrated, and this capability required no proprietary hardware? Interested?
Or, suppose you convince your CEO that NT really can scale. You put a 100GB SQL Server database on your four-way Pentium Pro system to serve 5000 users, and it runs out of gas. Where do you go from there? Do you need to look at an eight-way parallel system, or can you just add another four-way Pentium Pro server and have it work on the same database?
Imagine you can keep your system running while you're applying updates to the operating system or key application software. For example, you can wait until an off-peak time and move an application easily from one server to another and then apply a new version or service pack to the offline server and run some tests. When you're comfortable with the change, you can move the application back, testing as you go. If a problem occurs, you can easily move the application over again. All this while, users are up and running the very application you're updating. Is this an administrator's dream?
Okay, now that you've seen a few scenarios, let me formally define clusters, the technology that allows such solutions. A cluster is a group of whole, standard computers working together as a unified computing resource that can create the illusion of being one machine--a single-system image. The term whole computer, which is synonymous with node, means a system that can run on its own, apart from the cluster.
Clusters have been addressing the problems of availability, performance, and scaleability for years. Until now, however, cluster vendors have focused on serving high-end customers, ignoring high-volume server markets. Gregory Pfister, a senior technical staff member at IBM's server group in Austin, Texas, wrote In Search of Clusters: The Coming Battle in Lowly Parallel Computing. In this book, Pfister identifies three keys to making clusters a high-volume solution: the speed of microprocessors, the availability of standard high-speed communications, and the existence of standard tools for distributed computing.
Pfister believes all three requirements, especially the speed of microprocessors, are now met. The time is ripe to move clusters into a high-volume market, such as the one NT serves. The current crop of NT microprocessors--Intel, Alpha, MIPS, and PowerPC--is as fast as the CPUs in the largest computers. One example is Intel's new standard, high-volume (SHV) system, a four-way Pentium Pro-based motherboard that Intel developed to take advantage of NT's symmetric multiprocessing (SMP) capabilities.
Intel is planning to create complete SHV systems for some OEMs, who will change only the machine's faceplate. With Intel in the system business, you can expect these high-end machines to roll out from more than 20 vendors within a few months, creating a ready-made SHV market for clusters.
Pfister predicts that vendors will greatly profit from developing a high-volume cluster market if they can take advantage of the timing and solve a few fundamental problems, such as creating a single-system image, forming a standard so that software vendors aren't locked into one hardware vendor's solutions, and pricing cluster software licenses so that the cost doesn't exceed that of large parallel systems. Solving these problems could drive a tenfold increase in the quantity of cluster solutions shipped each year.
To appreciate the significance and implications of clusters in the NT world, you can look at the work Microsoft and its partners are doing on NT clusters. You need to know about Microsoft's emerging Wolfpack cluster standard, get a perspective on what various cluster solutions can do, and understand what various vendors are presenting to the NT market. For some background on the book that predicted the direction of clusters, see my review of In Search of Clusters and my interview with Pfister.
What Is Wolfpack?
Several leading NT Server systems vendors, including Compaq, Digital Equipment, HP, NCR, and Tandem, have been independently working on clustering solutions for a few years. These vendors agreed to pool their expertise with Microsoft in an initiative to produce a cross-vendor standard for NT Server clusters. This group wanted to give NT Server customers the greater choice and flexibility they wanted. So in October 1995, Microsoft announced its intent to develop strategic partnerships to fashion a new clustering standard with the code name Wolfpack.
This name and many of its technology goals derive from Pfister's book. In Chapter 4, Pfister describes a cluster as a "pack of dogs." While searching for a code name for the API, Microsoft came across this book and decided to describe clusters with the name Wolfpack, which sounds a lot cooler than Dogpack.
Wolfpack is an alias for clusters, and the six core vendors in Microsoft's clustering project consider themselves members of the Wolfpack. These members are Digital, Compaq, Tandem, Intel, HP, and NCR. Each partner contributes key components of its existing technology. Other vendors, including Amdahl, IBM, Octopus, Vinca, Marathon, Stratus, and Cheyenne, have agreed to support the Wolfpack API. These vendors are part of Microsoft's Open Process, which includes about 60 vendors and customers who are part of design previews during various stages of Wolfpack development.
Wolfpack refers to three things: a set of cluster-aware APIs, NT cluster support, and a clustering solution. A vendor can therefore claim to be Wolfpack compliant while competing with the Wolfpack solution on a different level, so if a vendor claims to support Wolfpack, you need to ask how. Here's a detailed explanation of each Wolfpack component.
|TABLE 1: Clustering Levels|
|Availability Level||Recovery Time||Failback||Both nodes used for work|
|Standby||40 to 200 seconds||No||No|
|Active||15 to 90 seconds||Yes||Yes|
|Fault Tolerant||less than 1 second||Yes||No|
Wolfpack: The API
You can make applications cluster aware by calling the Wolfpack API. The services the API accesses can speed recovery; let you take additional actions, such as proactively notifying users on failover; let you restart and reacquire nonstandard resources; and let you monitor and detect more subtle application faults than a simple crash or lock-up. Potentially, Wolfpack API services will let applications achieve higher scaleability and do dynamic load-balancing on a cluster. Microsoft has not yet announced details of how the BackOffice applications will exploit the Wolfpack API services to become cluster aware.
Wolfpack: The Cluster Support
Cluster support will make all NT Server applications Wolfpack compliant in the sense that they will run exactly the same on a server that has Wolfpack as on a non-Wolfpack server. Wolfpack will also be able to do basic failover recovery of any NT Server application without requiring you to modify the application. Wolfpack handles failover of an unmodified application by executing it through a provided wrapper dynamic link library (DLL). The wrapper notifies the cluster manager of the application's existence and creates a basic heartbeat (a check-in on the other cluster machine and its answer, at regular intervals) so Wolfpack can tell whether the application goes down or locks up. The pricing and packaging of Wolfpack are not set, but I can imagine Microsoft adding cluster support to NT in the same way that NT includes SMP support today.
Wolfpack: The Solution
Microsoft will deliver Wolfpack, the solution, in two phases. Phase 1 is two-node availability and scaling clusters (a new version of SQL Server will let you work on the same database from two servers at once). Phase 2 will allow more than two nodes in a cluster.
Reread the first paragraph in this article. That scenario describes a June 1996 demonstration of a Wolfpack availability cluster solution at PC Expo in New York City. This two-node failover capability is the basis for Phase 1 of Wolfpack (early 1997 is the estimate for delivery). The price for Wolfpack's Phase 1 release is not set, but one rumor is that NT Server will include Wolfpack at no additional cost. As I write this article, Compaq, Digital, HP, NCR, Amdahl, Stratus, and Tandem have all announced plans to OEM the Wolfpack-based cluster solution.
The next step in Phase 1 (set for the second quarter of 1997) will be an open certification program with the goal of expanding the market for two-node cluster solutions and giving NT Server customers a greater selection to choose from. Microsoft is also committed to making Wolfpack available on Intel, Alpha, PowerPC, and MIPS chips.
Reread the second paragraph in this article. It illustrates the need for scaling clusters; these clusters allow more than one node in a cluster to work on the same problem. This capability, application striping, is analogous to RAID (redundant arrays of inexpensive disks), in which several disks work together on one set of data, performing data striping. Scaling clusters would handle performance and scaling requirements of large applications and databases.
Phase 2, which will go into beta in 1998, will support clusters that have more than two nodes. Increasing the number of nodes in clusters can provide significantly more application scaleability and flexibility in availability than is possible in a two-node cluster.
The objective in Phase 2 is to certify large clusters of 16 or more nodes, in which each node can be any machine that runs NT Server. NT Server's architecture supports up to a 32-way SMP machine.
Microsoft has not yet determined what additional capabilities, if any, Wolfpack will need in order to exploit large, Phase 2 clusters, but many Wolfpack members have solved such problems before and are the industry leaders in scaling cluster technology. A few of these vendors tell me that NT already has many necessary hooks to support clusters. This situation is no accident, because Dave Cutler, the architect of NT, was also the architect of Digital's VMS, the first OS to deliver commercially available clusters.
The first phase of Wolfpack, availability clustering, has a wide range of capabilities, depending on the amount of up-time you need. For the sake of discussion, I have grouped these solutions into three levels: standby, active, and fault tolerant. The amount of recovery time, whether a solution provides failback, and whether you can use both nodes for work are the criteria that differentiate the levels as currently available products implement them.
Table 1 illustrates how each level differs from the others. Each clustering level offers a different type of solution, and each solution has implications about the future of clustering on NT. Various vendors fit into each category, and understanding their solutions, their Wolfpack strategy, and their future direction is important for anyone considering clusters.
Figure 1 illustrates the cluster configuration for the standby category. In this configuration, the primary server does all the work and mirrors any data to the other standby server in the cluster. Standby clusters require a full second copy of all data. The standby node checks the status of the primary node several times a minute to make sure it's up and running. For most solutions, the standby node sends a heartbeat--pings the primary server. If the primary server doesn't respond, the standby server changes its status from standby to primary and takes over the application load from the failed primary server. This solution automatically switches users of NT, Windows 95, and Windows for Workgroups (WFW) 3.11 to the new active server. Mac, OS/2, and Windows 3.1 users have to manually log on to the new system. When the standby server becomes the primary server, standby clustering logs off users who were logged on to the standby server before failure.
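The standby takeover logic can be pictured with a short sketch. The Python below is only an illustration of the heartbeat idea, not any vendor's implementation: the `StandbyNode` class, the `ping_primary` callback, and the threshold of three missed checks are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StandbyNode:
    """Hypothetical standby node that promotes itself after missed heartbeats."""
    ping_primary: Callable[[], bool]  # returns True if the primary answered
    max_missed: int = 3               # assumed threshold, not taken from any product
    role: str = "standby"
    missed: int = 0

    def check(self) -> str:
        """Run one heartbeat check and return the node's role afterwards."""
        if self.role == "primary":
            return self.role          # already took over; nothing left to check
        if self.ping_primary():
            self.missed = 0           # primary is alive; reset the counter
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.role = "primary"  # take over the failed primary's load
        return self.role
```

Requiring several consecutive misses before promotion is one way to avoid a takeover caused by a single dropped packet. The trade-off is a longer recovery time, which is consistent with the wide recovery range Table 1 gives for standby clusters.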
The vendors that support standby clustering are Vinca, IBM, Octopus, and Compaq. Following is an overview of the products these vendors offer.
Vinca: Ray Noorda, now of NFT Ventures and former CEO of Novell, is chairman of the board of Vinca. This company's product, StandbyServer, is configured as a disk controller device driver and appears as another disk drive. Vinca uses NT's native disk mirroring technology to keep the primary and standby servers in synch. By relying on NT's native facilities, Vinca achieves a high degree of application compatibility. Whereas other systems check for hardware failure (the heartbeat method) alone, Vinca can also check for software failures. You can configure StandbyServer to monitor application and OS processes down to the thread level.
Vinca is committed to open architecture. StandbyServer works with any Intel-based servers and any SCSI controller. The servers do not have to be identical. A dedicated high-speed link connects the primary and standby servers. This link can be an Intel EtherExpress PCI 100Mbit (Mb) or Vinca's EISA 100Mb card. Other drivers are in development. StandbyServer works on NetWare, OS/2, and NT.
To get a glimpse of where StandbyServer for NT is headed, you just need to look at StandbyServer for NetWare, which has enhanced features over Vinca's NT product. The NetWare version supports fiber technology for direct connection. In addition, you can use any node on the network as the standby server. SnapshotServer is an add-on product that facilitates backing up live files without negatively affecting network performance. Because the standby server has a live copy of files at all times, you can back up from the standby machine without affecting the network. This capability does not replace backup, but enhances regular backup software. Vinca plans to support an active-active availability cluster in the near future.
StandbyServer for OS/2 is the only availability solution for OS/2, which is why IBM fully endorses and distributes StandbyServer for its customers who need availability clusters today. StandbyServer for NT includes the software and a 25-foot dedicated link, but not the hardware interface.
IBM: IBM recently announced the IBM PC Server High Availability Solution, which bundles Vinca's StandbyServer with IBM's PC Server hardware. IBM will distribute this product through its business partners to customers who need an NT cluster solution today.
IBM will bring more than 20 years of experience in high-end cluster solutions when it moves to the NT market. One likely approach is to port the IBM Scalable POWERparallel (SP) cluster solution, which offers scaleability and performance, to NT. The objective is a cluster solution that runs on industry-standard hardware and fully supports the Wolfpack APIs. Also, IBM plans to make its Software Servers suite cluster aware. The first two products will be Notes Cluster and DB2 Cluster. The rest of the suite will follow in 1997.
Octopus: Octopus Technologies is shipping Octopus Automatic Switch Over (ASO) for NT. This unique solution can mirror data anywhere on a LAN or WAN. Other products require the clustered nodes to be near each other. The ASO feature already allows N-way failover conditions: You can connect more than two nodes to a designated server, which can assume the work of any connected node. Another unique feature of Octopus is its ability to mirror files on any hardware that supports NT. Octopus (unlike Digital's solution) can create a failover cluster between an Alpha and an Intel server.
Octopus inserts itself into NT's file system and has a proprietary replication engine. One benefit is the speed of replication. Unlike other replication technologies that mirror at the file or disk level, Octopus replicates changes only. In slow-speed connections, however, you can lose data that isn't committed at the time of failure.
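To see why replicating only changes is fast, and why a slow link can still cost you data, consider a minimal sketch. It does not describe Octopus's proprietary engine; the `ChangeLogReplicator` class and its methods are hypothetical names, and the "disk" is just an in-memory byte array.

```python
class ChangeLogReplicator:
    """Hypothetical change-only replicator: ships written byte ranges, not whole files."""

    def __init__(self, size):
        self.source = bytearray(size)  # stands in for the primary's storage
        self.target = bytearray(size)  # stands in for the mirror's storage
        self.pending = []              # (offset, data) changes not yet sent

    def write(self, offset, data):
        """Apply a write locally and queue only the changed bytes for the mirror."""
        self.source[offset:offset + len(data)] = data
        self.pending.append((offset, bytes(data)))

    def replicate(self):
        """Send queued changes to the mirror; return the number of bytes shipped."""
        sent = 0
        for offset, data in self.pending:
            self.target[offset:offset + len(data)] = data
            sent += len(data)
        self.pending.clear()
        return sent
```

Anything still in `pending` when the source fails never reaches the mirror, which is the exposure a slow connection creates.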
Compaq: Compaq offers a Recovery Server Option Kit that you can configure in two modes, Standby Recovery Server and On-Line Recovery Server. The kit includes software, cables, and a switch for a two-node configuration. In addition to the kit, you need any two Compaq servers, Compaq's external storage box, Compaq's SCSI cards, and a Compaq hardware interconnect card. Compaq implements failover primarily through proprietary hardware, rather than software.
The Standby Recovery Server option requires a manual element in its failover process. After a failover, users must log in to the standby server manually. The On-Line Recovery Server is an active availability clustering solution.
Digital Clusters for Windows NT
Digital * 800-354-9000 or 800-344-4825
clusters/default.htm or http://www.digital.com (to find your local reseller)
Price: Software, $995 per server
Prioris Kit: $3000-$4500
Amdahl * 408-746-600
Isis Availability Manager
Isis Distributed Systems * 508-460-2430 or 800-258-0990
NCR LifeKeeper for Windows NT
NCR * 800-225-5627, Ext. 1000
Price: $1500 (per node)
Marathon * 508-266-9999
Octopus Server for NT Server/Workstation
Octopus * 215-321-8750 or 800-919-1009
Price: $999, ASO Option: $249
Stratus * 888-723-4672
Recovery Server Option Kit
Compaq * 800-345-1518
Vinca * 801-223-3100 or 800-934-9530
Microsoft * 206-882-8080
Figure 2 shows the second type of availability cluster, active availability. Whereas standby clusters require a full second copy of all data, active clusters don't. In this configuration, both servers are primary and doing meaningful work, as the scenario in the first paragraph of this article illustrated. When one node fails, the users and the applications fail over to the available node in the cluster. The users experience a delay in their processing and, in some cases, can lose data that was not saved. All users and applications on the available node continue to work unchanged, although both sets of users are now running on one server instead of two, so response slows for everyone. Manual load balancing can minimize the impact of a failover condition. Another feature is automatic failback, which lets the processing and users return to the failed node once it has recovered.
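Failover and failback in an active cluster come down to tracking which node currently owns each application. The sketch below illustrates that bookkeeping under simplifying assumptions; the `ActiveCluster` class and its method names are hypothetical and are not part of Wolfpack or any vendor's product.

```python
class ActiveCluster:
    """Hypothetical active cluster; names are for illustration only."""

    def __init__(self, preferred_owner):
        # preferred_owner maps each application group to its home node.
        self.preferred_owner = dict(preferred_owner)
        self.current_owner = dict(preferred_owner)
        self.nodes_up = set(preferred_owner.values())

    def node_failed(self, node):
        """Fail over every group on the failed node to a surviving node."""
        self.nodes_up.discard(node)
        if not self.nodes_up:
            return  # no survivor left to take the load
        survivor = sorted(self.nodes_up)[0]
        for group, owner in self.current_owner.items():
            if owner == node:
                self.current_owner[group] = survivor

    def node_recovered(self, node):
        """Automatic failback: return groups to their recovered home node."""
        self.nodes_up.add(node)
        for group, home in self.preferred_owner.items():
            if home == node:
                self.current_owner[group] = node
```

With SQL Server homed on one node and IIS on the other, calling `node_failed` for the IIS node leaves both groups on the SQL Server node, and `node_recovered` sends IIS home again.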
The vendors that have active availability cluster solutions are Microsoft, Digital, Compaq, Tandem (Unix only), NCR, HP (Unix only), and Amdahl. Here's a brief summary of each of these offerings.
Microsoft: You've already read an overview of Wolfpack, so let's look at two prerelease screen shots to get the flavor of how Wolfpack clusters are configured. Screens 1 and 2 are from the Wolfpack version that Microsoft demonstrated at PC Expo.
Screen 1 is a view of the cluster administrator's console. This view shows how Wolfpack lets you manage an application and all its related resources as one group. Here, the SQL Server group includes the SQL database, a disk drive, and an IP address. With one mouse click, the administrator can move this entire group to another machine in the cluster. This capability makes load-balancing or taking a server off line for routine maintenance easy to do without bringing down important business applications.
Screen 2 shows Wolfpack's Resource Dependencies window. Traditionally, one of the tough administrative jobs with clusters has been figuring out how to prioritize all the various applications and resources so that they fail over and restart in the right order. With Wolfpack, the administrator uses this point-and-click window to establish the dependencies for each resource in an application group. Wolfpack then automatically figures out the correct restart priority for all the resources that a server or application failure affects.
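Deriving a restart order from declared dependencies is a topological sort. The sketch below shows the idea with Python's standard `graphlib` module (Python 3.9 or later). The dependency map is an assumption for illustration, loosely modeled on the SQL Server group in Screen 1, and says nothing about how Wolfpack itself computes the order.

```python
from graphlib import TopologicalSorter


def restart_order(dependencies):
    """Return resources ordered so each one starts after everything it depends on.

    dependencies maps a resource to the set of resources it requires.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Assumed dependencies: the database needs its disk and its IP address first.
sql_group = {
    "SQL Server": {"Disk", "IP address"},
    "Disk": set(),
    "IP address": set(),
}
order = restart_order(sql_group)
```

The disk and the IP address can come up in either order, but `SQL Server` is always last. A circular dependency raises `graphlib.CycleError`, which is the kind of mistake a point-and-click dependency editor would need to reject.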
Digital: If you're looking for a solution that is probably close to the Wolfpack solution, check out Digital Clusters for Windows NT. Digital launched its product before Wolfpack's release because of NT market demand for clusters. Once Wolfpack ships, Digital will provide a migration wizard to help Digital's NT cluster customers move to Wolfpack. If any functionality that exists in Digital's product doesn't make it into Wolfpack, that functionality will be available as a low-cost add-on called the NT Cluster Plus Pack.
Digital's solution supports failover of the NTFS file system, Microsoft SQL Server 6.5, and Oracle7 Workgroup Server 7.1 and 7.2, and scripting allows generic application failover. Digital also supports failover between two Intel-based servers and between two Alpha-based servers, but not between an Alpha and an Intel server. According to Digital, the problem is with NT, and Microsoft needs to address it: The page log size is different on RISC systems (such as Alpha) and Intel. At press time, Microsoft had no plans to remedy this situation.
Compaq: Compaq's On-Line Recovery Server meets the active cluster criteria except that it does not provide automatic failback. That capability automatically reroutes applications and users to their primary server if the failed node is recovered.
Compaq plans to upgrade its products with Wolfpack-compliant products when they become available. In addition, Compaq will migrate its SCSI switching technology to Tandem's ServerNet Interconnect Technology. Current customers potentially face a two-step migration--Wolfpack compliance and ServerNet implementation.
Compaq will make sure that all its hardware can participate in all Wolfpack-compliant configurations. "We want to be absolutely compatible, but also differentiate our products from other solutions," said Tim Golden, Compaq's cluster manager. "One way we will differentiate ourselves is through our alliance with Tandem, which includes their ServerNet technology. It delivers redundancy at all component levels. ServerNet has higher availability, scaleability, and throughput than other cluster interconnect devices we've seen," said Golden.
Tandem: Tandem has no NT cluster solution but will fully support the Wolfpack solution when it becomes available. Tandem plans to support Wolfpack on the low end and provide its Himalaya servers for situations that call for scaling beyond the limits of its Intel SMP-based systems.
On May 7, 1996, Tandem joined the Wolfpack core team by announcing that Microsoft had funded an effort to port Tandem's high-end availability products to NT. Tandem has built its reputation by providing high-end availability and scaleability servers for the last 20 years. The company sees NT as its ticket to move beyond the high-end market into the high-volume market. In fact, during the announcement, Tandem declared that NT really means New Tandem.
The Tandem/Microsoft alliance has several key points. First, Microsoft will provide $30 million to fund the port of Tandem's NonStop ServerWare Solutions to NT Server. These solutions include Tandem's parallel, scaleable SQL database, Tandem's clustered transaction-processing environment supporting the TUXEDO and CICS transactional APIs, and Tandem's distributed messaging and object management environment. This technology will let NT Server users take advantage of Tandem's Independent Software Vendor (ISV) portfolio of more than 1000 business-critical solutions, including online transaction processing, electronic commerce, Internet/World Wide Web, data warehousing and decision support, and online analytical processing, for the finance, telecommunications, retail, healthcare, and transport markets.
In addition, Tandem will port its ServerNet technology to NT this fall. Developed for Tandem's large Himalaya machines, ServerNet allows very high-speed communications between nodes in a cluster. ServerNet potentially provides capabilities that current open definitions do not. For example, if all the I/O devices connected to the cluster are on ServerNet, you can fail over not just disks, but printers, tape drives, and any other I/O device. As Wolfpack gets closer to delivering large scaling clusters, this high-speed I/O will be very important. The ServerNet drivers will ship with Wolfpack, and Compaq and Tandem and its partners will sell the complete solution.
NCR: Wolfpack from NCR will be a part of an overall high-availability story that includes LifeKeeper for Windows NT. LifeKeeper offers many features of Wolfpack Phase 1, including automatic failback and automatic reconnection for all client types. NCR will position LifeKeeper as a value-add clustering product with support for the Wolfpack APIs. This support will let LifeKeeper run all Wolfpack-compliant applications on Intel-based servers.
Although NCR will sell the Wolfpack solution, this company is also committed to keeping LifeKeeper one step ahead of Wolfpack. For example, the company plans to introduce a three-node cluster for LifeKeeper by the first quarter of 1997. In this configuration, all three nodes are active and can fail over to each other. In addition, a future release of LifeKeeper will support Oracle7 Parallel Server. Oracle has not announced support for Parallel Server on Wolfpack, the solution.
LifeKeeper for NT includes three recovery kits, one each for TCP/IP, NetBEUI, and SQL Server. In addition, recovery kits for Oracle, Lotus Notes, Sybase, and Exchange are available.
HP: HP's clustering roadmap includes its MC/ServiceGuard, Wolfpack, and Oracle Parallel Server. HP will port MC/ServiceGuard, now available on HP 9000 Unix, to HP NetServer application servers running NT Server. This approach will give MC/ServiceGuard customers an NT cluster solution that will not require learning a new paradigm.
Amdahl: A company long associated with high-end computing, Amdahl is offering an active cluster that scales to eight EnVista servers, which are based on Intel SHV systems. This solution is already beyond the two-node cluster of Wolfpack Phase 1.
The key to this level of scaleability is the EnVista Availability Manager, which is really the Isis Availability Manager licensed from Stratus. The Isis Availability Manager runs on each node, and a majority voting mechanism, not a heartbeat, determines when a node has failed. Once the cluster participants vote out a failed node, one of the remaining nodes picks up the load, according to rules-based logic in the cluster configuration. Isis can recognize hardware, software, and performance failure and already provides N-node failover.
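A minimal sketch of majority voting, assuming each node reports the peers it cannot reach. This is not the Isis algorithm, whose details are not described here. The function name, the report format, and the rule that more than half of the cluster must agree are all assumptions for illustration.

```python
def failed_nodes(reports, cluster_size):
    """Return the nodes that a strict majority of the cluster reports as unreachable.

    reports maps each voting node to the set of peers it cannot reach.
    """
    tally = {}
    for voter, suspects in reports.items():
        for node in suspects:
            if node != voter:  # a node's opinion of itself does not count
                tally[node] = tally.get(node, 0) + 1
    return {node for node, votes in tally.items() if votes > cluster_size // 2}
```

Unlike a single heartbeat between two machines, a vote keeps one node with a bad network card from declaring a healthy peer dead: in a four-node cluster, two nodes that accuse each other do not reach a majority, so neither is voted out.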
The node interconnect uses a switched, full-duplex, 100Mb Ethernet. By first quarter 1997, Amdahl will offer an interconnect rated at 40MB per second (MBps), probably from Fujitsu. For disk access, Amdahl uses its LVS 4500 storage solution instead of shared SCSI, providing data availability with dual-ported node failover capability.
Amdahl views compatibility with industry standards as critical. Once Wolfpack becomes available, Amdahl will add support for it and will recommend it as the preferred cluster technology for new customers.
Stratus: Stratus provides N-node availability in a pre-assembled configuration called a RADIO Cluster. A RADIO Cluster has six nodes: two compute, two storage, and two network nodes. Every component is redundant.
Unfortunately, Stratus calls every component a node, so figuring out how this system fits together took me a while. Once I got over that hurdle, I was amazed at the engineering that went into these units. The compute module, for example, has a two-way Pentium processor, a 1GB IDE drive for booting the system, and 100Base-T redundant hubs. The storage modules will support up to four 2GB PCI Fast and Wide SCSI-2 disk drives that you can custom-partition to support various access and recovery schemes. The redundant network modules are high-speed, inter-networking hubs that interconnect compute modules and storage modules and route all messages and data necessary for application execution in the RADIO cluster. RADIO requires 1-inch-high drives in the storage nodes.
Stratus owns the Isis Availability Manager, which is loaded with Isis Active Replication Technology in each node. This approach allows all the active availability features in a clustered environment with no single point of hardware, software, or network failure. Stratus plans to support the Wolfpack API, so any Wolfpack-compliant applications will be able to run on this system. Stratus has a lot to offer besides supporting the Wolfpack API: Up to 24 compute and storage nodes can be in one cluster, all nodes are hot-swappable, and zero downtime for NT database applications is available through optional Isis for Database software.
On the high end of availability clusters is the third level, complete redundancy by means of fault-tolerant clusters. As Figure 3 shows, every part of the cluster is active and redundant with another component. Failover times are within one second. The goal is 99.999% up-time, or a little more than five minutes of downtime per year. This capability is characteristic of the solutions from Marathon.
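An up-time percentage converts to a yearly downtime budget with simple arithmetic. The helper below is a hypothetical illustration that assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent):
    """Convert an up-time percentage into allowed minutes of downtime per year."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for level in (99.0, 99.9, 99.99, 99.999):
    print(f"{level}% up-time allows {downtime_minutes_per_year(level):.2f} minutes down per year")
```

At 99.999% the budget works out to about 5.26 minutes a year. At failover times of 15 to 200 seconds, a standby or active cluster would use up that allowance after only a few failovers.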
Marathon: Marathon provides fault tolerance with off-the-shelf components. MIAL 1, Marathon's first product, focuses on realtime data protection. A basic configuration requires three computers: a compute server and two data servers. The network cards, CPU, and disks are completely redundant. This system writes information from the compute server to both data servers simultaneously. If one data server fails, the users and processing will continue on the main server.
MIAL 1 uses a proprietary interconnect between the two data servers. This card offers full-duplex, hardware-assisted data integrity checks and 32MBps throughput. If one data server fails, the cluster will remain operational. When the failed server is repaired, the available data server will automatically resynchronize the recovered server by replicating the entire disk storage to the recovered server. This replication happens in the background at the rate of about five to 10 minutes per GB of storage. MIAL 1 replicates the entire disk because it assumes everything on the recovered server is bad.
MIAL 1 has one point of failure: If the compute server fails, the cluster will go down. Marathon says it will fix this failure point in the next version, MIAL 2.
Marathon's solution does not require any application to be cluster aware, nor does it require recovery scripts. Marathon's position is that this ease of use lets administrators easily implement fault-tolerant systems. The company wants to emphasize total cost of ownership--Marathon believes its solutions let you keep up with the industry's power curve by adopting the latest CPUs. You need no APIs, no scripts, no special applications, no special version of NT. So why does Marathon need Wolfpack? The company wants to let its cluster solution participate as a node in a Wolfpack scaling cluster.
MIAL 1 includes the three proprietary Marathon Interface Cards (MICs), software, and a SplitSite Data Link, which lets you configure the system and assists in system management. With this interconnect device, you can plug in fiber optic drivers or copper cable and connect between buildings. You can also configure this device to activate an alarm if a component fails.
Making Clusters Commonplace
Windows NT has taken a lot of criticism for not scaling or being as fault tolerant as large systems. Clusters let Microsoft address these concerns in a way that fits with its high-volume channel strategy. The traditional enterprise vendors are handing over some of their most prized solutions to participate in this next wave of enterprise computing. Once again, NT is the bridge between the high-end and high-volume solutions market.
The high-volume cluster market is only beginning. Will clusters of four-way SMP systems have better price/performance than 8-, 12-, and 16-way SMP systems? Can Microsoft encourage ISVs to write Wolfpack cluster-aware applications that provide fault tolerance and scaleability? If the answers are yes, Microsoft is in a good position to make clusters commonplace.
The flavor of each cluster solution is interesting, but the success of the standard is critical. As with its other standards, Microsoft will declare the Wolfpack API a standard when more Wolfpack-compliant applications are shipping than the sum of all solutions based on competing technologies--when Wolfpack has more than 50% market share. During design previews, more than 20 software vendors resolved to deliver Wolfpack-compliant applications by March 1997. These applications will be cluster aware and will provide capabilities, including scaleability, beyond basic failover. I expect Microsoft to have SQL Server cluster aware by the same time and the rest of the BackOffice suite by the end of 1997.
As Wolfpack-compliant applications become available, Microsoft will probably develop a new logo, something like, "Windows NT Cluster Enabled," to show that a solution is cluster aware. What does cluster awareness buy you? In the event of a failure, a cluster-aware application can restart each user right where he or she left off. In a cluster-aware database, a cluster-aware application can start the database server, log in the user to the database, and restart an existing query. A non-cluster-aware application can return a message such as, "drive not available," and you have to manually return to the previous state.
Microsoft has just released the preliminary Wolfpack API set to the Open Process participants under nondisclosure agreements, so vendors are only now beginning to develop Wolfpack-compliant applications. The theory is that such applications will run on any Wolfpack-compliant cluster solution. Unfortunately, the cluster solutions are very different, so making this goal a reality is challenging.
To get a vendor's perspective on this challenge, I spoke with Cheyenne, a company that is working on a Wolfpack-compliant add-on to its backup solution, ARCserve 6.0.
Cheyenne views Wolfpack as a way to satisfy increasing demand from enterprise-level customers, who want Cheyenne to support clusters. Cheyenne already supports availability features such as RAID and recovery, so the addition of support for availability clusters is a logical next step. In ARCserve's RAID 5 implementation, three or more tape devices together can perform one backup. Screen 3 illustrates Cheyenne's approach. If one tape device fails, the backup continues without interruption. In addition, you can put each tape device on a separate SCSI bus to provide bus fault tolerance. Finally, you can restart tape backups after a failure.
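The reason a backup can survive the loss of one tape device is parity. The sketch below shows the general XOR parity idea behind RAID 5 with two data blocks and one parity block per stripe. It is a generic illustration rather than ARCserve's implementation: the function names are hypothetical, and the rotation of parity across devices that RAID 5 normally performs is left out.

```python
def xor_blocks(a, b):
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def write_stripe(block_a, block_b):
    """Return the three blocks written to three devices: data, data, parity."""
    return [block_a, block_b, xor_blocks(block_a, block_b)]


def recover(stripe, lost_index):
    """Rebuild the block on a failed device from the two surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors[0], survivors[1])
```

Because any one block equals the XOR of the other two, losing a single device loses no data. Losing two devices at once cannot be recovered.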
Cheyenne believes it can support standby and fault-tolerance clusters today by attaching the tape devices to the mirrored server. The Wolfpack solution (active availability) is much more challenging than that approach. First, Cheyenne needs to determine whether including the tape devices on the same SCSI bus as the disk drives is possible. That way, both nodes in the cluster can share the tape device. If this method is possible, Cheyenne needs a way to switch from one node to the other. If the tape devices are attached to both nodes, you have some interesting tape-management problems to solve.
The challenges to a Wolfpack-style solution are not trivial. In the past, such challenges meant a vendor had to create a different version for each clustering solution it supported. This necessity caused application vendors to support only the cluster solutions with the highest market share. If application vendors such as Cheyenne can tackle these problems by implementing the Wolfpack API, we will see many cluster-aware applications in the next 18 months.
Cheyenne * 516-465-4000
HP * 301-670-4300 or 800-752-0900
IBM * 520-574-4600 or 800-426-3333
Tandem * 408-285-6000 or 800-538-3107