The other day, a hardware failure brought down our Exchange server. The failure caused a panic
in our user community because we consider email availability as important as a dial tone. Had we
been using a Windows NT cluster, our users would never have noticed the problem. By providing
continuous availability through replication, an NT cluster could have saved us a lot of frustration
and prevented the loss in productivity.
Today's NT clustering solutions solve one business computing problem: availability. By
replicating data, applications, and even entire systems, clustering lets two or more systems watch
each other's back and take over the workload (user connections, applications, and services) in case
one system fails. This article reviews and categorizes the clustering solutions currently
available and illustrates the types of business computing problems that clustering can help
solve now.
So What's a Cluster Anyway?
A cluster is a group of whole, standard computers that work together as a unified
computing resource and that can create the illusion of being one machine, a single system image.
(With NT clusters, the term whole computer, which is synonymous with node, means a
system that can run on its own, apart from the cluster. If you're not familiar with clustering
terms, you can refer to "Clustering Terms and Technologies.") This unified
computing resource ensures availability because any node can take on the workload of any other node
that happens to fail.
Clusters come in three configuration types: active/active, active/standby, and fault tolerant.
Let's examine each one:
- Active/active: All nodes in the cluster perform meaningful work. If any node fails,
the remaining node (or nodes) continues handling its workload and takes on the workload from the
failed node. Failover time is between 15 seconds and 90 seconds.
- Active/standby: One node (the primary node) performs work, and the other (the standby,
or secondary node) stands by waiting for a failure in the primary node. If the primary node fails,
the clustering solution transfers the primary node's workload to the standby node and terminates any
users or workload on the standby node. Failover time is between 15 seconds and 90 seconds.
- Fault tolerant: A fault-tolerant cluster is a completely redundant system (disk and CPU) whose
goal is to be available 99.999 percent of the time. That goal translates to fewer than 6 minutes of
downtime per year. Both nodes of a fault-tolerant cluster simultaneously perform identical tasks;
the nodes' workloads are redundant. Failover time is less than 1 second.
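The downtime figure quoted for fault-tolerant clusters is simple arithmetic, and the same arithmetic shows how little room a 15-to-90-second failover leaves at that availability level. The following Python sketch is illustrative only: the function names are invented for this example, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent):
    """Yearly downtime budget implied by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def failovers_within_budget(availability_percent, failover_seconds):
    """How many failovers of a given length fit in the yearly budget."""
    budget_seconds = downtime_minutes_per_year(availability_percent) * 60
    return int(budget_seconds // failover_seconds)


if __name__ == "__main__":
    budget = downtime_minutes_per_year(99.999)
    print(f"99.999% allows about {budget:.2f} minutes of downtime per year")
    print("90-second failovers that fit:", failovers_within_budget(99.999, 90))
    print("15-second failovers that fit:", failovers_within_budget(99.999, 15))
```

Under these assumptions, the yearly budget is about 5.26 minutes, which covers only three 90-second failovers. That is why a 99.999 percent goal calls for subsecond failover.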
To illustrate the definition of a cluster, let's say you have one group of users relying on file
and print services on Server A and another group of users accessing an Oracle database on Server B.
Servers A and B are nodes in an active/active cluster. If Server A fails, Server B continues
handling its workload and picks up Server A's workload. The users accessing the Oracle database do
not notice any change in their service; the file and print users experience at most a short delay.
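The Server A and Server B example can be modeled in a few lines of Python. This is a toy model for illustration, not how any shipping product works: the Node and ActiveActiveCluster classes are hypothetical, and the rule of handing each workload to the least-loaded survivor is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    workloads: List[str] = field(default_factory=list)
    alive: bool = True


class ActiveActiveCluster:
    def __init__(self, nodes: List[Node]) -> None:
        self.nodes: Dict[str, Node] = {node.name: node for node in nodes}

    def fail(self, name: str) -> None:
        """Mark a node as failed and move its workloads to the survivors."""
        failed = self.nodes[name]
        failed.alive = False
        survivors = [n for n in self.nodes.values() if n.alive]
        if not survivors:
            raise RuntimeError("no surviving node can take over the workload")
        for workload in failed.workloads:
            # Assumed policy: give each workload to the least-loaded survivor.
            target = min(survivors, key=lambda n: len(n.workloads))
            target.workloads.append(workload)
        failed.workloads = []


if __name__ == "__main__":
    server_a = Node("Server A", ["file and print"])
    server_b = Node("Server B", ["Oracle database"])
    cluster = ActiveActiveCluster([server_a, server_b])
    cluster.fail("Server A")
    print(server_b.name, "now runs:", server_b.workloads)
```

After the failure, Server B carries both workloads. The Oracle workload never moves, which is why its users see no change.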
NT Clustering Solutions
As the need for availability becomes ever more crucial in the NT environment, many third-party
vendors and Microsoft have introduced or are about to introduce clustering solutions for NT. To help
you evaluate these clustering solutions, let me briefly explain Microsoft's clustering initiative,
Wolfpack, and categorize its capabilities in comparison with those of some prominent third-party
clustering solutions. (For reviews of several individual clustering products, including Wolfpack,
see Lab Reports.)
Wolfpack
Wolfpack is Microsoft's two-node, active/active clustering solution and set of APIs for NT.
Wolfpack's purpose is to provide high availability to your NT Server environment.
Wolfpack will have an effect in several significant areas. First, you can expect all server
manufacturers who want to reach NT customers to offer Wolfpack-based clustering support this year.
Even a year before its release, Wolfpack had the backing of Digital Equipment, Compaq Computer,
Tandem, Intel, Hewlett-Packard, NCR, and IBM.
Theoretically, Wolfpack will work on any two Intel-based or any two Alpha-based servers, but
you can't mix Intel and Alpha. However, in practical terms, the number of supported systems will be
very restricted because to get on the Wolfpack Hardware Compatibility List (WHCL), each manufacturer
must test complete configurations (system, disk subsystem, and SCSI adapter) for compatibility. This
approach stands in contrast to NT's existing Hardware Compatibility List (HCL), which lets
manufacturers list individual system components. For the WHCL's first release, Microsoft will let
each manufacturer list only two configurations. Microsoft will support Wolfpack only for systems on
the WHCL, so don't try to build your own Wolfpack clustering solution. Although these requirements
will initially limit the selection of Wolfpack-compliant configurations, the WHCL will grow over
time.
The second area that Wolfpack will affect is storage. In a Wolfpack-based solution, you need
only enough storage in your servers to run NT Server and Wolfpack. A disk subsystem that both
servers share will provide the bulk of your storage. As a result of this approach, server
manufacturers will want to differentiate themselves by improving their storage performance. Those
manufacturers that don't have their own subsystems will have to obtain them from storage providers
such as CMD Technology, Data General, and BoxHill Systems. Some manufacturers, such as Compaq, will
use clusters as a way to promote fibre-channel-based storage solutions because fibre-channel storage
has significant advantages over SCSI in both throughput and cable length.
Third, Wolfpack will affect server applications. Wolfpack is not only a clustering solution,
but a set of APIs. These APIs let developers make their server application "cluster aware."
Such awareness could mean easier installation in a clustering environment, better failover
capabilities, and the ability to scale an application beyond one node. For example, Microsoft plans
to use the Wolfpack APIs with its Transaction Server to let two nodes work on the same SQL Server
database query. This technology combination is fundamental to Microsoft's plans to provide
enterprise scalability.
The Wolfpack APIs have been available to developers for only a short time, so only a few
applications will initially be available. However, as the adoption of clusters becomes more
commonplace, the demand for cluster-aware applications will increase as well. Expect Microsoft's
BackOffice applications to become cluster aware during 1997 and 1998.
Fourth, Wolfpack will have an impact on other NT clustering solutions. Many competing NT
clustering solutions have already declared support for the Wolfpack APIs. This API support will let
Microsoft's competitors support Wolfpack cluster-aware applications and still provide enhanced
functionality over the Wolfpack solution.
Finally, the price and availability of Wolfpack-based solutions will drive NT cluster solutions
into the mid-to-low end of the server market. The price of Wolfpack-based solutions is about 20
percent of the price of solutions available for UNIX. This pricing alone will make companies that
have never considered clustering take a look at it. In addition, the availability of Wolfpack-based
solutions from many vendors will create competition, improve awareness in the market, and help
stimulate demand in the mid-to-low markets that they serve.
Third-Party NT Clustering Solutions
Wolfpack isn't the only game in town. In fact, several solutions are more mature than Wolfpack,
offer additional functionality, and solve different problems. Table 1 lists some prominent solutions
(including Wolfpack) and categorizes the type of clustering solution they offer, their data-handling
strategy, their hardware interconnect, and their flexibility in hardware choices. (For a summary of
information about the clustering solutions reviewed in this issue, see "Clustering Solutions
Feature Summary," and for information about other clustering solutions, see "Buyer's
Guide to Clustering Solutions.") Let's look at some of the categories in Table 1, and
then we can apply our knowledge of clustering solutions to some real-life scenarios to determine
what solution is best for a given situation.
Data handling. NT clusters use one of three data-handling methods: mirroring,
switching, and redundancy. In mirroring, one node replicates another node's data. Octopus,
NSI, and Vinca rely on this technique. With switching, each node has its own disk source,
which may be RAID or just a bunch of disks (JBOD). Both nodes share a SCSI bus, which lets the
surviving node take over the failing node's disk. Finally, with redundancy, the clustering solution writes data
to both nodes simultaneously.
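The difference among the three data-handling methods is easier to see in a small model. The Python sketch below is a simplification under stated assumptions: the class names are hypothetical, disks are plain dictionaries, and the mirroring queue (which lets the target lag the source until replication runs) is an assumption for illustration rather than a description of any vendor's implementation.

```python
class MirroredPair:
    """Mirroring: writes to the source node are replicated to the target."""

    def __init__(self):
        self.source = {}
        self.target = {}
        self.pending = []  # assumed replication queue

    def write(self, key, value):
        self.source[key] = value
        self.pending.append((key, value))

    def replicate(self):
        for key, value in self.pending:
            self.target[key] = value
        self.pending.clear()


class SwitchedDisk:
    """Switching: one node's disk on a shared bus; ownership can move."""

    def __init__(self, owner):
        self.owner = owner
        self.blocks = {}

    def write(self, node, key, value):
        if node != self.owner:
            raise PermissionError(f"{node} does not own this disk")
        self.blocks[key] = value

    def take_over(self, new_owner):
        # The surviving node claims the failed node's disk over the bus.
        self.owner = new_owner


class RedundantPair:
    """Redundancy: every write goes to both nodes at the same time."""

    def __init__(self):
        self.node_a = {}
        self.node_b = {}

    def write(self, key, value):
        self.node_a[key] = value
        self.node_b[key] = value
```

In this model, mirroring and redundancy both keep two copies of the data, whereas switching keeps one copy and changes which node controls it.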
Hardware interconnect. The hardware interconnect is the required physical link
between the nodes in the cluster. Several solutions require proprietary connection devices. Other
solutions use any type of TCP/IP-supported connection, such as Ethernet.
Hardware flexibility. The hardware flexibility column in Table 1 rates
available choices for nodes. For example, Stratus' solution works on only Stratus hardware and is
therefore rated poor in the flexibility column. Wolfpack requires manufacturers to list complete
configurations--not components--on the WHCL and therefore receives a rating of fair. Octopus will
work with any NT-based servers (Intel, Alpha, MIPS, PowerPC) and therefore is rated excellent.
Vinca will work with any two NT-based servers (Intel only) and therefore is rated good.
Scenarios
A variety of clustering solutions can solve availability problems in an NT environment. The
purpose of the following scenarios is to show how you can apply clustering solutions to solve
specific problems.
SITUATION 1
Expanding Your File and Print Server
Problem: Your company has a single-processor Pentium-based NT Server that you use for
file and print, and it is running out of steam. Your applications include a heavily used multi-user
Access97 database and Office97. You have to reduce downtime, especially with the Access97 database,
which has become critical.
Solution: If you buy an additional server, you can use a mirror-based solution such as
Octopus to connect the two servers into a cluster. Now you can ease your capacity crunch by putting
your Office97 files on one server and Access97 on the other server. At the same time, you can
replicate critical data between the servers and create a fault-resilient environment.
Could you use Wolfpack in this situation? You could, but only if your new configuration is on the
WHCL, which is highly unlikely right now. Also, Wolfpack requires a SCSI-based disk subsystem, which
is an extra purchase.
SITUATION 2
Setting Up a Web-based Storefront Using Merchant Server
Problem: Your company has decided to take orders and payments over the Internet. For
optimum performance, you decide to run Merchant Server and Internet Information Server (IIS) on one
server and SQL Server on another. Because both servers will have active users, you need an
active/active clustering solution. A 30-second delay is acceptable during failover. You have 30 days
to deliver.
Solution: Wolfpack isn't shipping yet, so you can go with either LifeKeeper or
FirstWatch. Because you have no existing equipment, you can buy a SCSI-based solution (two servers
and one disk subsystem) from a single vendor. One possible solution is Data General's NT
Cluster-in-a-Box, which comes to you with everything preconfigured from the manufacturer. (For a
review of this solution, see "NT Cluster-in-a-Box.") If you can wait until
Wolfpack ships, it will also solve your problem.
SITUATION 3
Credit Card Verification Service
Problem: You've decided to cash in on the electronic commerce craze and provide real-time
verification for credit card transactions on the Internet. Even a few seconds of failure could
result in the loss of millions of dollars of transactions.
Solution: If you're brave enough to try this service on NT, your only solution today is
from Marathon Technologies because it's the only solution that offers subsecond failover times and
eliminates the need to restart user transactions. Its configuration duplicates both memory
(redundant compute nodes) and disk (redundant data nodes).
Marathon Technologies' solution takes four off-the-shelf computers working together to create a
cluster. (For details about this solution, see the sidebar, "Marathon Technologies' Endurance
4000.") You do not need to make any software changes.
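Situations 2 and 3 differ mainly in how much failover delay the business can tolerate. As a rough aid, the Python sketch below filters the three configuration types by the failover times quoted earlier in this article. The function name is hypothetical, and the 0-second lower bound for fault-tolerant clusters is an assumption, because the article states only that failover takes less than 1 second.

```python
# Failover ranges in seconds, from the configuration list in this article.
# The 0.0 lower bound for "fault tolerant" is an assumption.
FAILOVER_SECONDS = {
    "active/active": (15.0, 90.0),
    "active/standby": (15.0, 90.0),
    "fault tolerant": (0.0, 1.0),
}


def candidate_configurations(max_delay_seconds):
    """Return (configuration, always_fits) pairs for a tolerated delay.

    A configuration is a candidate if its fastest failover fits the limit;
    always_fits is True only if its slowest failover fits as well.
    """
    candidates = []
    for config, (fastest, slowest) in FAILOVER_SECONDS.items():
        if fastest <= max_delay_seconds:
            candidates.append((config, slowest <= max_delay_seconds))
    return candidates


if __name__ == "__main__":
    print(candidate_configurations(30))  # storefront: 30-second delay is fine
    print(candidate_configurations(1))   # card verification: subsecond only
```

With a 30-second limit, all three types remain candidates, but the active configurations are flagged because their failover can take up to 90 seconds. With a 1-second limit, only the fault-tolerant type remains.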
SITUATION 4
Hot-Site Backup
Problem: As part of your disaster recovery plan, you want to maintain a hot site in case
your primary site is destroyed. This plan requires the ability to mirror a server to a location 20
miles from the primary site.
Solution: Most clustering solutions today assume that the cluster nodes are within two
miles of each other. Therefore, you need a solution that can provide mirroring across a WAN.
Currently, only Octopus, NSI, and Vinca can provide this functionality. (For reviews of these
solutions, see "Octopus SASO 2.0," "Double-Take 1.3 Beta,"
and "Vinca StandbyServer for NT.")
SITUATION 5
Remote Application Access
Problem: You need to provide fault-tolerant remote access to your 500-member sales
force. They need 24x7 remote access to your company's applications.
Solution: A Citrix server will solve the remote application access problem. Cubix
offers a fault-tolerant solution for Citrix servers by providing load balancing and failover for
multiple Citrix servers in a manageable communications cluster. (For a review of the Cubix solution,
see "RemoteServ/IS.")
SITUATION 6
OS/2 Users Need Access to Lotus Notes 4.0
Problem: Your OS/2 client users need immediate access to Lotus Notes 4.0 for NT. Lotus
Notes is a critical application, so if users lose access for longer than 90 seconds, you're fired.
Solution: Vinca's StandbyServer for NT is one of the few solutions that support OS/2
clients. IBM is one of Vinca's key distributors and provides OS/2 support. Purchase a new server to
run Lotus Notes, and use the old server as a standby server.
SITUATION 7
Schedule Upgrades to Your System
Problem: You would rather not spend all your nights and weekends upgrading your systems.
Solution: By putting your servers into a cluster group, you can manually fail over a
node during working hours. Remember, the users are still working on the remaining node. Now you can
apply a service pack, test it, and pray.
Once you are satisfied that the service pack changes are working, you can manually fail back
the node and the workload. Any NT clustering solution currently available will work in this
scenario.
SITUATION 8
Manually Load Balancing Your System
Problem: You have too many applications running on one server while another server is
barely used.
Solution: Ordinarily, you have to take down both servers, change their configuration,
and restart. If the servers are part of an active/active cluster group, you can manually fail over a
single application without taking down an entire node. This approach effectively moves the
application from one server to another.
You must make sure the solution supports application failover (as opposed to system
failover), which moves a single application while the rest of the node keeps running. For example,
even though Octopus is active/active, it supports only system failover today, which requires taking
down the node. However, Octopus SASO 3.0, which is due to ship soon after you read this article,
supports application-level failover.
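The distinction between system failover and application failover can be sketched as two operations on a map of nodes to applications. This Python example is only a model: the function names, server names, and application lists are hypothetical, and real products perform far more work (moving connections, disks, and services) than shuffling list entries.

```python
def system_failover(apps_by_node, source, target):
    """Move every application off `source`, which goes down as a whole."""
    apps_by_node[target].extend(apps_by_node[source])
    apps_by_node[source] = []


def application_failover(apps_by_node, app, source, target):
    """Move one application; everything else on `source` keeps running."""
    apps_by_node[source].remove(app)  # raises ValueError if app is not there
    apps_by_node[target].append(app)


if __name__ == "__main__":
    # Hypothetical starting point: one overloaded server, one idle server.
    cluster = {
        "Server 1": ["SQL Server", "Exchange", "IIS"],
        "Server 2": ["file and print"],
    }
    application_failover(cluster, "IIS", "Server 1", "Server 2")
    print(cluster)
```

Manual load balancing needs only the second operation. The first operation empties the source node, which is what a system-failover-only product would force you to do.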
SITUATION 9
Two SQL Servers
Problem: You need high availability for users accessing two independent SQL Server
databases, each running on a separate server.
Solution: You need an active/active application clustering solution so that both nodes
can be running SQL Server simultaneously. This requirement eliminates Wolfpack from your list of
choices, because it can run only one instance of SQL Server per cluster. However, Digital
Equipment's Wolfpack clustering add-on pack and NCR's LifeKeeper let you run two copies of SQL
Server in the same cluster, allowing each server to be the fallback for the other and thus
increasing availability.
SITUATION 10
Scaling Exchange
Problem: You want to scale Exchange to run faster and have high availability. You have a
dual Pentium Pro server.
Solution: Adding two CPUs to your server configuration would be nice, but
unfortunately, Exchange scales effectively to only two CPUs (for more information about Exchange's
ability to scale, see Joel Sloss, "Optimizing Exchange to Scale on NT," November 1996). In
fact, the next release of Exchange (version 6.0) has been dubbed the "performance release"
and will address this scalability problem. Wolfpack won't address scalability until phase 2, which
isn't due until 1998. So are we stuck?
Valence Research's Convoy Cluster claims to add availability and scalability for TCP/IP
applications and to provide load balancing among nodes in a cluster. This product is primarily aimed
at intranet applications. Convoy Cluster was not available when we tested solutions for this issue.
If this solution can scale, it will leapfrog Wolfpack by a year.
Future Trends
As these scenarios demonstrate, Wolfpack is not the appropriate solution in every case. Even so,
Wolfpack is having a huge effect on hardware and software vendors.
When Wolfpack phase 2 starts shipping in 1998, developers can use the Wolfpack APIs to create
applications that will let cluster nodes work in parallel. The issue of scalability will start a
heated debate among system vendors: Is a cluster of 4-way SMP systems better than 8-, 12-, and
16-way SMP systems? If the answer is yes, NT will never have to scale beyond four CPUs in a single
system. As long as you can cluster 4-way systems and scale performance, NT will have a price and
performance unrivaled in the marketplace.
In the early adoption phase, companies will want to buy complete cluster-in-a-box
configurations, hoping to eliminate as many problems as possible. However, as clustering moves
mainstream, users will demand the ability to mix and match components. Keeping up with NT's HCL is
hard enough, and keeping up with the WHCL will be even harder. Octopus has been on the leading edge
for more than two years by letting users mix and match components easily. Other vendors will need
to do the same.
As more system vendors support Wolfpack, additional features will provide a competitive
advantage. For example, Digital supports Wolfpack, but also offers a cluster add-on package that
lets both nodes of a cluster run SQL Server and gives existing users of Digital NT Cluster a
migration wizard. Compaq, Tandem, and Dell will enhance their Wolfpack offerings by supporting
ServerNet, a high-speed interconnect. NCR supports Wolfpack, but also supports LifeKeeper, which
allows three-node clusters, compared with Wolfpack's two-node limitation.
Finally, look for other vendors to solve the scalability problem before Wolfpack. For example,
Oracle Parallel Server lets two or more Oracle database server nodes work on the same database,
running queries in parallel on multiple nodes. Oracle will try to one-up Microsoft by shipping this
level of scalability on NT before Microsoft can release the parallel version of SQL Server (version
8.0).