Keeping mission-critical applications running is one of an IT department's most important responsibilities. Although clustering products provide an effective high-availability solution, the failover process can disrupt application processing for 30 seconds or longer. Depending on the client application's design, users might have to reconnect to the clustered application when it resumes on the new node, and if the failed node sits on a remote site, you'll have to dispatch a technician to repair it. Furthermore, Windows 2000 Datacenter Server-based clusters require careful management to maintain their high level of reliability.

Several server vendors have developed specialized products that address some or all of these concerns. Marathon Technologies, NEC Technologies, and Stratus Technologies have introduced solutions that claim to deliver five-nines hardware reliability for departments and small-to-midsized businesses. Their solutions rely on fault tolerance rather than clustering technology and use Win2K Advanced Server with standard versions of your applications. Unlike clustering, in which a server failure halts applications temporarily while application processing shifts to an alternate node, fault-tolerant systems let applications run uninterrupted on a redundant subsystem. After you replace the failed parts, both clustered nodes and fault-tolerant systems halt processing temporarily. NEC and Stratus say that powering up and resynchronizing the new part (known as reintegration in fault-tolerant systems) can take as long as 12 seconds under Win2K AS. Marathon says its reintegration times are a few seconds at most. In contrast, failing back a cluster can halt application processing for 30 seconds or more.
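To put the five-nines claim and the quoted interruption times in perspective, here is a minimal Python sketch of the arithmetic. Five-nines availability allows roughly 315 seconds (a little over 5 minutes) of downtime per year. The function names are hypothetical, and counting each failback or reintegration pause against the uptime budget is an assumption made for illustration, not the vendors' own accounting.

```python
# Assumption: a 365-day year; the vendors don't say how they measure uptime.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_budget_seconds(availability: float) -> float:
    """Seconds of downtime per year permitted at a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def interruptions_within_budget(availability: float, pause_seconds: float) -> int:
    """How many pauses of a given length fit inside the yearly budget."""
    return int(downtime_budget_seconds(availability) // pause_seconds)


FIVE_NINES = 0.99999

# Pause lengths quoted in the article: 12 seconds for NEC/Stratus
# reintegration, 30 seconds for a cluster failover or failback.
for label, pause in [("reintegration (12 s)", 12), ("cluster failback (30 s)", 30)]:
    count = interruptions_within_budget(FIVE_NINES, pause)
    print(f"{label}: about {count} events per year fit in a five-nines budget")
```

Under these assumptions, the yearly budget absorbs more than twice as many 12-second reintegrations as 30-second cluster interruptions.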

When you compare clustering and fault-tolerant technologies, remember that Microsoft Cluster service addresses hardware and software failures, whereas fault-tolerant systems primarily address hardware reliability. Although the approaches that NEC, Marathon, and Stratus use in their fault-tolerant architectures should reduce the likelihood of a software failure, if you need Cluster service's high level of software reliability, you'll need to purchase cluster-aware versions of your applications, which is an added expense.

A Unique Approach
Marathon's Endurance 6200 system, which the Windows & .NET Magazine Lab tested ("Endurance 6200 3.0," July 2001, http://www.winnetmag.com, InstantDoc ID 21140), uses four servers that appear as one system to the application and the user. Two servers function as compute engines; the other two function as I/O Processors (IOPs). Marathon separates the application environment from most drivers, thereby shielding the application from driver-induced failures. The Endurance product is targeted at applications that run on single- or dual-processor server systems.

The compute engines have one or two processors, memory, core logic, and a disk drive; the IOPs contain from one to four CPUs, memory, disk controllers, storage, and NICs. All four servers are connected, and Marathon's proprietary NICs provide 50Mbps throughput. You must configure compute engines and IOPs identically. Each compute engine is paired with an IOP to form what Marathon calls a tuple. Both compute engines run the same applications in lockstep and store data on their respective IOPs. When a fault halts processing on one tuple, processing continues uninterrupted on the other.
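The tuple arrangement can be pictured with a small toy model in Python. This is only a conceptual sketch, not Marathon's implementation: the class and method names are hypothetical, and a plain list of operations stands in for the application state that each tuple keeps on its IOP.

```python
from dataclasses import dataclass, field


@dataclass
class RedundantTuple:
    """Hypothetical stand-in for one compute engine plus its IOP."""
    name: str
    healthy: bool = True
    state: list = field(default_factory=list)


class LockstepPair:
    """Toy model: two tuples apply every operation in the same order."""

    def __init__(self) -> None:
        self.tuples = [RedundantTuple("A"), RedundantTuple("B")]

    def execute(self, op: str) -> None:
        survivors = [t for t in self.tuples if t.healthy]
        if not survivors:
            raise RuntimeError("both tuples have failed")
        # Every healthy tuple records the same operation, so a single
        # failure never interrupts processing.
        for t in survivors:
            t.state.append(op)

    def fail(self, name: str) -> None:
        for t in self.tuples:
            if t.name == name:
                t.healthy = False

    def reintegrate(self, name: str) -> None:
        # Resynchronize the repaired tuple from a healthy one; in the real
        # product this is the step that briefly pauses the application.
        source = next(t for t in self.tuples if t.healthy)
        for t in self.tuples:
            if t.name == name:
                t.state = list(source.state)
                t.healthy = True


pair = LockstepPair()
pair.execute("write-1")
pair.fail("A")            # simulate a hardware fault on tuple A
pair.execute("write-2")   # processing continues on tuple B
pair.reintegrate("A")     # tuple A copies B's state and rejoins
```

After reintegration, both tuples again hold identical state, which is the condition the lockstep design depends on.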

When the Lab tested Endurance 6200 last year, the product performed flawlessly: application processing continued uninterrupted when we initiated hardware failures, and we saw only short pauses when we reintegrated the downed tuple. In our tests, the longest application processing interruption during reintegration was just 4.5 seconds, much shorter than a cluster-failback interruption.

When we reviewed Endurance 6200, the only way to purchase the product was to buy identically configured server pairs with the product already integrated, which made Endurance 6200 expensive unless you implemented it as part of an existing plan to add new servers. Now you can purchase the software and Marathon Interface Cards (MICs) as a kit and retrofit existing servers for about $20,000 (you must still use identically configured server pairs for the compute engines and the IOPs). Endurance 6200 runs on Win2K Server, Win2K AS, Windows NT Server, and NT Server, Enterprise Edition (NTS/E).

Marathon recently announced a new software-only implementation of the Endurance product that will require two servers rather than four. (See the sidebar "2 Servers Are Better Than 4.") With the current Endurance hardware and software kit, your application needs only one license. If you purchase either product with a Win2K license, you'll need only one license for your OS, as well. If you choose to rely on your existing Win2K volume license agreement, the wording of that agreement will determine whether you'll need a second OS license. The new product (whose name hasn't been finalized) will be available for Win2K and will support Windows .NET Server (Win.NET Server) 2003 shortly after its release. Marathon has no plans to support NT with this product.

A Different Approach
NEC and Stratus have taken a different approach to fault-tolerant computing. Stratus licensed its fault-tolerant core logic chipset to NEC, and NEC has designed several two- and four-processor servers around it that both vendors market under their own names. Each vendor uses its own peripheral components or those of a selected OEM, however. These server designs use Win2K AS and unmodified, standard versions of applications. Unlike clustered systems, the NEC and Stratus products require just one OS and application license.

The first server based on both vendors' technology is a dual-processor server that employs Intel 800MHz Pentium III processors with 256KB cache. The Lab tested Stratus's version of this server, the ftServer 3210 (see "Stratus ftServer 3210," July 2002, http://www.winnetmag.com, InstantDoc ID 25335). NEC's version of this server is known as the Express5800/ft 320La.

These servers have user-replaceable component modules that let a bank branch manager, for example, replace a failed module and integrate the new module without having to send for an IT person. Each 8U (14") rack-mount server includes shoe-box-sized pairs of processor, I/O, and storage modules. Each processor module contains a motherboard with one or two processors, core logic, and memory; each storage module contains as many as three SCSI drives. The I/O module contains two SCSI controllers along with video subsystems and PCI slots. The server also includes a pair of NICs for maximum redundancy. The NEC and Stratus servers use VERITAS Software's VERITAS Volume Manager (included with each server) to mirror the OS and data on the pair of disk subsystems.

With this component redundancy, the servers can run applications in lockstep synchronization and execute parallel instructions on each pair of components. When the NEC and Stratus servers detect a fault, processing continues uninterrupted on the opposite subsystem (rather than on an opposite compute engine and IOP pair, as in Marathon's design), and full redundancy for the remaining subsystems is maintained. At that point, the system notifies the systems administrator who manages the server, who can, if necessary, instruct an untrained employee to swap out the failed module for a new one without powering down the server. Reintegration of the failed part should suspend application processing for no more than 12 seconds. When the Lab tested the Stratus version of this product, the intentional hardware failures and reintegrations we performed didn't result in any noticeable disruption of application processing.

Another interesting aspect of the NEC and Stratus products is their use of hardened device drivers for Gigabit Ethernet, SCSI, and Fibre Channel adapters; the hardened drivers let a failed adapter seamlessly fail over to its counterpart. In addition, these drivers isolate I/O exceptions from the OS's kernel, providing a stable OS environment that should reduce the number of software failures.

Since the introduction of their dual-processor servers, NEC and Stratus have also introduced new 4-way fault-tolerant servers based on a common hardware design. NEC's Express5800/ft 340Ha and Stratus's ftServer 6500 ship with 2.0GHz Xeon MP processors and 2MB of L3 cache in a 16U (28") rack-mount cabinet. The new design is also rated for five-nines hardware uptime using dual redundancy of components. For clients whose applications demand even greater uptime, both vendors offer a version of the server with three processor modules in an 18U (31.5") rack-mount cabinet for triple redundancy of that subsystem.

The four-processor server's standard I/O subsystem, with its two SCSI controllers, is similar to the two-processor model's subsystem. Like the 2-way server, this model also relies on VERITAS Volume Manager to mirror the data over two drive sets. The new server's larger storage module holds as many as 14 drives instead of just 6. For buyers who prefer a hardware RAID solution, both companies offer an additional Fibre Channel RAID storage module with dual RAID controllers and as many as 14 Fibre Channel drives.

Stratus also offers the ftServer 6500 in a two-processor configuration called the ftServer 5240, and NEC has introduced the Express5800/ft 320Lb, a dual-processor fault-tolerant system in a high-density 4U (7") rack-mount form factor. Pricing wasn't determined at press time. The Express5800/ft 320Lb, which uses NEC's core logic, has 2.4GHz Xeon DP processors with 512KB of L2 cache and is available with triple redundancy of the processor module in a 7U (12.25") rack-mount chassis. This product uses the same storage strategy as NEC's existing two-processor model with as many as six internal hard disks but merges the I/O and storage modules to save space.

Which Solution Is Right for You?
In our tests of the earlier Marathon and Stratus two-processor fault-tolerant systems, we were impressed with the way the products handled hardware faults and reintegration processes with almost no impact on applications. Configuration and setup were extremely easy with the Stratus product. We would expect the newer NEC and Stratus models to perform similarly and be just as easy to configure because they use nearly identical designs and the same technology.

But, as you might expect, component replication makes most fault-tolerant systems expensive. For example, NEC's and Stratus's two-processor 800MHz Pentium III systems, configured with two redundant processors, 2GB of RAM, and six 36GB drives, cost almost $30,000. Stratus's new 4-way system in a typical configuration that includes a hardware RAID module and 14 drives costs about $160,000. Pricing for NEC's 4-way system in an identical configuration should be similar (pricing hadn't been announced at press time). Even if you omit the hardware RAID module and connect these 4-way systems to a Storage Area Network (SAN), prices hover around $125,000.

If you already have four servers (two identical pairs), the Endurance 6200 4.1 hardware and software kit, which sells for about $20,000, might be a cost-effective way to achieve fault tolerance. But the most attractive option might be Marathon's Endurance software offering, which requires just two identically equipped servers and could be an exceptionally attractive solution if the pricing is as low as Marathon suggests it will be.

Comparing the manufacturers' uptime claims for these products to clustering solutions is inordinately difficult because a lot depends on the reliability of your applications. Obviously, if your applications demand the performance of more than four processors, clustering is your best alternative.

Contact the Vendors
ENDURANCE 6200 4.1,
ENDURANCE SOFTWARE

Marathon Technologies * 888-682-1142
http://www.marathontechnologies.com

EXPRESS5800/FT 320LA,
EXPRESS5800/FT 320LB,
EXPRESS5800/FT 340HA

NEC Technologies * 866-632-3226
http://www.necftservers.com

FTSERVER 3210, FTSERVER 5240,
FTSERVER 6500

Stratus Technologies * 978-461-7000
http://www.stratus.com