Fault tolerance beyond clustering

Fault tolerance means different things to different people. According to a broad definition, fault tolerance ensures that an application is always available to its users. For example, if a problem occurs with an application on one server in a clustered server scenario, another server takes over. But although clusters provide high availability for applications, they don't satisfy my definition of true fault tolerance because the application's recovery from a system failure isn't always transparent to users.

Clustering actually has several drawbacks. To be fully functional in a clustered setting, applications must be cluster-aware. Because developers must include code that supports clustering and failover, a typical off-the-shelf Win32 application doesn't fully support clustered operation. In a clustering scenario, the failover—and failback—process isn't perfect: When an application fails on a server, existing user sessions disappear, forcing users to reconnect to the application after it moves to a new server. If the application relies on the server to maintain session state information, the client's session state information is lost.

Marathon Technologies has developed a new approach that doesn't suffer from the shortcomings of clustered servers. In this Lab feature, I take a close look at the company's Endurance 6200 3.0 fault-tolerant server array.

A Unique Architecture
The Endurance architecture separates application processing and application I/O onto two different computers and provides a redundant backup for each. The resulting set of systems—or array, in Marathon terminology—can continue application processing uninterrupted after any individual component fails.

As Figure 1, page 78, shows, an Endurance system consists of two pairs of interconnected Windows NT 4.0 servers that appear to the user (and the application) as one system. Each pair—or tuple, in Marathon terminology—includes one dual-processor Compute Element (CE) and one I/O Processor (IOP). All I/O devices (e.g., hard disks, network adapters, CD-ROM drive, 3.5" disk drive) physically connect to the IOPs. The CEs have only one standard I/O device: a 3.5" disk drive intended for firmware updates. The two IOPs also must have identical configurations, with the same NICs and RAID controllers in the same slots and identical disk configurations. Each computer uses a Marathon Interface Card (MIC)—a proprietary 50MBps, full-duplex, low-latency, dual-port card that interconnects each IOP with both CEs. In each IOP, you must also install another standard NIC (preferably Gigabit Ethernet) that Endurance can use to mirror data from one IOP to the other during installation and during the recovery phase following the failure of an IOP. You can separate the two tuples with as much as 500 meters of multimode fiber-optic cable.

Both tuples run your application at the same time, all the time. This design is the key difference between an Endurance array and a server cluster. If any element of the array fails, the application continues to run uninterrupted on the rest of the array. Within each tuple, the CE performs all application processing and the IOP handles all I/O. This separation insulates the CEs from any failures that disk subsystem components might induce. However, poorly written application code can still cause an application to fail, and because the application runs on both CEs simultaneously, an application that fails for this reason fails on both CEs.

Although both IOPs have a NIC connected to the same network segment, only one NIC is active at a time. If the active network connection fails, the standby takes over, using the same IP address and the Endurance-assigned soft media access control (MAC) address. Endurance lets you configure each IOP with as many as four NICs.

Endurance's underlying architecture is fairly complex. The system redirects network I/O from the active network connection to both CEs. Application I/O requests originate on one CE, and both CEs can redirect the I/O requests to both IOPs. I/O is synchronous—the system doesn't signal I/O write completion back to the application on the CEs until both IOPs have completed the write. Marathon reports that because of the overhead involved, the throughput of an application running on an Endurance array ranges from 85 to 90 percent of the throughput when the application runs on one server.
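To make the write-completion rule concrete, here is a minimal sketch, in Python, of what synchronous mirrored writes imply: the application sees a write as complete only after both I/O processors acknowledge it. The class and method names are hypothetical illustrations of the concept, not Marathon's implementation.

# Conceptual model of synchronous mirrored writes; all names are hypothetical
# and illustrate the idea only (this is not Marathon's code).
from concurrent.futures import ThreadPoolExecutor

class MirroredVolume:
    """A volume whose writes must complete on two I/O processors (IOPs)."""

    def __init__(self, iop1, iop2):
        self.iops = (iop1, iop2)
        self.pool = ThreadPoolExecutor(max_workers=2)

    def write(self, block, data):
        # Issue the same write to both IOPs in parallel.
        futures = [self.pool.submit(iop.write, block, data) for iop in self.iops]
        # Wait until BOTH IOPs confirm before signaling completion to the
        # application, so acknowledged data can never exist on only one IOP.
        for future in futures:
            future.result()
        return True  # only now does the application see the write as complete

Because every write waits for the slower of the two acknowledgments, some throughput loss is unavoidable, which is consistent with the 85 to 90 percent figure that Marathon cites.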

To let both CE processors execute the same instruction at the same time, the CEs must be identically configured, down to processor, BIOS, and firmware revision levels. The requirement that both CEs have exactly two processors—no more and no less—is a limitation of the current Endurance implementation. Marathon expects to support four-processor systems by the end of this year. Such support should make Endurance a more attractive platform for fault-tolerant database servers.

Endurance requires Intel-based servers. Marathon can provide you with a list of tested hardware configurations. In general, the system supports equipment from the major server manufacturers, and Marathon's Web site advertises systems based on Hewlett-Packard (HP), IBM, Dell, and Compaq hardware. The Endurance array that I tested required NT Server 4.0 with Service Pack 6 (SP6) or SP6a. A Windows 2000 version will be available by press time, and Marathon plans to support both OSs for the foreseeable future.

Installation
Currently, Marathon directly sells only preconfigured systems—a fairly recent change in Marathon's marketing practices. You can also purchase Endurance solutions through a network of authorized resellers. To get a better understanding of the product and participate in the installation process, I asked Marathon to send a representative to the Windows 2000 Magazine Lab to guide me through an Endurance installation.

The description of the basic Endurance software installation procedure spans 35 pages in the Endurance Installation Guide. The documentation describes the installation process fairly well, but if you're not familiar with the product's underlying architecture, you'll find ample opportunity for misunderstandings or missteps. Although I now feel confident that I could install the product without difficulty, I was glad to have an expert leading me through the process the first time—and I understand why Marathon sells only preconfigured Endurance systems.

For each IOP, I used a Compaq ProLiant DL580 server equipped with two 700MHz Pentium III Xeon processors with 1MB of cache and 512MB of Error-Correcting Code (ECC) memory. I mirrored two 18GB 10,000rpm hot-swappable disks connected to the DL580's integrated RAID controller. Using Compaq's Array Configuration Utility, I allocated this disk space to two 4GB LUNs, one for the IOP boot device and the other for the CE remote boot device. I allocated the balance of the space (about 9GB) to a third LUN, which I later divided equally into three NTFS volumes. For application data storage, a Compaq Smart Array 5300 Ultra 3 RAID controller connected additional 18GB 10,000rpm disks in an external storage cabinet. I chose Microsoft Exchange Server 5.5 as my primary test application, and I allocated four disks to a RAID 10 array for Exchange Server log files and five disks to a RAID 5 array for Information Store (IS) files. One Intel EtherExpress Pro/1000 F Server Adapter served as the connection to the client network, and another served as the private IOP-to-IOP link.

For the CEs, I used Compaq ProLiant DL380 servers equipped with two 866MHz Pentium III processors and 256MB of SDRAM. In each CE, I removed all cables connected to the server board's integrated IDE and RAID controllers so that the MIC, the 3.5" disk drive, and the integrated VGA controller would be the only active I/O devices in the system.

I connected all four computers to the Lab's Raritan Computer keyboard/video/mouse (KVM) network, which let me operate the computers from any of the Lab's Raritan consoles. Then, I installed a MIC in each computer and cabled the systems together as Figure 1 shows. Next, I updated the computers' BIOS and the firmware on the MICs and RAID controller hardware so that the two CEs matched and the two IOPs matched.

To begin software installation, I did a standard install of NT 4.0 on one IOP. During installation, I gave the IOP the name that I wanted to be the public network name for the completed server array. Endurance automatically mirrors disks between IOPs at startup and during an IOP recovery. Per the installation procedure, I placed NT 4.0 system files on the CE boot disk by using a Marathon utility to copy the NT 4.0 system that I had just installed on the IOP boot disk to the LUN that I designated as the CE boot partition.

Next, I deleted the original NT 4.0 partition and performed another fresh installation of NT 4.0—this time assigning the server the name IOP1 and choosing not to install Microsoft Internet Information Server (IIS). (You must include the characters "IOP1" in the server name.) I installed the Endurance software to IOP1, then used detailed device hardware name and address information that I had collected earlier to configure the IOP. Determining the SCSI device address information as NT configures the LUNs and correctly configuring Endurance with this information are crucial to the installation process. However, the Endurance Installation Guide doesn't document this step as well as it might, so you could easily make a mistake at this point.

The process also required that I install several Endurance-related network drivers: an IOP Link Driver, an Ethernet Provider, and a Datagram Service. Because no network adapter uses all these services, I used the CE Ethernet Properties window that Figure 2 shows to disable unnecessary service bindings.

Next, I repeated the installation process on IOP2, installing NT, the Endurance software, and related network drivers. With both IOPs running, I rebooted the CEs. Using the CE desktop utility (which installed with Endurance on the IOPs), I was able to interact with the CE computers to install application software on the Endurance array. (I had configured the 3.5" drive and CD-ROM drive in IOP1 as part of the Endurance array.)

To finish the installation, I installed the Endurance Manager utility and the Endurance network redirectors on the CE desktop. The Endurance Manager displays a graphical representation of the array, color-coded to show the operational status of each component. An Ethernet Redirector supports the network adapter installed in the IOPs, and two Virtual Network Redirectors support the CEs' connections to IOP1 and IOP2. After installing the Ethernet Redirector on the CE desktop, I removed the physical Ethernet adapter driver (an Intel Pro/1000 F adapter) because the CEs would use the Ethernet Redirector to communicate with the IOPs.

On the CE desktop, I initiated a typical NT 4.0 system restart. The restart caused both of the CE computers to restart. When they came back up, I logged on to NT 4.0 on the CE desktop and started the Endurance Manager. The status windows showed that the Endurance system was operating properly.

Running with Endurance
The first time Endurance starts up, it mirrors the data from IOP1's LUNs to IOP2's LUNs. The Endurance Manager displays the current status of each mirror copy, as you can see in Figure 3. Similarly, any time an IOP comes back online following a failure, Endurance automatically remirrors the disks. You can avoid the remirroring process only by using the Server Shutdown procedure to shut down the array gracefully. (To find this option, double-click on any array element displayed in the Endurance Manager.)

After I had the Endurance array running, I decided to put it to work. I chose to use the array as a file server and as a fully configured Exchange Server mail server, so I began configuring the server for both applications. In an Endurance array, the CEs perform all application processing. When you install application software or create network shares, you do so from the CE desktop, thereby modifying the NT installation that the CEs boot. Because the two CEs run in lockstep, you see only one CE desktop. Which IOP you connect to doesn't matter.

Using the CE desktop application from IOP1, I first shared the five-disk RAID 5 volume that I had created earlier. The share promptly showed up in the other computers' Network Neighborhood. I was able to map a drive letter to the share and use it as I would any other shared network drive. Endurance passed this first test easily—from a network user's perspective, the array appeared as any other network server would.

I wanted to test Endurance using Exchange Server 5.5 with SP4, which includes an automatic Microsoft Outlook Web Access (OWA) installation and an optional Internet Mail Connector (IMC) installation. The mail server's prerequisites made the overall installation process interesting. I started by installing Microsoft Internet Explorer (IE) 5.5 and the NT 4.0 Option Pack, then reinstalled NT 4.0 SP6a. Together, these steps installed IIS 4.0, updated to current service releases, and prepared the server for OWA.

Reinstalling SP6a on the CE's remote boot disk was fairly complicated. The Endurance software replaces some core NT OS modules, and the act of applying a service pack overwrites the Endurance modules with NT modules. Before rebooting the CEs (as the NT service pack installation requires), the Endurance array administrator must rerun the Endurance software installation procedure and—if applicable—reinstall the Endurance software update. I typically copy software from installation CD-ROMs to a hard disk on the server, and in this case I had already copied SP6a, Endurance, and the Endurance update to the CE's C drive. With the software on a local disk, rerunning the Endurance installation procedures was easy and less time-consuming than the SP6a installation procedure. Then, I could safely click Reboot at the end of the SP6a installation procedure.

After I completed the installation of Exchange Server prerequisites, I proceeded through a typical single-server Exchange Server installation. The lone glitch occurred when I began installing IMC, which requires that the Exchange Server system use a fixed IP address. The installation failed because I had configured the Endurance array to use a DHCP-assigned address. I needed to change the TCP/IP configuration but wasn't sure whether to change the address of the Intel adapter on the IOP or of the Ethernet Redirector on the CE desktop. A call to Marathon technical support led me to configure a fixed IP address for the Ethernet Redirector. I then rebooted the CEs from the CE desktop, and the IMC installation finished successfully.

A couple of quick checks verified that my Exchange Server installation was working properly. I used the Microsoft Exchange Administrator utility from the CE desktop to define a few mailboxes. To send and receive mail, I used a Win2K Professional workstation running Microsoft Outlook 2000 and IE 5.5 that was connected to the Endurance array's local subnet. After I created the necessary Exchange Server profile, I used Outlook 2000 to open one of the new mailboxes. In IE, I used OWA to open another mailbox. I was able to send mail between the mailboxes and to my company email mailbox. Everything worked perfectly.

Testing 1, 2 . . .
My first goal was to test OWA's ability to tolerate a fault on the Endurance array. I decided to create an email message from my test workstation and attach a 21MB file that resided on a server separate from the Endurance array. The file would have to trek across the network to my workstation, then to the Exchange Server system, where OWA would attach the file to the message.

While OWA read the 21MB file and attached it to the message—a 55-second process—I could break the Endurance array and see whether OWA could still attach the file. I tried five tests: I pulled the power cable from one of the CEs, restored the downed CE to operation, disconnected the active public network connection, pulled the power cord from IOP2, and restored IOP2 to operation. In each case, OWA successfully attached the file, in spite of the failure or recovery operation that I performed while OWA performed the attachment.

I ran a similar series of tests while copying an 85MB file from a Win2K Advanced Server-based Compaq TaskSmart N2400 Network Attached Storage (NAS) server console to a shared network drive resident on the Endurance array's RAID 5 volume. Using the Microsoft Windows 2000 Resource Kit's timethis.exe utility, I timed the duration of the file-copy operation during various failure and recovery operations.

In all cases, the file-copy operation completed without error. As Table 1 shows, none of the fault or recovery operations added more than 1.3 seconds to the file-copy time except when Endurance had to restore a CE to operation. In that case, my tests showed a delay of 4.5 seconds. A Marathon support technician explained that a pause in processing occurs while an IOP copies the contents of memory from the active CE to the CE that Endurance is restoring to operational status, thereby letting Endurance resume operation of both CEs at the same processing point.
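That explanation suggests a simple model of the restoration sequence: pause application processing, copy the active CE's memory image to the returning CE, and resume both CEs in lockstep from the same point. The sketch below illustrates that sequence in Python; the object and method names are hypothetical, not Marathon's code, and the pause is the source of the longer delay I measured.

# Illustrative model of the CE restoration pause; all names are hypothetical.
import time

def restore_compute_element(array, active_ce, returning_ce):
    start = time.monotonic()
    array.pause_processing()                  # application I/O briefly stalls here
    image = active_ce.snapshot_memory()       # copy the active CE's memory contents
    returning_ce.load_memory(image)           # bring the returning CE to the same state
    array.resume_lockstep(active_ce, returning_ce)  # both CEs continue from one point
    return time.monotonic() - start           # roughly the delay a client would observe

Under this model, the length of the pause would presumably scale with the amount of CE memory that must be copied.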

Pricing and Availability
At one time, Marathon sold Endurance as a kit. Today, Marathon sells Endurance only as a preconfigured system through authorized resellers. Marathon declined to provide pricing information, instead directing me to one of the company's authorized resellers. IN ArchITechs, a division of Gain Systems, provided pricing for Compaq-based Endurance arrays. The reseller's price for a DL580/DL380-based array, with the same configuration that I tested, was $112,854. This price includes onsite installation, a 12-month Mission Critical Support Plan, all hardware mounted in a Compaq rack, and NT and Exchange Server 5.5 licenses. IN ArchITechs also provided pricing for an Endurance configuration that the company recommends for use with Exchange Server. This Compaq ProLiant-based system costs $100,182. The price includes dual 800MHz processors and 512MB of SDRAM in the Compaq DL360R CEs, one 800MHz processor with 256MB of SDRAM in the Compaq ML530R IOPs, a Compaq Smart Array 4200 RAID controller, seven disk drives in each IOP, and a Gigabit Ethernet NIC.

HP, another authorized Endurance reseller, also provided pricing for a sample configuration. A configuration that HP recommends for Exchange Server costs $88,796. This price includes a rack-mounted HP LP1000rs with dual 900MHz processors and 512MB of SDRAM as CEs; an HP LH3000s with dual 933MHz processors, 256MB of SDRAM, 12 disk drives, a dual-channel Ultra 3 SCSI 64MB cache RAID controller, and a Gigabit Ethernet NIC as IOPs; and HP onsite service. (HP's configuration doesn't include Exchange Server licensing.)

A Good Alternative
Endurance supports virtually all Win32 applications. In my tests, the Endurance array worked as advertised, permitting Exchange Server and file-copy operations to continue through simulated hardware faults and hardware recovery with no more than a few seconds of delay. The Endurance architecture is fairly complex, but its learning curve is no steeper than that of Microsoft Cluster service, and some administrators might find it easier to master.

On the downside, the version I tested supports only dual-processor CEs, and the level of fault tolerance that Endurance achieves comes at a cost, requiring four servers and twice the disk storage that a single-server solution would require—in addition to the cost of Endurance. However, if your business can't endure application outages but can tolerate the short delays that I observed during fault handling and recovery, the Endurance solution is worth your consideration.

Endurance 6200 3.0
Contact: Marathon Technologies * 978-266-9999 or 800-884-6425
Web: http://www.marathontechnologies.com
Price: $112,854 for tested configuration; $88,796 for Hewlett-Packard sample configuration
Decision Summary:
Pros: Excellent compute-through fault tolerance for off-the-shelf Win32 applications; uninterrupted application performance during server component failures; excellent technical support; available for use on a variety of Intel-based servers
Cons: More expensive than a Microsoft Cluster service implementation; complex architecture; supports only dual-processor systems as the Compute Element