Several years ago, a major luggage manufacturer ran a commercial that showed a gorilla tossing a suitcase around its cage. The commercial's message was that no matter how much you abuse this manufacturer's suitcases, they don't break open, and they continue to do their job. Clustering vendors could portray their products with similar advertising. The idea behind clustering is that if a network's primary server fails for any reason (theoretically, including gorilla attack), another server takes over for it, and the network continues to function as usual. Clustering lets multiple servers work as a unit, providing fault tolerance and continuous availability to databases, email, files, and other core business applications.
Clustering software for Windows NT has changed substantially during the past year. When the Windows NT Magazine Lab reviewed clustering products in June 1997, most NT clustering solutions used active/standby configurations, in which one server in the cluster works, and the other server stands by to take over that work if the primary server fails. Today, most NT clustering systems use active/active configurations, in which both servers in a cluster work under typical conditions. (For more information about clustering configurations and vocabulary, see "Clustering Terms and Technologies," June 1997.)
Nevertheless, clustering products differ in important ways. All clustering solutions require primary and secondary servers to have access to the same information, but different products use different methods to achieve this goal. Some products require shared disks to house common information. Other products implement the shared-nothing approach, which uses replication to automatically duplicate information between clustered servers. Each approach has advantages and disadvantages. For example, replication increases network bandwidth requirements, and disk sharing increases hardware requirements.
In addition, some clustering products offer one-to-many failover, so that the primary server can fail over to multiple secondary servers. This feature lets you balance a server's failover workload among multiple machines. One-to-many failover also ensures that users will have access to a failed server's data and applications as long as one server in the cluster is functional, even if multiple servers fail. Some clustering products offer only one-to-many replication, which is useful for backup purposes, but not one-to-many failover. Other clustering products don't offer either type of one-to-many functionality.
Microsoft's development and release of Microsoft Cluster Server (MSCS) has been a driving force behind trends in the NT clustering market over the past year. MSCS defines Microsoft's vision for clustering. MSCS uses a shared-nothing approach that requires more sophisticated hardware and software than most companies currently have. MSCS requires Windows NT Server 4.0, Enterprise Edition and fairly high-end hardware. You can compare your current system to Microsoft's Cluster Hardware Compatibility List (HCL) at http://www.microsoft.com/hwtest/hcl. (For more information about MSCS, see Brad Cooper, "Planning for Implementing Microsoft Cluster Server," March 1998, and Carlos Bernal, "Wolfpack Beta 2," June 1997.)
If your system doesn't meet MSCS's hardware and software requirements but you want the fault tolerance that clustering offers, you have two options: You can buy new hardware and software, or you can choose other NT clustering products that offer some or all of MSCS's capabilities without MSCS's system requirements. To evaluate the options available to administrators who want a cluster but can't afford MSCS's overhead, I tested six of MSCS's competitors: Computer Associates' (CA's) ARCserve Replication 4.0 for Windows NT, Vinca's Co-StandbyServer 1.02 for NT, NSI Software's Double-Take 1.5 for Windows NT, Veritas' FirstWatch 3.1.1 for Windows NT, NCR's LifeKeeper 2.0, and Qualix Group's OctopusHA+ 3.0.
The Test Configuration
Each product I tested requires a unique hardware configuration. Some of the products require shared disks; others require separate disks. Some of the products require one NIC per server; others require two or more NICs per server. Before you choose a clustering product for your system, identify all of the product's configuration requirements, or you might have to purchase additional hardware to get your system running.
My test environment consisted of three computers. My cluster's primary server, named Jerry, has two 200MHz Pentium processors. My cluster's secondary computer, named Ricky, has one 166MHz Pentium processor. Jerry and Ricky each have 64MB of RAM and a 3.1GB local IDE hard disk. I configured Jerry and Ricky in a cluster for each test and ran commands from a third computer, named Superperson, which has a 166MHz Pentium processor and 64MB of RAM, to test the cluster's failover capabilities. The network backbone for all my testing was a simple 100 megabits per second (Mbps) LAN.
For the products that require a shared disk, I installed two Adaptec 2940UW SCSI PCI adapters in each server, and I connected Jerry and Ricky to Storage Computer's Model R3X rack-mounted RAID 7 cabinet with eighteen 9GB hard disks (seventeen active disks and one hot-spare disk). I configured the cabinet into two partitions, which held about 68GB each, and attached two SCSI channels to each partition. This configuration gave me the equivalent of two hard disks that I could attach to both servers at once.
FirstWatch required an additional hardware configuration step: I had to bridge two hard disks together (I used 9GB SCSI Artecon LynxStak hard disks) and create a SCSI chain with one server on each end and the two hard disks in the middle. FirstWatch also forced me to reconfigure my network environment. For most of my tests, I used 10/100Mbps Intel NICs. FirstWatch required each server to have a 10/100Mbps Intel NIC and an Adaptec 6940TX Quartet four-port NIC.
After setting up a cluster that included Jerry and Ricky, I unplugged Jerry and tested each product's ability to provide three types of failover. First, I tested the product's ability to provide IP address failover. IP address failover lets a secondary server assume the IP address of a failed primary server while retaining its own IP address. The secondary server answers requests for both IP addresses while the primary server is down. Most network applications depend on IP addresses for end-to-end connections, so clustering products must provide IP address failover to be effective. To test the six products' IP address failover capabilities, I ran the ping command in an infinite loop from Superperson to Jerry. When Jerry failed, the ping command generated timeout errors until Ricky took over. When the clustering product transferred Jerry's IP address to Ricky, the ping command reported standard responses again.
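A test like this one is easy to approximate in a script. The following Python sketch is an illustration, not the Lab's actual tooling: it probes a host at a fixed interval and reports the longest stretch during which the host failed to answer. The function names are hypothetical, and the sketch assumes that the local ping command returns a nonzero exit status when no reply arrives, which is not guaranteed on every platform.

```python
import platform
import subprocess
import time


def ping_once(host):
    """Send a single echo request and report whether the ping command succeeded."""
    # Windows ping counts requests with -n; most other pings use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(
        ["ping", count_flag, "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def collect_samples(host, rounds, interval_s=1.0, probe=ping_once):
    """Probe the host `rounds` times and return (timestamp, answered) pairs."""
    samples = []
    for _ in range(rounds):
        samples.append((time.monotonic(), probe(host)))
        time.sleep(interval_s)
    return samples


def longest_outage(samples):
    """Return the longest gap, in seconds, from a first failed probe to the next success."""
    longest = 0.0
    down_since = None
    for stamp, answered in samples:
        if not answered and down_since is None:
            down_since = stamp            # outage begins
        elif answered and down_since is not None:
            longest = max(longest, stamp - down_since)
            down_since = None             # outage ends
    return longest
```

Calling `longest_outage(collect_samples("jerry", 120))` while the primary server loses power would give a rough figure for how long the IP address was unreachable. The resolution of that figure is limited by the probe interval.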
Second, I tested each product for NetBIOS name failover. To achieve NetBIOS name failover, a secondary server assumes the failed primary server's NetBIOS name. This NetBIOS name transfer keeps shared files and printers available under their usual Uniform Naming Convention (UNC) path or network node name when their primary server goes down. Three of the products I tested use virtual name failover instead of NetBIOS name failover. They let you create a virtual name for the primary server, then transfer that name to the secondary server when the primary server fails. I used nbtstat commands to test the six products for NetBIOS or virtual name failover. I ran an nbtstat command, which displays the NetBIOS names a system has registered, on Jerry before I pulled the plug. After I removed power from Jerry, I ran an nbtstat command on Ricky to verify that it picked up Jerry's name.
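The name check can also be scripted. This Python sketch is a hedged illustration that assumes nbtstat -n prints one row per name in the form name, two-digit hexadecimal suffix in angle brackets, UNIQUE or GROUP, and a status word. The function names are hypothetical, and the parsing might need adjustment for a particular NT build or locale.

```python
import re
import subprocess

# Assumed row layout: NAME <suffix> UNIQUE|GROUP Status
_NAME_ROW = re.compile(
    r"^\s*(.+?)\s*<([0-9A-Fa-f]{2})>\s+(UNIQUE|GROUP)\s+(\S+)",
    re.MULTILINE,
)


def registered_names(nbtstat_output):
    """Return the set of names whose status column reads Registered."""
    names = set()
    for name, _suffix, _kind, status in _NAME_ROW.findall(nbtstat_output):
        if status.lower() == "registered":
            names.add(name.upper())
    return names


def local_names():
    """Run nbtstat -n on the local machine and parse its name table."""
    completed = subprocess.run(["nbtstat", "-n"], capture_output=True, text=True)
    return registered_names(completed.stdout)
```

After a failover, a check such as `"JERRY" in local_names()` on the secondary server would confirm that the name had moved.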
Third, I tested each product for SQL Server failover, which transfers responsibility for SQL Server databases from the primary server to the secondary server. SQL Server failover lets applications reconnect to a SQL Server database and access the database's information after the primary server fails. Servers assign a unique handle to each client-server connection, and a primary server doesn't share these handles with its secondary server. When the primary server fails, client software loses its connection to the server, but the client software can issue requests to establish a new connection to the primary server. Products that offer SQL Server failover transfer these requests to the secondary server. To determine whether the products offer this functionality, the Lab wrote an Open Database Connectivity (ODBC) ping program that opened an ODBC link to Jerry and issued read requests. When Jerry failed, the program asked for a new connection and continued to issue read requests. The products that provide SQL Server failover transferred the connection request and subsequent read requests to Ricky.
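The Lab's ODBC ping program isn't reproduced here, but its reconnect logic can be sketched in a few lines of Python. In this sketch, `connect` and `run_query` are hypothetical callables that stand in for whatever database driver you use. The code is an illustration of the technique, not the program the Lab wrote.

```python
import time


def query_with_reconnect(connect, run_query, rounds, retry_delay=1.0, sleep=time.sleep):
    """Issue `rounds` read requests, asking for a new connection after any failure.

    connect()       returns a connection object or raises an exception
    run_query(conn) performs one read and raises an exception on failure
    Returns a list with "ok" or "fail" for each round.
    """
    log = []
    conn = None
    for _ in range(rounds):
        try:
            if conn is None:
                conn = connect()      # the old handle is gone, so request a fresh one
            run_query(conn)
            log.append("ok")
        except Exception:
            conn = None               # drop the dead handle and retry on the next round
            log.append("fail")
            sleep(retry_delay)
    return log
```

With a cluster that provides SQL Server failover, the log would show a run of failures after the primary server goes down, followed by successes once the secondary server answers the new connection request.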
To evaluate the six products, I considered my failover tests' results and the complexity of each product's configuration process. I looked at the complexity of SQL Server installation on each product. Finally, I examined each product's documentation, online Help, technical support, and price. Table 1, page 88, shows the results of my analysis.
ARCserve Replication 4.0 for Windows NT
I tested the beta version (build 48) of CA's ARCserve Replication 4.0 for Windows NT. The product doesn't require unusual hardware configuration, so I configured Jerry and Ricky in a shared-nothing cluster with one network link. I encountered my only problem with ARCserve Replication when I inserted the installation CD-ROM into Jerry. I had enabled Autorun, and the CD-ROM's root directory contained an autorun script, but the CD-ROM didn't run automatically. I suspect the problem resulted from the product's requirement for a 256-color display setting, a requirement that CA claims to have eliminated. I launched the program's setup file from Windows Explorer instead. The installation wizard appeared and provided me with easy steps for installing the product. I didn't need to use the Help file or product manual during installation.
ARCserve Replication 4.0 for Windows NT
Contact: Computer Associates 800-225-5224
Price: $2995 (1 primary server, unlimited secondary servers)
System Requirements: x86 processor or better, Windows NT Server 4.0, Service Pack 3, 10MB of hard disk space, 32MB of RAM
The product offers three installation options: ARCserve Replication Manager, which you load on one computer in your cluster's domain to manage the cluster's servers; ARCserve Replication Server, which you must load on every server that is a member of the cluster; and Alert, which you can load on any computer in the domain to receive popup alerts about cluster events. I installed all three options on Jerry, and the ARCserve Replication Server and Alert options on Ricky. Installing the product on both servers and rebooting them took me less than 5 minutes.
I opened the ARCserve Replication administration program, which Screen 1 shows, and immediately began to configure the cluster. The administration program's left pane automatically listed Jerry and Ricky as Managed Servers. My configuration proceeded smoothly, thanks to ARCserve Replication's easy-to-understand wizards. I added Jerry and Ricky to my cluster and set the software's parameters, designating which servers I wanted to fail over and setting a temporary IP address for failed servers to adopt when they come back up. (By including this temporary IP address, you alleviate potential IP conflicts when reinstating a failed server.)
After configuring my cluster, I replicated files, directories, and disks from Jerry to Ricky. I wish the product had let me exclude specific files or subdirectories from the directories I replicated. Despite this complaint, I was impressed by the fast, efficient replication process. ARCserve Replication transparently synchronizes content between primary and secondary servers, even when a user is accessing that content. I used the Replication Status and Synchronization Speed text boxes on the ARCserve Replication administration program's Task Status tab to monitor content replication from Jerry to Ricky. The replication was lightning fast, and shortly after installing the software, I was ready to test its failover capabilities.
ARCserve Replication includes an interesting failover scheme that verifies that a primary server is down before beginning the failover process. You can set ARCserve Replication to test the primary server's connections to multiple NICs on multiple machines before failing over to the secondary server. This feature prevents unnecessary failovers when a problem arises in the network connection between the primary and secondary servers but the primary server is still active.
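The logic behind this kind of verification is simple to express. The Python sketch below is an assumption about the general technique, not a description of CA's implementation: the secondary server declares the primary server down only when every independent path it checks reports no answer. The function name and the path labels are hypothetical.

```python
def primary_is_down(probe_results):
    """Decide whether to fail over.

    probe_results maps a path label (for example "heartbeat link") to True if
    the primary server answered on that path and False if it did not.
    """
    # With no evidence at all, stay put rather than fail over blindly.
    if not probe_results:
        return False
    # One answering path means the primary is alive and only a link has failed.
    return not any(probe_results.values())
```

For example, `primary_is_down({"heartbeat link": False, "public NIC": True})` returns False, so a broken crossover cable alone would not trigger a failover.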
To test ARCserve Replication's failover process, I set the software to provide failover as soon as Ricky lost track of Jerry's heartbeats, the periodic signals clustered servers emit to alert other servers in the cluster that they are functional. I started running ping commands to Jerry, then pulled Jerry's plug. In a few seconds, an alert appeared on Ricky to notify me that Jerry had failed and ARCserve Replication had begun failover. After 30 seconds, I used the nbtstat -a ricky command and the ipconfig command to verify that Jerry's NetBIOS name and IP address had moved to Ricky.
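Heartbeat detection generally comes down to a timeout. The following Python sketch shows the idea under simple assumptions: the class name and the timeout value are hypothetical, and real clustering products add retries and multiple links on top of this.

```python
class HeartbeatMonitor:
    """Track the most recent heartbeat from a peer and flag a timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = None

    def beat(self, now):
        """Record that a heartbeat arrived at time `now` (in seconds)."""
        self.last_seen = now

    def peer_failed(self, now):
        """Return True if the peer has been silent longer than the timeout."""
        if self.last_seen is None:
            return False   # never heard from the peer, so there is nothing to time out
        return now - self.last_seen > self.timeout_s
```

A short timeout makes failover start sooner but raises the risk of reacting to a brief network hiccup, which is the trade-off that the verification feature described earlier is meant to soften.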
Next, I tested ARCserve Replication's ability to transfer Jerry's name and IP address back to Jerry. I expected the failback process to be laborious, but ARCserve Replication includes a reinstatement wizard that leads users through the process of reinstating a NetBIOS name and IP address to their original server. I accepted the wizard's default settings and clicked Start. In less than 30 seconds, the reinstatement process removed Jerry's NetBIOS name and IP address from Ricky, removed the temporary IP address Jerry assumed when it first came back up, and reinstated Jerry's original name and IP address. As ARCserve Replication reinstated Jerry, it resynchronized communication between the two machines and initiated replication between the servers.
Next, I tested ARCserve Replication's SQL Server failover. The product currently lacks preconfigured scripts for application failover. (CA plans to include an option to automatically start services in ARCserve Replication 4.0 Service Pack 1, which CA will release soon.) You must use command (.cmd) or executable (.exe) files to start the services and set the parameters that application failover requires. To set up my SQL Server tests, I replicated SQL Server databases from Jerry to Ricky. I wrote a script that included commands for starting SQL services and saved the script on Ricky. Finally, I set SQL services to start manually on Ricky to prepare the server for Jerry's failover. I started the ODBC ping test and pulled the plug on Jerry to simulate hardware failure. In about 30 seconds, ARCserve Replication began producing nonconnection messages, and 15 seconds later, Ricky began responding to my test SQL queries.
ARCserve Replication requires you to write an ASCII script that includes appropriate NET START and NET STOP commands to start and stop NT services. I would like to be able to choose between selecting NT services from a menu and creating ASCII scripts. Nevertheless, after my tests I found myself thinking, "This version is beta?" CA appears to have incorporated clustering consumers' feedback into the ARCserve Replication 4.0 beta. The product sets itself apart from its competitors through its easy installation and management and its high-speed replication. In addition, ARCserve Replication works hand-in-hand with CA's backup and disaster-recovery software and Unicenter TNG network management tool. ARCserve Replication's biggest disadvantage is that it doesn't offer one-to-many failover.
Co-StandbyServer 1.02 for NT
Vinca released Co-StandbyServer for NT as a follow-up to StandbyServer for NT. (For more information about StandbyServer for NT, see Dean Porter, "Vinca StandbyServer for NT," June 1997.) You can configure Co-StandbyServer as active/active or active/standby. I chose the active/active configuration, which requires NT's Disk Administrator to see three physical disks on each computer.
Co-StandbyServer 1.02 for NT
Contact: Vinca 801-223-3100 or 888-808-4622
Price: $2250 per server
System Requirements: x86 processor or better, Windows NT Server 4.0, Service Pack 3, 30MB of hard disk space, 32MB of RAM (64MB of RAM recommended), three physical hard disks
Co-StandbyServer uses a different failover method than the other products I reviewed. Rather than perform a separate failover for each resource, the product creates a failover group for each primary server, a group of the server's resources and Registry entries that fail over at the same time. During the configuration process, you assign to a server's failover group all the server's volumes, shares, IP addresses, printers, and applications that you want to fail over to the secondary server.
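One way to picture a failover group is as a container whose contents always move together. The Python sketch below is a conceptual model only. The class, its fields, and the resource labels are hypothetical and don't reflect Vinca's internal design.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverGroup:
    name: str                      # for example "Jerry-0"
    home: str                      # server that normally owns the group
    resources: list = field(default_factory=list)
    owner: str = ""                # server currently hosting the resources

    def __post_init__(self):
        if not self.owner:
            self.owner = self.home

    def fail_over(self, secondary):
        """Move every resource in the group to the secondary server at once."""
        self.owner = secondary

    def fail_back(self):
        """Return the whole group to its home server."""
        self.owner = self.home


def resources_on(server, groups):
    """List every resource currently hosted by `server`."""
    return [res for group in groups if group.owner == server for res in group.resources]
```

Because ownership is recorded once per group rather than once per resource, a volume, a share, and an IP address assigned to the same group can't end up on different servers.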
You can install the server module of Co-StandbyServer on two servers that act as a Primary Domain Controller (PDC) and Backup Domain Controller (BDC), two BDCs, or two member servers. I chose the third option. You must install the Co-StandbyServer client software on at least one computer in your domain.
Before installing Co-StandbyServer, I had to run Disk Administrator to assign a signature to both servers' hard disks, create two IP addresses, assign one IP address to the dedicated Vinca link adapter, and assign the other IP address to Jerry's NIC. After performing these tasks, I unbound the Windows Internet Naming Service (WINS) client and bound the TCP/IP protocol to the Vinca link adapter. I also set up a user account that was a member of the Administrators and Domain Admins groups. Then I began installing the software.
During the Co-StandbyServer installation, you select the servers you want to cluster. I selected Jerry and Ricky. I named the cluster Labtest and chose the default names for each server's failover group, Jerry-0 and Ricky-0. Next, I identified the NIC I had connected to the Vinca link adapter and the IP addresses I wanted to lock to the NIC. (The server uses these IP addresses for administrative tasks.) Finally, I entered the username and password of the administrator account I created.
I installed the Co-StandbyServer Management Console software on Jerry. (You can install Management Console, which provides administration for the cluster, on any system with access to the domain.) I rebooted Jerry, and the installation was complete. However, before I opened the product, a Vinca technical support representative called to tell me I needed to install Vinca's Service Pack 2 (SP2) to address problems in the Co-StandbyServer SQL Server failover script. I installed SP2 promptly.
I opened Management Console, which Screen 2 shows, and clicked File, Connect to create the cluster. Co-StandbyServer identified Jerry's and Ricky's resources and registered them in the Resources windows, Management Console's right panes. I configured an IP address to fail over from Jerry to Ricky: I opened Jerry's Network cards folder in the Resources window and selected Jerry's primary IP address. I opened Management Console's drop-down Cluster menu and selected the NIC on Ricky that I wanted to bind to Jerry's primary IP address. I assigned the IP address to the Jerry-0 failover group and clicked Finish. The clustered IP address soon appeared under Jerry-0 in the Cluster window, Management Console's left pane.
When I unplugged Jerry to test the cluster's IP failover, my ping command responded with Host not found errors for about 5 seconds before Ricky took over Jerry's IP address. I used the ipconfig command to verify that Ricky was responding to the IP address.
To return the failed-over IP address to Jerry, I dragged it from Ricky to Jerry in the Cluster window. The failback process was efficient, and it didn't require a system reboot.
To set up Co-StandbyServer for virtual name failover, I double-clicked the Jerry-0 failover group in the Cluster window. The Failover Group Properties dialog box appeared. I selected the Automatic Failover check box and left the Delay property at its default value, 0 seconds. Finally, I clicked OK to close the window. I initiated my ping test on Superperson and pulled Jerry's plug. After 1 minute, I ran an nbtstat -n command on Ricky, and the failover group's virtual name, Jerry-0, appeared.
Co-StandbyServer offers scripts for clustering applications such as Microsoft Exchange and SQL Server. For my SQL Server failover test, I downloaded the beta SQL Server script from Vinca's Web site. After unzipping the downloaded file, I installed it in Jerry's Co-StandbyServer Applications directory. I copied SQL databases to my clustered volume and changed the databases' path in SQL Server's setup program. I opened Jerry's Applications folder in the Resources window, right-clicked the SQL Server script, and selected Cluster. In the Add Cluster Application dialog box that appeared, I entered the source computer name, the program files' path, the clustered drive's letter, the application's domain, and my administrator account's username and password. I assigned the application to Jerry-0 and clicked Finish.
I ran my SQL Server queries to Jerry, then powered down Jerry. I waited for the queries to continue, but the script wouldn't work. I called Vinca's technical support line, and a representative walked me through SQL Server failover setup. The only requirement I hadn't met was that you must install scripted applications on the same drive letter and directory on both computers. I reinstalled the script in the SQL directory on drive D on both Jerry and Ricky, but SQL Server failover still didn't work. The technical support people I spoke with were friendly, but we never got the SQL Server failover to work. Since my testing, Vinca released a final version of the SQL Server failover script, which Vinca says addresses the problems I had.
The Co-StandbyServer failover groups offer a unique failover method, but the product's system requirements (three dedicated hard disks per computer), vague setup instructions, and unreliable SQL Server failover make it a problematic clustering solution. Vinca offers 30-day trials of Co-StandbyServer; maybe you'll have better luck with the product than I did.
Double-Take 1.5 for Windows NT
NSI Software has given Double-Take for Windows NT a makeover since the Lab last tested the product. (To read that earlier review of Double-Take 1.3 beta, see Carlos Bernal, "Double-Take 1.3 Beta," June 1997.) The beta version I tested came on one CD-ROM with a printed manual. Double-Take didn't require unusual hardware configuration, so I installed the software and configured Jerry and Ricky, with one NIC each, in a shared-nothing cluster.
Double-Take 1.5 for Windows NT
Contact: NSI Software 888-230-2674
Price: $1875 per source server
System Requirements: x86 processor or better, Windows NT Server 3.51 or later, 6MB of hard disk space, 16MB of RAM (32MB of RAM recommended)
When I inserted the CD-ROM into Jerry, the installation wizard appeared and walked me through the installation process. A Custom installation option lets you choose monitoring and failover options, but I chose the Complete installation option. The installation process was easy, and it let me configure startup settings for Double-Take services. I chose the default settings, which included starting the Double-Take Source and Target and the Failover Source and Target services automatically. All four services must be running for Double-Take to fail over properly. I created placeholders on Ricky that Double-Take could assign Jerry's IP addresses to during system failover. The number of placeholders must exceed the maximum number of IP addresses Ricky might assume at once; I selected 20, the default setting. Double-Take's documentation didn't clarify whether I needed to manually configure the placeholders or Double-Take would configure them. I called NSI's technical support and learned that Double-Take dynamically installs the specified number of placeholders. The rest of the installation was uneventful. After restarting Jerry, I began the configuration process.
From the Start menu, I opened the Double-Take Client (DTClient), which configures and manages Double-Take servers. DTClient automatically detected and displayed Jerry and Ricky, the two computers I had installed Double-Take software on, as Screen 3 shows. The DTClient automatically detects only computers on the same network segment; you must add all other computers manually.
I selected Jerry to serve as the source server and chose the files I wanted Double-Take to replicate. You must create your replication set before connecting your source and target computers. I double-clicked Jerry in the Sources list in DTClient's left pane, which opened the Replication Set Explorer dialog box. I was immediately comfortable with the Explorer-like interface and additional check boxes, which let me select drives, directories, or subdirectories for replication. I chose the Data subdirectory of MSSQL as my replication set so that Double-Take would replicate Jerry's SQL Server databases to Ricky. I clicked Exit, Yes to save the replication set.
I connected my source and target machines by right-clicking Jerry on the DTClient's Sources list and selecting Manage Connections from the drop-down menu. The Connection Manager opened. I selected Jerry from the Sources drop-down menu, then selected Ricky from the Targets Available text box and clicked Connect. Double-Take connected Jerry and Ricky, and I clicked Start to start the replication process. Double-Take finished the replication quickly.
I opened the Failover Control Center through Ricky's Start menu. I selected Ricky from the Target Machine drop-down menu and clicked Add, then entered Jerry as my source computer's name. I selected the Jerry check box to indicate that Jerry was the source computer Double-Take must provide failover for. I also entered paths to post-failover and pre-failback scripts I had created. You create these scripts to start and stop services during failover and set other instructions or parameters. Ricky was soon monitoring Jerry; the Failover Control Center displayed a green light next to Jerry's name and primary IP address to demonstrate that Jerry was functional.
When I pulled the plug on Jerry to simulate a hardware failure, Ricky's Failover Control Center provided information about the failover process. The program's lights for Jerry and its IP address turned yellow and then red. The Failover Control Center provided status messages throughout the failover process, and it let me know when Ricky took control of Jerry's NetBIOS name and IP address. I used ipconfig and nbtstat -n commands on Ricky to verify the failover.
Failback is simple in Double-Take, and the manual clearly documents the process. I highlighted Jerry's name in the Failover Control Center's Monitored Machines list and clicked the Failback button. When I powered up Jerry, both servers returned to their pre-failover state.
Next, I tested Double-Take's SQL Server failover capability. My post-failover script included commands that started the MSSQLServer and SQLExecutive services on Ricky, so I configured the services to start manually. I verified that Double-Take had replicated Jerry's Data directory to Ricky, and I started sending queries to Jerry. Then, I induced hardware failure on Jerry and waited for Ricky to take over. In about 45 seconds, Ricky began responding to my test queries without a hitch.
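A post-failover script of this kind does little more than issue one net start command per service. The Python sketch below illustrates the idea and is not the script I used. It assumes the two SQL Server services are registered under the names MSSQLServer and SQLExecutive, and the function names are hypothetical.

```python
import subprocess

# Assumed service names; verify them in the Services applet before relying on them.
SQL_SERVICES = ["MSSQLServer", "SQLExecutive"]


def service_commands(action, services):
    """Build one `net start` or `net stop` command per service.

    Services start in the order given and stop in reverse order, so a
    dependent service is shut down before the service it relies on.
    """
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    ordered = list(services) if action == "start" else list(reversed(services))
    return [["net", action, name] for name in ordered]


def run_commands(commands, runner=subprocess.run):
    """Run each command in turn and stop at the first one that fails."""
    for command in commands:
        runner(command, check=True)
```

A post-failover step would call `run_commands(service_commands("start", SQL_SERVICES))`, and a pre-failback step would use `"stop"` instead.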
Double-Take was successful in all my tests and was easy to install and configure. The Failover Control Center provided informative status reports about Jerry. The one-button failback was wonderful, and it doesn't require you to reboot the target machine to release the source server's IP address and NetBIOS name. Double-Take's easy-to-use features and affordable price make it a powerful clustering solution.
FirstWatch 3.1.1 for Windows NT
Veritas forged into the NT clustering market by porting FirstWatch for Solaris to NT. FirstWatch 3.1.1 for Windows NT comes on one CD-ROM with a bound manual. FirstWatch's substantial hardware requirements surprised me. Both the active/standby and active/active configurations require multiple network cards, external hubs, and shared disks. I chose the active/active configuration and gathered the necessary hardware (including one single-port NIC and one Adaptec 6940TX Quartet four-port NIC for each computer). FirstWatch's hardware requirements translated into a daunting and time-consuming configuration process. Fortunately, the manual provides an Installation and Configuration checklist and other worksheets that helped me plan my cluster's configuration.
FirstWatch 3.1.1 for Windows NT
Contact: Veritas 650-335-8000 or 800-258-8649
Price: $2475 per server
System Requirements: Pentium processor or better, Windows NT Server 4.0, 15MB of hard disk space, 32MB of RAM (64MB of RAM recommended), five NICs per server, shared hard disk
I installed the FirstWatch software on Jerry and Ricky. Using the installation wizard, I chose Jerry as Symmetric Server A and Ricky as Symmetric Server B. I selected the NICs for each computer that I wanted to serve as the Heartbeat 1 and Heartbeat 2 connections. The wizard warned me that after I selected my Heartbeat choices, I could change them only by reinstalling the entire product. The planning worksheets I filled out before installation were invaluable at that point.
Next, I selected the NICs for the Primary Interface and Takeover Interface. I modified my Takeover NIC's information to match Jerry's primary IP address and host name. Here the manual becomes confusing. It states that you must preconfigure the Primary and Administrative NICs with real IP addresses, but that you need to establish only placeholders for your other three NICs (the Heartbeat and Takeover NICs). The manual clearly states that FirstWatch will assign real IP addresses to the NICs you give placeholders to; however, I discovered later in the installation process that you must eventually change the placeholders to real IP addresses. Because of the product's complicated setup and confusing documentation, I spent 2 days on planning and installation before I could open the FirstWatch software.
I selected Configure FirstWatch from the Start menu. You use the Configure FirstWatch program to configure each computer's network connection, shared disks, shares, and FirstWatch Agents, which are sets of scripts and programs that monitor a service and take action when that service fails. I configured my shared disk on Jerry first. I used the Disk Administrator to create two partitions on the shared disk. Then I selected each partition and clicked Configure. You can make each partition a Primary, Takeover, or Unshared disk; I chose Primary for each partition. I went through the same shared-disk configuration steps on Ricky, except that I chose Takeover for the partitions. The configuration program reminded me that FirstWatch requires you to remove drive letters from Primary and Takeover disks, so I followed the manual's instructions for removing the shared disks' drive letters.
FirstWatch uses two tools to monitor a cluster: the DOS-based High Availability monitor (HAmon), which Screen 4 shows, and a Web-based user interface (UI). The Web-based UI is currently the only option for remote administration of FirstWatch.
After setting up my FirstWatch cluster's shared disk and failover configurations, I tested simple IP address failover. When I powered down Jerry, the continuous pings immediately registered a Host not found error. About 8 seconds later, the pings started responding again. I used ipconfig on Ricky to verify that the server was hosting Jerry's primary IP address. I failed back the original IP address to Jerry through the HAmon by selecting Manage a FirstWatch Server, Online Primary, and Ricky. FirstWatch released control of Jerry's IP address and set Ricky and Jerry back to the state each was in before the failover (the state that FirstWatch calls Online Primary). FirstWatch didn't require me to reboot either machine during failback.
FirstWatch 3.1.1 does not offer failover for servers' NetBIOS names, but Veritas says it has added NetBIOS name failover to FirstWatch 3.2, which will be available soon. FirstWatch 3.1.1 provides failover for virtual names that you set up in the Configure FirstWatch dialog box on each server. On Jerry, I selected the Primary option and entered Jerry in the Name text box, then clicked Add. Next, I selected the Takeover option, typed Ricky in the Name text box, and clicked Add. I went through the same steps on Ricky, reversing the Primary and Takeover names. I opened HAmon and changed Jerry and Ricky to Online Primary status. I started my auto-ping program on Superperson, powered down Jerry, and waited for Ricky to take control of Jerry's virtual name. When Ricky began responding to the pings, I ran an nbtstat -n command on Ricky and found that Ricky had control of Jerry's virtual name.
Next, I tested FirstWatch's SQL Server failover. I installed SQL Server on Jerry and Ricky, copied my database to a shared disk, and changed my SQL Server options to point to the database on the shared disk. I started my SQL query program and pulled the plug on Jerry. The query program sent connection failure messages for about 20 seconds, then began accessing the database again. FirstWatch passed my tests with flying colors.
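A query-based failover test like the one above amounts to a retry loop: keep issuing the same query, count the connection failures, and stop when a query succeeds again. The Python sketch below illustrates that loop under stated assumptions. It is not the review's SQL query program. The names `count_failures_until_success` and `make_flaky_query` are hypothetical, and a stand-in function plays the part of the database connection.

```python
import time
from typing import Callable


def count_failures_until_success(query: Callable[[], object],
                                 max_attempts: int = 60,
                                 delay_seconds: float = 1.0) -> int:
    """Call query() repeatedly and return how many attempts failed before one succeeded."""
    failures = 0
    for _ in range(max_attempts):
        try:
            query()
            return failures
        except ConnectionError:
            failures += 1
            time.sleep(delay_seconds)    # wait before retrying
    raise TimeoutError(f"no successful query after {max_attempts} attempts")


def make_flaky_query(failures_before_recovery: int) -> Callable[[], str]:
    """Build a stand-in query that fails a fixed number of times, then succeeds."""
    remaining = {"count": failures_before_recovery}

    def query() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ConnectionError("connection failure")
        return "row"

    return query


# With one attempt per second, 20 failures would correspond to roughly 20 seconds of downtime.
print(count_failures_until_success(make_flaky_query(20), delay_seconds=0.0))
```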
FirstWatch worked efficiently once it was running. I like that the product offers easy failback without a system reboot, and the preconfigured Agents helped me set up a variety of applications. However, FirstWatch's configuration process is long and arduous, and the manual's unclear instructions make configuration even more difficult. The product's substantial hardware requirements are expensive, and its installation and configuration processes are more difficult than the installation and configuration of the other five products I tested.
When I first saw NCR's LifeKeeper 2.0 software, I shuddered at the package's size. The product's seven CD-ROMs and bound manual made me expect installation to be difficult. I was pleasantly surprised to discover that LifeKeeper's core programs were on one CD-ROM and that the other six CD-ROMs contained software recovery kits.
Contact: NCR 937-445-5000
Price: $750 per single- or dual-processor server
System Requirements: x86 processor or better, Windows NT Server 4.0, 66MB of hard disk space, 32MB of RAM, 800 x 600 screen resolution
You can use LifeKeeper with shared disks or NCR's Extended Mirroring, a separate replication product. The first method provides active/active clustering; the second method provides active/standby clustering. I tested the shared-disk, active/active method.
To create an active/active LifeKeeper cluster, you must configure at least one shared SCSI disk into partitions using Disk Administrator before you install LifeKeeper. I configured the shared disk, then began the installation process on Jerry. I inserted the installation CD-ROM and used Explorer to open the setup file. I entered my installation directory and clicked OK. The software began installing. I followed the same steps to install the software on Ricky.
After the installations were complete, I opened the Hardware Status and Administration program through Jerry's Start menu and created a communication link between Jerry and Ricky: I clicked the CommPath option linking Jerry and Ricky, and then Create to open the Communication Path Management dialog box. I elected to establish a Socket path (Serial port and Shared disk paths are your other choices). I identified Jerry as the source server and Ricky as the destination server. I selected Jerry's primary IP address as the Local IP Address and Ricky's primary IP address as the Remote IP Address. Finally, I selected the Create Bidirectional Communication Path check box, which activates the path between the source and destination servers. Soon, LifeKeeper's Hardware Hierarchical Administration program displayed a live link between Jerry and Ricky.
Then, I accessed LifeKeeper's Hierarchy Administration program through Jerry's Start menu, and selected Create Hierarchy from the Admin menu. The menu that appeared lists the types of hierarchies you can create. The product's two default choices are User-defined and Volume. The menu includes IP Address and SQL Server options after you install those modules. I selected Volume, and the Create Volume Hierarchy dialog box opened. I selected Jerry as my primary system and Ricky as my secondary system. I also selected the Automatic Switchback check box for Jerry. The Automatic Switchback option tells LifeKeeper to bring the Volume resource back into service when a system comes back up after a failure if the resource was running when the system failed. I selected the drive letter I wanted to identify the shared SCSI disk with and verified that the Bring in Service check box was selected. When I finished setting up the parameters, I clicked OK. The Volume hierarchy soon appeared in the Hierarchy Administration window, as Screen 5 shows.
Next, I installed LifeKeeper's IP Address and SQL Server modules. The installation wizard could have saved me this step by prompting me at the beginning of LifeKeeper installation to select which modules I wanted to install. I read the manual's chapter about the IP Address Hierarchy and realized that I needed to configure more IP addresses on Jerry and Ricky before I installed the IP Address option. Instead of connecting to a server's primary IP address, LifeKeeper users access the primary server through a switchable IP address. When the primary server fails, LifeKeeper transfers the switchable IP address to a secondary computer, where it replaces an IP placeholder. I had to create a switchable IP address for each server, then create placeholders on each server for the other server's switchable IP address to move to during failover. I opened the Control Panel Network applet and configured three placeholder IP addresses on each NIC: one for Jerry, one for Ricky, and one for a second NIC for local recovery. I expanded the system HOSTS file to include all of both systems' IP addresses, and I included the switchable address in my Domain Name System (DNS). Next, I created an IP address hierarchy in the Hierarchy Administration program. I consulted the manual to fill in the parameters in the Create IP Hierarchy dialog box. I assigned placeholder addresses to Jerry and Ricky from the Placeholder IP Address drop-down list, which listed the placeholder addresses I had just created, and I selected the Automatic Switchback check box for Jerry.
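The switchable-address scheme described above is easier to follow as a small model: each server holds a set of addresses, and a failover removes the switchable address from the failed server and puts it in place of a placeholder on the survivor. The Python sketch below is a toy model only, not LifeKeeper's implementation. The `Server` class, the `fail_over` function, and the 192.0.2.x addresses (a range reserved for documentation) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Server:
    name: str
    addresses: Set[str] = field(default_factory=set)


def fail_over(switchable_ip: str, placeholder_ip: str,
              failed: Server, survivor: Server) -> None:
    """Move a switchable address to the survivor, replacing its placeholder."""
    if switchable_ip not in failed.addresses:
        raise ValueError(f"{failed.name} does not hold {switchable_ip}")
    if placeholder_ip not in survivor.addresses:
        raise ValueError(f"{survivor.name} has no placeholder {placeholder_ip}")
    failed.addresses.remove(switchable_ip)
    survivor.addresses.remove(placeholder_ip)   # the placeholder gives up its slot
    survivor.addresses.add(switchable_ip)       # clients keep using the same address


# Hypothetical addresses: a primary and a switchable address on each server,
# plus a placeholder on Ricky reserved for Jerry's switchable address.
jerry = Server("Jerry", {"192.0.2.10", "192.0.2.11"})
ricky = Server("Ricky", {"192.0.2.20", "192.0.2.21", "192.0.2.31"})

fail_over("192.0.2.11", "192.0.2.31", failed=jerry, survivor=ricky)
print(sorted(ricky.addresses))
```

The model shows why users never notice the move: they connect to the switchable address, which stays the same regardless of which server currently holds it.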
I tested the cluster's IP failover by sending a continuous ping to Jerry's switchable IP address. When I pulled Jerry's plug, the pings failed for about 8 seconds, then resumed. I verified the IP failover by issuing an ipconfig command on Ricky. The switchable IP address was under Ricky's control.
Instead of failing over a NetBIOS name, LifeKeeper's LAN Manager module fails over a virtual name. You must install the LAN Manager module before you configure virtual name failover. I installed the module easily. I ran the continuous ping test and pulled the plug on Jerry. About 45 seconds later, the pings started again. I ran an nbtstat -n command on Ricky and saw that Ricky had control of LK0-JERRY, Jerry's default virtual name.
Finally, I tested LifeKeeper's SQL Server failover. The LifeKeeper SQL 6.5 Recovery Kit, which includes the SQL Server module and an instruction guide, made this test easy. The instructions are detailed and clear. I followed the installation steps to set up the SQL Server module. I chose Admin from the Hierarchy Administration screen, clicked Create Hierarchy, and selected SQL Server. The SQL Hierarchy Creation dialog box appeared. I chose my primary and secondary servers, Automatic Switchback, the SQL Server system ID, and a password. Then, I selected the SQL Hierarchy tab. I accepted the dialog box's default settings, and I clicked TCP/IP Socket and selected Jerry's primary IP address to create a dependency relationship between the IP address and the SQL instance. You must create this dependency relationship for LifeKeeper to fail over both the IP address and the SQL instance to the secondary server.
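A dependency relationship of this kind implies an ordering: the resources that something depends on must come into service first, and go out of service last. The Python sketch below shows one way to derive that ordering with the standard library's `graphlib` module (Python 3.9 or later). It is an illustration of the idea, not a description of how LifeKeeper orders its hierarchies. The resource names are hypothetical, and the `volume` dependency is an added assumption; the text above describes only the link between the IP address and the SQL instance.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Each resource maps to the resources it depends on.
# "ip_address" is the dependency described in the text;
# "volume" is an assumption added for illustration.
DEPENDENCIES: Dict[str, List[str]] = {
    "sql_instance": ["ip_address", "volume"],
}


def start_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return resources in an order that starts every dependency first."""
    return list(TopologicalSorter(dependencies).static_order())


def stop_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return the reverse order, so dependents stop before what they rely on."""
    return list(reversed(start_order(dependencies)))


print(start_order(DEPENDENCIES))   # sql_instance comes last
print(stop_order(DEPENDENCIES))    # sql_instance comes first
```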
I set up my SQL Server failover test, then pulled Jerry's plug. After about 45 seconds, connection failure messages appeared, and after another 30 seconds, the SQL queries were again successful. I used an ipconfig /all command to confirm that Jerry's IP address and virtual name had failed over to Ricky.
LifeKeeper 2.0 is a competitive product. The recovery kits are easy to install. The administration modules are difficult to understand at first, but within a few days I was comfortable with them. The product's documentation is thorough, and it answered many of my questions. However, installation would be much less time-consuming and frustrating if the beginning of the manual included a short section of basic information about the steps you need to take to install each LifeKeeper module. Since I reviewed LifeKeeper, NCR has released NCR Enterprise Pack 1.0, which the company claims provides a wizard that makes installation easier.
OctopusHA+ 3.0 is Qualix Group's clustering solution. The product comes on a CD-ROM with a user's guide. I was surprised to receive two copies until I realized that you must purchase two copies of the software to create a cluster. OctopusHA+ didn't require unusual hardware configuration, so I configured Jerry and Ricky in a shared-nothing cluster.
Contact: Qualix Group 650-572-0200 or 800-245-8649
Price: $1499 per server
System Requirements: x86 processor or better, Windows NT Server 3.51 or later (SASO option requires NT Server 4.0), 5MB of hard disk space for program files, plus hard disk space equaling 10% of the size of the data you mirror, 32MB of RAM
Before installing OctopusHA+, I had to complete two preinstallation tasks. I set up a user account for the installation process, then followed the manual's detailed instructions to verify that the account I planned to run the OctopusHA+ service under had the appropriate permissions.
When I completed the preinstallation tasks and verified that OctopusHA+ would run under my new account, I thought I was ready to install the software. However, the manual informed me that I had to secure a registration number from Qualix Group's Web site for each CD-ROM. I filled out a form on the Web site, and about 5 minutes later, I received an email with my registration numbers. I found this extra step disconcerting, because after half a day of working with OctopusHA+, I still hadn't installed the software.
I finally inserted the CD-ROM into Jerry, and the installation wizard opened. The installation process is quick, and it lets you push the software to servers across the domain. I clicked Get, and OctopusHA+ pulled up a list of the active servers in my domain. I selected Jerry, installed the software, clicked Install Another, and chose Ricky. Before long, the software had installed, and I rebooted both computers to start the OctopusHA+ service module.
I started the software's administration program, which Screen 6 shows, and set a specification. A specification is the path or path and filename OctopusHA+ requires to mirror a directory, share, or file from the source computer to the target computer. You can create as many specifications as you want, and you can exclude certain files, directories, or types of files from specifications. An entire chapter of the OctopusHA+ manual covers the process of setting up specifications. I clicked Maintenance, Add Specification to set specifications to two directories: the Data directory for SQL Server 6.5 and a directory called Test. I verified that the Mirror Files option was selected, then entered the Data directory's path and clicked Set Perm to set permissions for my specification's directories and files. I clicked Sub Tree to include all subdirectories that the Data and Test directories contain. I chose Ricky as my target site, entered the path of the files I wanted mirrored, and clicked Set Perm. I clicked Synchronize, then OK. Shortly thereafter, synchronization began, and the software gave me status information about the synchronization. I received a message when the synchronization finished. I enabled the Forward and Mirroring functions, so OctopusHA+ continuously mirrored data between Jerry and Ricky.
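The idea of a specification (a root path, an optional subtree, and a list of exclusions) can be expressed as a simple path test. The Python sketch below is an assumption-laden illustration of that idea, not OctopusHA+'s matching logic. The function name `is_mirrored`, its parameters, and the sample paths are all hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PureWindowsPath
from typing import Sequence


def is_mirrored(path: str, root: str, include_subtree: bool,
                exclude_patterns: Sequence[str] = ()) -> bool:
    """Decide whether a file falls under a mirroring specification."""
    file_path = PureWindowsPath(path)
    root_path = PureWindowsPath(root)
    if root_path not in file_path.parents:
        return False                      # outside the specified directory
    if not include_subtree and file_path.parent != root_path:
        return False                      # in a subdirectory, but subtree is off
    # Excluded file types are skipped even inside the directory.
    return not any(fnmatch(file_path.name, pattern) for pattern in exclude_patterns)


# Hypothetical paths checked against a hypothetical specification.
for candidate in (r"C:\MSSQL\Data\master.dat",
                  r"C:\MSSQL\Data\backup\old.tmp",
                  r"C:\Other\notes.txt"):
    print(candidate, is_mirrored(candidate, r"C:\MSSQL\Data", True, ["*.tmp"]))
```

`PureWindowsPath` handles Windows-style paths on any platform without touching the file system, which keeps the sketch runnable anywhere.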
Next, I enabled Jerry and Ricky for IP address failover. OctopusHA+ accomplishes IP address failover using Automatic Switch Over (ASO, the product's active/standby configuration) or Super Automatic Switch Over (SASO, the product's active/active configuration). The failover steps are easy in principle but frustrating in practice. You start the OctopusHA+ Client and attach it to the source system. You configure the source system's failover options. Then, you attach to the target system and configure its failover options. OctopusHA+ software must be running on the source and target machines.
The manual outlines how to set up SASO, but the instructions are ambiguous and out of sequence in places. I opened the OctopusHA+ Client on Jerry, clicked Clustering, and selected Source Options. I selected the IP Addresses to Forward tab and double-clicked Jerry's primary IP address. I clicked the Cluster to tab and selected Ricky from the Target Site drop-down menu. I clicked OK. I still had to attach to Ricky to configure its SASO options. I clicked Clustering, Target Options. The Clustering Options for the Target System dialog box appeared. I clicked Add Site (SASO) and kept the default settings. Next, on the Services tab I selected the MSSQLServer and SQLExecutive services. I clicked Add and selected Ricky as the system for OctopusHA+ to start the services on. Then on the Account tab I entered my servers' domain and my OctopusHA+ account name and password.
I began my IP address failover test. I cut power to Jerry, and the pings reported finding no host. About 45 seconds later, the pings responded successfully. I used the ipconfig /all command to verify the failover; Jerry's primary IP address was under Ricky's control.
Next, I failed back the IP address to Jerry: I opened the OctopusHA+ Client on Ricky and clicked Target Options in the Clustering menu. I selected the Take Over tab and clicked Jerry's name in the Added Names text box. I clicked Remove, then selected the Option tab. I selected the Disable option and rebooted Jerry. The IP address moved back to Jerry.
After Jerry rebooted, I wanted to test OctopusHA+'s SQL Server failover capabilities. The manual states that the product offers SQL Server and Exchange failover, but it contains no instructions explaining how to set up SQL Server failover. I visited Qualix Group's Web site and found no information about setting up applications for failover. I tried to set up SQL Server failover without instructions by piecing together the most likely configuration. I stopped the MSSQLServer and SQLExecutive services on Ricky and set them to start manually. I verified that I had mirrored my SQL Server database to Ricky and that the directories the database resided in had the same name on both servers.
I started my SQL Server test on Superperson. When I pulled the plug on Jerry, I received nonconnection errors. I waited 1 minute, but the SQL queries never recovered. I double-checked my SQL services settings, verified that communication between my test systems was active, then called Qualix Group's technical support. The representative I talked to was helpful, but we could not get the SQL Server failover to work.
OctopusHA+ offers many setup options, but its UI includes ASO and SASO options in the same dialog box, which can be confusing. The product's failback process is complicated; it would be much easier if OctopusHA+ offered a wizard to walk users through the failback process. The manual is large, but it doesn't provide much of the help I needed. I recommend that you get a copy of the OctopusHA+ demonstration CD-ROM and test the software on your network. Maybe you'll find the product less frustrating than I did.
Picking the Winner
I tabulated the results of my tests and reviewed all six clustering products' features; Table 1 shows the results. I gave the Editor's Choice award to NSI's Double-Take for Windows NT. This product supports NetBIOS name, IP address, and SQL Server failover, and it contains an easy-to-use replication tool. It is simple to install and use, and it supports one-to-many replication and failover.
Double-Take won in my tests, but no matter which clustering product you choose for your network, you can come out a winner. Given the power and flexibility of all of today's clustering solutions, you can easily guard your network from gorillas that threaten to bring down your key applications. That protection is worth a lot of bananas.