Clustering solutions let you boost the availability, reliability, and scalability of Windows-based servers. A typical clustering configuration that addresses availability consists of two or more like nodes, typically connected to a shared storage subsystem. When these nodes function in an active-active configuration, the workload is distributed between the cluster nodes; however, the typical purpose of a cluster is to permit failover in the event of an application or hardware failure, which calls for an active-passive configuration.
A variation on the active-passive clustering configuration uses data replication instead of shared storage. Data replication offers the possibility of clustering across a wide-area link. Products that support this type of cluster offer many appealing benefits beyond the disaster tolerance that geographically dispersed data affords. For example, the ability to specify different hardware configurations for each node decreases hardware costs. Additionally, the existence of a replicated data source can be extremely useful when you're performing data backup, application testing, and OS migration and testing.
I gathered three sample products that accommodate replication-based clustering with failover: Computer Associates' (CA's) BrightStor High-Availability Manager (formerly known as SurviveIT 2000); NSI Software's Double-Take for Windows 2000/NT 4.1; and Legato Systems' Legato Octopus for Windows NT and 2000 4.2. (Legato has since renamed its Octopus product Legato RepliStor and released version 5.0.) I tested these products' ability to maintain recoverability and high availability in a variety of failure situations. I also looked at unique capabilities that might make each product more suitable for a given application. VERITAS Software's VERITAS Cluster Server 2.0 for Windows 2000 and SteelEye Technology's LifeKeeper for Windows 2000 4.0 offer similar capabilities but were unavailable for testing.
The Test Environment
For the test environment, I used SHUNRA Software's SHUNRA\Cloud to emulate a T1 frame relay circuit between the two cluster nodes. The primary node was a Hewlett-Packard HP NetServer LT 6000r with six 550MHz Pentium III Xeon processors running Win2K Advanced Server as an application and file server. The secondary node was a 2-way 550MHz Pentium III server, also running Win2K AS.
To test failover capabilities, I used three unique scenarios for the server: that of a typical file server, that of a Microsoft Exchange Server 5.5 server, and that of a Microsoft SQL Server 7.0 server. For each scenario, I removed network connectivity to cause a node failure, then performed a manual failover (typically a button or menu option). During failover, I monitored the time each product required to detect and act on a failed node and evaluated the effectiveness of each product's procedures for recovering from a failure and starting a manual failover.
Installation and Configuration
As part of its simple installation process, BrightStor High-Availability Manager prompts you to select which of the software's four components (Server, Console, Alert, and Application Notes) you want to install. The product documentation recommends that you install all four components. After the 2-minute installation on my primary node, I needed to restart the server. I then repeated the installation process on my secondary node.
Double-Take's installation was equally straightforward, but the software also prompted me to answer a few optimization-related questions. I specified the size and location of a pagefile that Double-Take would use to buffer data during heavy replication traffic, and I specified that I would be using transactional applications (SQL Server and Exchange). The installation process created security groups for Double-Take administration and quickly installed the necessary files before prompting for a required reboot. I repeated the procedure on the secondary node.
Octopus's installation process was quick and easy, letting me install the software on the local and remote servers from one location in a matter of minutes. The process required minimal configuration and didn't require a server restart.
|BrightStor High-Availability Manager|
| Contact: Computer Associates * 631-342-6000 or 800-225-5224 |
Price: $2495 per replicated server
Data Replication and Mirroring
BrightStor High-Availability Manager uses replication tasks to define which data needs protection. You use the console's Replication Task Wizard to set up a new replication task. The wizard prompts you to select the server you want to protect, choose the server that will hold the replicated data, and choose from a range of network speeds for the link between the two servers. The wizard also prompts you to choose between Full Protection (with failover capabilities) and Data Protection Only and to select the folders to be replicated. I tested only the Full Protection option. Next, the wizard guides you through configuring communication-failure detection parameters, then displays a summary screen that contains details about the replication task you defined. From the summary screen, you can click Advanced Edit to open the Task Editor, which lets you view and modify a given task's steps. Finally, the wizard prompts you to start the new replication task.
Octopus offers five templates—called specifications—to help you define data-replication parameters. You use the Octopus Client, which Figure 1 shows, on the source computer to choose a specification. The File/Directory specification replicates specific files, shares, or directories. The Global Exclude specification, which takes precedence over all other specifications, lets you exclude certain files from replication. The Share specification replicates shares but not the contents of those shares. The Registry specification lets you replicate registry data to the secondary server. The DFS specification (for NT only) lets you define DFS shares that you want to replicate. After you choose one of the specifications and select the objects that you want to replicate, the software displays an Options dialog box, in which you can fine-tune file-protection, -permission, -deletion, and -exclusion characteristics. Finally, in the Synchronization window, you choose when and how to synchronize the data that will be replicated.
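The precedence rule for the Global Exclude specification can be illustrated with a short sketch. This is not Octopus code or its configuration syntax; the function name, the wildcard patterns, and the sample paths are all hypothetical, and the only behavior taken from the product description is that an exclusion overrides every other specification.

```python
from fnmatch import fnmatchcase


def is_replicated(path, include_patterns, global_excludes):
    """Return True if `path` would be replicated under the assumed rules."""
    # A Global Exclude match wins over every other specification.
    if any(fnmatchcase(path, pattern) for pattern in global_excludes):
        return False
    # Otherwise the path must match at least one include pattern.
    return any(fnmatchcase(path, pattern) for pattern in include_patterns)


includes = ["data/*"]  # hypothetical File/Directory specification
excludes = ["*.tmp"]   # hypothetical Global Exclude specification

print(is_replicated("data/reports/q1.doc", includes, excludes))     # True
print(is_replicated("data/cache/scratch.tmp", includes, excludes))  # False
```

Checking the exclusions first means that the order in which the other specifications are defined cannot accidentally re-include an excluded file.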
The first time you open Double-Take's Management Console, the software launches the Connection Wizard. This wizard helps you establish your source and target servers and configure a Replication Set. When the wizard is finished, you can click Advanced Options to bring up the Connection Manager window, in which you can adjust mirroring, scheduling, and failover options. If you want to deploy this product in a bandwidth-limited environment, you'll find Connection Manager's Transmit tab particularly interesting: On this tab, you can set bandwidth percentage limits, time limits, and byte limits to throttle bandwidth usage. Of the three tested products, Double-Take offers the most bandwidth control. After you close the Connection Manager window, the new Replication Set appears in the Management Console.
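To get a feel for what a bandwidth percentage limit means on a link like the emulated T1 in my test environment, the following sketch works through the arithmetic. It does not model Double-Take's actual throttling mechanism; the function names and the 100MB payload are assumptions chosen for illustration, and the only outside fact used is the nominal T1 line rate of 1.544Mbps.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate


def throttled_bytes_per_second(link_bps, percent):
    """Bytes per second available when replication may use `percent` of the link."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    return link_bps * percent / 100 / 8


def transfer_seconds(payload_bytes, link_bps, percent):
    """Idealized time to move a payload at the throttled rate (no protocol overhead)."""
    return payload_bytes / throttled_bytes_per_second(link_bps, percent)


rate = throttled_bytes_per_second(T1_BITS_PER_SECOND, 50)
print(rate)  # 96500.0 bytes per second at a 50 percent limit

# Hypothetical 100MB initial mirror at the same limit, in minutes.
print(round(transfer_seconds(100_000_000, T1_BITS_PER_SECOND, 50) / 60, 1))
```

Real transfers will take longer than this idealized figure because of protocol overhead and competing traffic.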
|Double-Take for Windows 2000/NT 4.1|
| Contact: NSI Software * 317-598-1174 or 888-674-9495 |
Price: Starts at $2495 per server; Double-Take for Advanced Server costs $4495 per server
How well the products detect a failure, perform the failover, and permit you to reinstate a repaired server directly determines the value that the products offer to your organization. For a comparison of key features, see Table 1, page 68.
Octopus can automatically transfer processing from a source system to a target system if the source system fails. Failure detection occurs through heartbeat communication between nodes; if you need more advanced triggering, you can use a third-party systems management application to monitor your systems and execute scripts to perform actions when necessary.
The Octopus client also lets you perform manual failover. Two failover methods, as well as a variant, are available. Automatic Switch Over (ASO) replaces the target machine's identity with that of the source, and Super Automatic Switch Over (SASO) adds the source's identity to the target machine while maintaining the target's identity. The variant, SASO - Alias, passes an alias, or virtual identity (consisting of a computer name and IP address), between the source and target. This method accommodates failover between source and target computers that exist on separate subnets and simplifies the process of recovering from a failover.
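The difference between the three methods comes down to which identities the target machine answers to after a failover. The sketch below models that outcome as a set of names. It is a simplified illustration, not Octopus behavior in detail: the function, the mode strings, and the machine names are hypothetical, and IP addresses are omitted for brevity.

```python
def identities_after_failover(mode, source, target, alias=None):
    """Return the set of names the target answers to after failover."""
    if mode == "ASO":
        # The target's own identity is replaced by the source's.
        return {source}
    if mode == "SASO":
        # The source's identity is added; the target keeps its own.
        return {target, source}
    if mode == "SASO-Alias":
        # Only the virtual identity moves; neither machine is renamed.
        if alias is None:
            raise ValueError("SASO-Alias requires an alias")
        return {target, alias}
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical machine names.
print(sorted(identities_after_failover("ASO", "SRV1", "SRV2")))
print(sorted(identities_after_failover("SASO", "SRV1", "SRV2")))
print(sorted(identities_after_failover("SASO-Alias", "SRV1", "SRV2", alias="FILES")))
```

Because the alias variant never transfers the source's real name, the source can come back online under its own identity, which is why recovery from that method involves fewer steps.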
In Double-Take, the target system monitors the source system for a failure by sending requests at a user-specified interval. When the source system is alive and well, it sends a reply. If the target system doesn't receive a reply, it counts a missed packet. (This functionality is similar to that of the Ping command.) The user-supplied value for the allowable number of missed packets multiplied by the monitor interval equals the failover timeout. On the Connection Manager's Failover tab, I set these values to three missed packets and a 5-second monitor interval—similar parameters to those of the other products I tested. Double-Take's Failover Control Center (which Figure 2 shows, along with the Double-Take Management Console) displays monitoring statistics and lets you control how and when failovers occur. You can specify additional IP addresses to monitor and make failover contingent on the failure of one or all of the monitored addresses. You can also require user intervention before the software initiates a failover.
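The timeout arithmetic is simple enough to check by hand, but a short sketch makes the relationship explicit. The helper below is hypothetical and its parameter names are illustrative rather than Double-Take setting names; it only multiplies the two values described above.

```python
def failover_timeout(missed_packets, monitor_interval_s):
    """Seconds of silence tolerated before a failover is triggered."""
    if missed_packets < 1 or monitor_interval_s <= 0:
        raise ValueError("both values must be positive")
    return missed_packets * monitor_interval_s


# Values from my test configuration: three missed packets, 5-second interval.
print(failover_timeout(3, 5))  # 15
```

With these settings, the target waits 15 seconds before acting on a silent source, so lengthening either value trades slower detection for fewer false failovers over an unreliable link.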
Compared with the other two products, BrightStor High-Availability Manager offers the most options for detecting problems and initiating failovers. The intelligent failover option permits participating servers to ping a known device (e.g., a router) to determine the location of a communication problem and act accordingly. Additionally, low disk space or a bad disk can initiate failovers. You can alter these settings in the Task Editor by editing parameters under the Failure Detection icon. During a failover, the software transfers the primary server's IP address to the secondary server, and the primary server assumes a user-defined name and IP address until it's ready for reinstatement.
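One plausible reading of the intelligent failover option is a small decision table evaluated on the secondary server. The sketch below is an assumption about how such logic could work, not CA's documented algorithm; the names `Action` and `decide` and the three outcomes are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    FAIL_OVER = "fail over to secondary"
    HOLD = "hold: secondary's own link is suspect"


def decide(primary_reachable, reference_reachable):
    """Assumed decision table for a secondary server (illustrative only)."""
    if primary_reachable:
        return Action.NONE
    # The primary is silent: use the known device to locate the fault.
    if reference_reachable:
        # The router answers but the primary doesn't, so the primary has likely failed.
        return Action.FAIL_OVER
    # Nothing answers, so the problem is probably on the secondary's side.
    return Action.HOLD


print(decide(primary_reachable=False, reference_reachable=True).value)
print(decide(primary_reachable=False, reference_reachable=False).value)
```

Checking a third device helps the secondary avoid claiming the primary's IP address when the secondary itself has been cut off from the network.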
|Legato Octopus for Windows NT and 2000 4.2|
| Contact: Legato Systems * 650-210-7481 or 888-853-4286 |
Price: $2499 per server
For each product, I performed network file operations against the primary file server while testing manual failover and loss-of-connectivity functionality. Manual and loss-of-connectivity failovers worked as expected on Octopus, each requiring about 40 seconds for the target server to stand in for the source. Detecting the connection failure accounted for 10 of those seconds. During a failover, a stream of file copies flows to the file server, and any file-copy operations that time out or fail to execute during this period are considered lost. Octopus lets you specify a Max Wait Time value in the Switchover menu's Target Options option. I set this value to 10 seconds: if the target system can't contact the source in 10 seconds, the target system concludes that the source has failed. This value worked in my test environment, but if you use only Octopus's built-in failover-trigger mechanism, an unreliable real-world WAN link could compromise the dependability of your solution. Manual or scripted failovers are probably a safer bet.
To return to original server definitions following a SASO - Alias failover, you remove the source system name from the target system's Added Names list, then reenable and synchronize appropriate specifications on the source system. This process is much simpler than the SASO and ASO recovery processes, which require more configuration of names and IP addresses.
BrightStor High-Availability Manager performed well with a manual failover, losing one file operation, but the loss-of-connectivity test resulted in an approximately 40-second delay before the secondary server took over from the primary server. The timing for detecting communication failures is a component of the replication task's link speed setting. CA warns against exaggerating the link speed because doing so can cause false failure detection. To reinstate the servers to their original roles, you access the main BrightStor High-Availability Manager screen on the target system, go to the Server menu, and run the Reinstate Wizard. The wizard asks you to specify which server you want to reinstate. Then, you can choose to schedule the operation, warn users before the operation, and restart the replication task after the reinstatement is finished.
Both the failover and reinstatement procedures automatically perform a reboot of the primary server—not a concern in the case of a failover, but note that following reinstatement, no server will act as primary server until the original primary server is completely rebooted. In my environment, the reinstatement and reboot finished in less than 4 minutes. BrightStor High-Availability Manager's failover and reinstatement process is more automated and user-friendly than those of Octopus and Double-Take.
Double-Take's manual failover process occurs almost instantaneously. The automatic loss-of-connectivity failover involves a fairly long Time to Fail value, but if you know failover is imminent, you can click the Failover button to shorten the wait. During a manual failover, the source computer's name and IP address don't change and the computer isn't removed from the network. Therefore, IP address and name conflicts occur. To return to normal after a failover, I clicked Failback in the Failover Control Center. Doing so immediately removed the source server's identity from the target, and the software prompted me to restart or stop monitoring of the source server. After restarting the source server, I instructed Double-Take to restart server monitoring. At this point, two different sets of data existed on the source and target systems, and I could either copy individual files manually from the target to the source or use Double-Take's Restoration Manager to restore the entire replication set. I chose the manual copy method, which restored all file operations that weren't lost during the failover timeout period from the target server to the source server.
|VERITAS Cluster Server 2.0 for Windows 2000|
| Contact: VERITAS Software * 650-527-8000 |
Price: Starts at $4995 per server; bundle pricing available
SQL Server Failover
Legato's Web site offers application scripts that facilitate failover configuration for servers running SQL Server or Exchange. I downloaded a self-extracting executable file that contained the necessary scripts and documentation for configuring both SQL Server and Octopus. I followed the documentation's recommendations for configuring the SQL Server machines, customizing the scripts, and creating a new specification for mirroring SQL Server data directories to enable effective failover. After discovering discrepancies between the scripts and the documentation's script examples, I called Legato support for clarification. A Legato support engineer sent me a simplified set of instructions and an upgrade to Octopus 4.2, build 330b, which let me use SASO - Alias to accomplish the SQL Server failover. After the 30-second manual failover process, an error message informed me that certain files were open during the failover and therefore warranted inspection before I used them in production. The process of failing over and reverting seemed easy after I repeated it a few times, but I recommend significant testing and documentation of the procedures specific to your environment to ensure success.
As I mentioned in the Installation and Configuration section, BrightStor High-Availability Manager loads Application Notes on the server during a typical installation. To configure the servers for SQL Server failover tests, I referred to Application Notes for Microsoft SQL Server 6.5 and 7.0. I ran the Replication Task Wizard to establish initial parameters for a task (i.e., select replicated folders and enable IP failover), then clicked Advanced Edit to open the Editing Task window and the BrightStor High-Availability Manager console, which Figure 3 shows. In the Task Editor, I set up a Workload to replicate my SQL Server data directory, specified destination paths for the data, and assigned the provided SQL Server script to failover and reinstatement actions on both servers. (The script includes code for both failover and reinstatement.) Failover times were similar to those in the file server tests, but the loss-of-connectivity test resulted in a few more missed transactions than the manual failover test. The failover and reinstatement processes for SQL Server performed flawlessly, requiring little preparation or intervention.
To configure SQL Server and Double-Take in my test environment, I downloaded the High Availability for Microsoft SQL Server Using Double-Take 4.x application notes and associated scripts from NSI Software's Web site and simply followed instructions. Configuring Double-Take's SQL Server failover and failback required more manual operations than BrightStor High-Availability Manager did, but the steps were straightforward and easy. However, I would have found the documentation easier to use if it had more clearly described which server I needed to perform certain actions on. I opened the Failover Control Center and clicked Edit Monitor. In the Monitor Settings window, I edited the two downloaded scripts and specified when Double-Take should execute them.
Double-Take responded with a quick failover process. After the failover timeout expired, the target server assumed the source server's identity, SQL Server services started up, and the target server started processing transactions. To fail back to the source server, I followed the same steps that I performed for file-server failback, except that this time I used the Restoration Manager to restore the entire replication set of data from the target system to the source. The software restored to the source server all file operations that weren't lost during the failover timeout period.
|LifeKeeper for Windows 2000 4.0|
| Contact: SteelEye Technology * 650-318-0108 or 877-319-0108 |
Price: Starts at $1400 per server
To configure the Exchange failover test for Octopus, I used the documentation and files from Legato's Application Script for Microsoft Exchange. Because the documentation discusses only the use of SASO without aliases, I used that method in my tests. Fortunately, the documentation's sample scripts matched the contents of the provided files. The process consists of creating several registry specifications to perform a one-time synchronization of Exchange-specific registry data between the two nodes and creating a File/Directory specification to synchronize and mirror Exchange data. You also need to configure before and after scripts to be executed on the target server during the failover and recovery phases. Carefully analyzing and implementing the lengthy list of steps took about an hour. The 1-minute failover went smoothly. I then needed to restart my mail client (Microsoft Outlook 2000) to reestablish communication with the stand-in mail server; after doing so, the failover was invisible from the user's perspective. The recovery process also works well—as long as you pay careful attention to the detail and order of the instructions.
To configure the Exchange failover tests in BrightStor High-Availability Manager, I referred to the accompanying Application Notes for Microsoft Exchange Server 5.5. In addition to outlining BrightStor High-Availability Manager's configuration process, the instructions provide details about installing and configuring Exchange on the target server. This process involves shutting down the source server and temporarily renaming the target server with the name of the source server. I could then use the source server's identity to install and configure Exchange on the target server. After I installed Exchange and the servers reassumed their original identities, I configured the Exchange replication task just as I had configured the SQL Server task. The Application Notes instructed me to perform minor edits to the provided Exchange script template, save it, and configure BrightStor High-Availability Manager to execute the script before and after failover and reinstatement on both servers. The process of configuring the target server isn't overly complex or time-consuming, but it requires your source server (which is probably a production machine) to be offline for a significant amount of time. The manual failover process worked quite well and required just less than a minute for the target server to become the active Exchange system. Mail clients running Outlook 2000 needed to restart that application to reestablish a connection to the mail server. The reinstatement process was also easy and problem-free and finished in about 4 minutes.
For Double-Take's Exchange failover tests, I downloaded the High Availability for Exchange Server documentation and associated scripts from NSI Software's Web site and used them to configure my servers. Like BrightStor High-Availability Manager, Double-Take requires that you install Exchange on the target server while it impersonates the source server, so the source server must be offline during installation. Double-Take includes a chngname.exe utility that you can use for this impersonation and during failover. After the Exchange installation and configuration are complete, the process is virtually identical to the configuration of Double-Take for SQL Server failover. The failover and failback procedures occurred quickly and without incident. The transition was so smooth during failover and failback that the Outlook 2000 mail client didn't require a restart to maintain connectivity with Exchange.
Which Is Right for You?
Octopus was somewhat more difficult to configure than the other two products, but as with most Legato products, the complexities are typically a trade-off for power and flexibility. The alias capability is a definite advantage in simplifying failover and recovery procedures. Octopus's limited methods for automating failover might seem inadequate, but Legato offers other products that integrate with Octopus for greater control over high-availability installations. And if you need to replicate registry subkeys, Octopus appears to be your only option. This capability's usefulness was apparent in how easily I configured the Exchange installation on the target server without needing to change its identity.
BrightStor High-Availability Manager is the easiest product to configure and manage, and it operated as expected throughout all phases of my testing—which is fortunate because the software's documentation is more of a pamphlet than a manual. The automatic handling of the source server during failover and reinstatement greatly reduces the chance of operator errors during these operations. BrightStor High-Availability Manager provides the most robust options for specifying and detecting failure conditions that trigger a failover. However, the product doesn't include bandwidth-management functionality to throttle the amount of bandwidth consumed during replication operations.
Double-Take boasted the fastest failovers and failbacks, and its Failover Control Center offers more flexibility for managing failovers than the other tested products. Double-Take's transmission-limiting options make the product an excellent choice for bandwidth-challenged environments. I experienced a fairly steep learning curve with Double-Take, mostly because the UI took some getting used to, but I had no trouble zooming around in it after a couple of days. The documentation is thorough and well organized. If you're considering replication between different platforms, Double-Take is your best bet because it supports the widest variety of OSs.
All three of these products achieve the common goals of mirroring and replicating data while enabling failover to a secondary server. Each supports command-line operations, which permit scripting and triggering of events through a separate monitoring application. The products' extended features and operational nuances will define the suitability of each solution to your environment.