Fault-tolerant software made easy

How would you like a software-based fault-tolerant clustering solution that is easy to set up and use and will work on almost any server (including servers you might already own)? Vinca StandbyServer for NT uses Windows NT's disk mirroring capabilities to mirror a hard disk partition on your primary server to your backup server over a high-speed link. If your primary server fails, the backup server takes over, using the mirrored data until you fix the problem with the primary system and bring it back online.

To run StandbyServer, you need two servers with similar capabilities, two Intel Pro/100B Ethernet controllers, and a free hard disk partition on the backup system that is the same size as the partition that you want to make fault-tolerant (i.e., your data partition). After you have all this equipment, setup is easy. You just pop an Intel Ethernet controller into each server, connect the two systems via the provided cable, install the software, and mirror the fault-tolerant partition on your primary system to the backup partition on your secondary system using NT's Disk Administrator.

Vinca's interface is easy to use and runs on your standby system. As Screen 1 shows, the StandbyServer Manager main window gives you the system status. From StandbyServer's interface, you can also turn the failover capabilities on or off (as Screen 2 shows), fail over to the standby server, fail back to the primary system, or configure the system.

If you need to fail over an Internet server or software that requires a specific IP address, you can configure Vinca to fail over the primary server's IP address to the backup server. You can also run command (or batch) files when the system detects a problem with the primary server, when the system is failing over to the backup server, after the system fails over, and when it switches back to normal operation.
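If you want a record of when each of these command files ran, one option is to have every file call a small logging script. The Python sketch below is only an illustration and is not part of StandbyServer: the four stage labels, the function names, and the log path are hypothetical names chosen to match the four events described above.

```python
import sys
from datetime import datetime

# Hypothetical labels for the four points at which a command file can run.
# These are illustration names, not identifiers defined by Vinca.
STAGES = ("problem-detected", "failing-over", "failed-over", "back-to-normal")


def format_entry(stage, when):
    """Build one log line: an ISO timestamp followed by the stage label."""
    if stage not in STAGES:
        raise ValueError("unknown stage: %r" % (stage,))
    return "%s %s" % (when.isoformat(timespec="seconds"), stage)


def record(stage, path, when=None):
    """Append a timestamped entry for `stage` to the log file at `path`."""
    line = format_entry(stage, when or datetime.now())
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line


# Command-line use (assumed layout): script.py <stage> <log file>
if __name__ == "__main__" and len(sys.argv) == 3:
    record(sys.argv[1], sys.argv[2])
```

Each command file would then contain a single line that runs the script with its own stage label and a shared log file, so that after a test you can read off how long each phase of the failover took.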

I tested StandbyServer on two IBM PC Server 704 systems (each with one 200MHz Pentium Pro, 128MB RAM, two Intel Pro/100B NICs, and two Adaptec 2940W SCSI controllers) with Microsoft's SQL Server. Getting SQL Server set up properly can be tricky. You must install SQL Server on both systems and have it use the same drive letter for database storage. I installed SQL Server on the primary system and told it to use the mirrored hard disk (which I set up as drive V) to store its databases. Then I went to the StandbyServer Manager and failed over the primary system. In this mode, the standby system looked like the primary system, and the mirrored partition was available to the backup system with the same drive letter. I then installed SQL Server on the standby system, again using the mirrored drive (drive V) for the databases. After SQL Server was set up on both systems, I brought the primary server back online and remirrored the two systems using NT's Disk Administrator.

On the standby system, I had to configure SQL Server to run only when the primary system fails. From the Services dialog box shown in Screen 3, I set the SQLExecutive and MSSQLServer services on the standby system to start up manually. From Options/Services in the StandbyServer Manager, I selected these two services to start when a failover occurs.

Now that everything was ready to go, I had to arm the system. From the StandbyServer Manager dialog box you see in Screen 4, I selected Armed. I now had a fully working fault-tolerant system running SQL Server.

The Moment of Truth
To initially test the configuration, I set up a command (or batch) file on two client machines. My test file queried the SQL Server sample pubs database and scrolled the output in a command window. This test helped me determine SQL Server's status: Scrolling text indicated SQL Server was working; no text meant no SQL Server.

I set up the following SQL command file on each client system:

:LOOP
ISQL -S VINCAA -U sa -P -Q "select * from pubs..employee"
GOTO LOOP

After I started approximately 12 instances of this command file on each client, I shut off the primary server (that's right, I just shut it off) to simulate a system failure. Would the system fail over as it was supposed to, or would it just sit there? I expected it to just sit there. Well, the network link indicator on the main window turned red. Then, the Vinca link indicator turned red, and text started rolling by in the Status window. Vinca then tried to stop the remote procedure call (RPC) service for 6 or 7 minutes and failed over, just as it was supposed to.
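Watching for scrolling text tells you whether SQL Server is answering, but not exactly how long it was away. A minimal sketch of how you might time the outage instead appears below. It is written in Python rather than as a command file, it is not what I ran in this test, and the names `watch` and `isql_probe` are hypothetical. The probe reuses the same ISQL command line as the loop above and treats any output as "working."

```python
import subprocess
import time


def isql_probe():
    """Return True if the ISQL query from the article produces any output."""
    cmd = ["isql", "-S", "VINCAA", "-U", "sa", "-P",
           "-Q", "select * from pubs..employee"]
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False  # isql missing or hung: count it as "no SQL Server"
    return done.returncode == 0 and bool(done.stdout.strip())


def watch(probe, checks, interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call probe() `checks` times and return a list of (start, end) outages.

    An outage starts at the first failed check and ends at the next
    successful one. An outage still open after the last check is
    reported with end set to None.
    """
    outages = []
    down_since = None
    for _ in range(checks):
        now = clock()
        ok = probe()
        if not ok and down_since is None:
            down_since = now                   # first failed check
        elif ok and down_since is not None:
            outages.append((down_since, now))  # service answered again
            down_since = None
        sleep(interval)
    if down_since is not None:
        outages.append((down_since, None))
    return outages
```

Calling `watch(isql_probe, checks=600, interval=1.0)` on a client would poll for roughly ten minutes or more, depending on how long each query takes. Subtracting each start from its end gives the outage length in seconds, accurate to about one polling interval.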

OK, now I had to bring the primary server back online. So I turned on the primary server, thinking it would automatically fail back. The system booted up and spit out a few error messages saying that something had failed to start. So I looked in the event log to see what was going on and found that several services had failed to start. I decided to look in the manual to see what I had done wrong. I discovered that the backup machine now thought it was the primary system, right down to its NetBIOS name. Guess what? NT doesn't like two systems on the same network having the same NetBIOS name, so it doesn't start certain services. To return the system to normal, I had to fail back the standby system first and then bring up the primary system. This procedure means that the cluster is unavailable from the time the standby server has failed back until the primary server comes back online.

After I failed the system back to the primary server, I needed to remirror the hard disks to set the system back up, which you can do while users are logged on to the system. I fired up NT's Disk Administrator, selected the Vinca mirrored drive set, broke the mirror, then committed the changes. The system prompted me to reboot (to load a non-fault-tolerant disk driver). But I used a trick to prevent Disk Administrator from rebooting. When I saw a dialog box with the message, "The changes you have made require that you reboot the system, press OK to reboot the system," I pressed Ctrl+Alt+Del to bring up the Windows NT Security dialog box. I ran Task Manager (or you can type Alt+T), selected the Applications tab, selected Disk Administrator, and clicked End Task (or type Alt+E) to close the Disk Administrator. This procedure aborted the reboot.

When a mirrored drive set is broken, NT leaves the drive with the most current data mapped to the original drive letter (drive V, in my case) and maps the drive with the older data to the next available drive letter. This mapping ensures that any application accessing the hard disk gets the most current data. I ran Disk Administrator again, selected the drive whose letter differed from that of the original shared drive, and deleted that partition, which held the obsolete data. Next I selected Commit changes now and was ready to remirror the system.

After playing around with these simple tests, I ran Bluecurve's Dynameasure benchmark package to perform a more thorough test. Dynameasure tests reading and writing to a large database and can simulate thousands of users. For this test, I configured Dynameasure to simulate 100 users. Vinca handled these tests well, although when hit hard, it took a long time to fail the standby server back to standby mode. In one test, this procedure took 20 minutes. (For more information about how I tested this clustering solution, see the sidebar, "Testing Wolfpack, LifeKeeper, StandbyServer for NT, and NT Cluster-in-a-Box.")

In a Perfect World
The Vinca clustering solution has only a few drawbacks. The first has to do with its design as an active/standby solution. You can connect users to the backup server, but if the system fails over, all the users connected to the backup system are disconnected and must reconnect using the primary server's ID instead of the backup server's ID. I would like the product (I know, I get a good product and I want more!) to fail over from either machine, to keep the users on the backup system connected when failing over, and to remirror the systems more easily. Vinca plans to include these features in the next release, due out at the end of this month.

Overall, this product is an easy-to-use, fault-tolerant solution that keeps downtime to a minimum. If your shop has or buys a second server just in case something happens to the first one, or if you want to make your site more fault tolerant without spending tons of money, Vinca is an ideal solution for you. Get a demo version from Vinca's Web site, and try it out.

Vinca StandbyServer
Vinca
801-223-3100 or 800-934-9530
Web: http://www.vinca.com
Price: $3995

IBM PC Servers
IBM
520-574-4600 or 800-426-3333
Web: http://www.us.pc.ibm.com/server