Clustering and mirroring your Web servers for maximum uptime
WebDev - Windows NT Magazine

As Web master for Windows NT Magazine, I know that downtime is the absolute worst thing that can happen to a Web site. Several vendors have solutions to help prevent this problem. One such vendor, Valence Research, is developing a Web clustering solution, Convoy Cluster Software, that lets you balance your Web servers' load and make them fault tolerant. The product looked interesting and simple to implement, so I gave it a try.

In addition to setting up the Web cluster, I needed a way to make sure that both Web servers in our cluster were serving the same Web pages. Applications such as Octopus SASO can help you synchronize the information on both servers (for a review of SASO, see Carlos Bernal, "Octopus SASO 2.0," June 1997). However, I felt this product was overkill for data replication. After making a few inquiries, I decided to use Windows NT's directory replication.

Convoy Cluster
Convoy is simple to install and operate. If you follow the detailed directions, you can have a working cluster up and running in about 30 minutes. However, if you skip one vital step, as I accidentally did, your Web servers will start playing ping-pong with blue screens of death: one server covers for the other while it's down, but when the downed server comes back up, it causes the other server to go down. This cycle repeats indefinitely. Valence Research's technical staff was helpful in pinpointing the problem in the configuration I had set up. When I reinstalled and reconfigured the machines, everything worked.

You can set up Convoy on machines with only one NIC. However, if you want the machines to be able to talk to each other so that you can duplicate information, you need to install two NICs in each Web server. I configured my environment using two NICs so that I could use NT's directory replication. Convoy refers to the two NICs as the dedicated adapter card and the cluster adapter card.

Installing Convoy
Although I can give you a general sense of how to install Convoy, make sure you follow the installation directions to the letter. You install Convoy as a new adapter. The installer adds the Convoy Virtual adapter and a Convoy Driver protocol to your system. After the installation is complete, the Convoy Setup screen, which you see in Screen 1, automatically opens so you can enter your Convoy clustering variables. You use this screen to type in your cluster IP number, each server's dedicated IP number, the priority status of each server in the cluster (the lower the number, the higher the status), and how you want to distribute the cluster.
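To make the priority rule concrete, here is a minimal Python sketch of the values the setup screen asks for. The class, field names, host names, and IP addresses are hypothetical placeholders and not Convoy's actual configuration format. The uniqueness check is my own assumption about what a sane configuration looks like.

```python
from dataclasses import dataclass

# Placeholder address: the single IP that clients use to reach the cluster.
CLUSTER_IP = "192.0.2.10"


@dataclass(frozen=True)
class ClusterHost:
    name: str          # hypothetical label for the server
    dedicated_ip: str  # the server's own (dedicated) IP
    priority: int      # the lower the number, the higher the status


def by_status(hosts):
    """Return the hosts ordered from highest status to lowest."""
    return sorted(hosts, key=lambda host: host.priority)


def check_unique_priorities(hosts):
    """Assumed sanity check: no two hosts share a priority number."""
    priorities = [host.priority for host in hosts]
    return len(priorities) == len(set(priorities))


hosts = [
    ClusterHost("web2", "192.0.2.12", priority=2),
    ClusterHost("web1", "192.0.2.11", priority=1),
]

for host in by_status(hosts):
    print(host.priority, host.name, host.dedicated_ip)
```

Sorting on the priority number puts web1 first because its number is lower.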

The next step is to view the network bindings for all protocols in the Network applet of the NT Control Panel. While you're at this screen, you need to configure the bindings so that the Convoy Driver protocol can talk to the Convoy Virtual adapter and cluster adapter, but not to the dedicated adapter. You also need to configure the bindings so that TCP/IP can talk to the Convoy Virtual adapter and dedicated adapter, but not to the cluster adapter. For information on how to configure these bindings, refer to the Convoy documentation. In essence, you are creating a firewall because only Convoy knows how to talk directly to your machine via the Convoy Virtual adapter and cluster adapter. The outside world can't see or use the IP for your dedicated adapter.
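Because a wrong binding is easy to miss, it can help to write the required layout down as data and check your notes against it. The following Python sketch encodes only the rules described above. It does not read the real bindings from NT, so the `actual` argument is something you would fill in by hand from the Network applet.

```python
# Required bindings as described in the text: protocol -> adapters it may use.
REQUIRED = {
    "Convoy Driver": {"Convoy Virtual adapter", "cluster adapter"},
    "TCP/IP": {"Convoy Virtual adapter", "dedicated adapter"},
}


def binding_errors(actual):
    """Compare hand-entered bindings with the required layout.

    `actual` has the same shape as REQUIRED. Returns a list of
    human-readable problems; an empty list means the bindings match.
    """
    problems = []
    for protocol, required in REQUIRED.items():
        found = set(actual.get(protocol, ()))
        for adapter in sorted(required - found):
            problems.append(f"{protocol} is not bound to {adapter}")
        for adapter in sorted(found - required):
            problems.append(f"{protocol} must not be bound to {adapter}")
    return problems


# Example: TCP/IP was left bound to the cluster adapter by mistake.
mistake = {
    "Convoy Driver": {"Convoy Virtual adapter", "cluster adapter"},
    "TCP/IP": {"Convoy Virtual adapter", "dedicated adapter", "cluster adapter"},
}
for problem in binding_errors(mistake):
    print(problem)
```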

How Convoy Performs
To test Convoy, I simulated 50 simultaneous users requesting HTML pages from the cluster IP. Right off the bat, I could see the two machines sharing the load. When I made a page request from the Web cluster, Convoy built some of the page from one server and the rest from the other. I was able to verify this load sharing because my two development machines didn't have the same version of Web pages when I started the test. I then increased the number of simultaneous users to 75, and the machines just kept purring. For reference, the first server is an Intergraph Web-300, 200MHz Pentium Pro with 128MB of RAM. The second is an Intergraph Web-300, 150MHz Pentium Pro with 64MB of RAM. In my environment, I couldn't create enough client requests to slow down these machines.
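If you want to run a similar test yourself, a rough equivalent can be sketched in a few lines of Python. This is not the tool I used for the test above. It is only an illustration of simulating simultaneous users with a thread pool, and the URL in the usage note is a placeholder.

```python
import http.client
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Request one page; return True on HTTP 200, False on any error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        return False


def simulate_users(url, users=50, requests_per_user=10, fetch=fetch):
    """Run `users` concurrent workers, each requesting `url` repeatedly.

    Returns a (successes, failures) tuple. The `fetch` parameter can be
    replaced with a stub for dry runs.
    """
    def one_user(_user_number):
        return sum(1 for _attempt in range(requests_per_user) if fetch(url))

    with ThreadPoolExecutor(max_workers=users) as pool:
        successes = sum(pool.map(one_user, range(users)))
    return successes, users * requests_per_user - successes
```

A call such as `simulate_users("http://cluster.example/index.html", users=75)` would point 75 workers at the cluster IP. Keep in mind that one client machine can only generate so much traffic, which is the same limit I ran into.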

To provide fault tolerance, Convoy redirects incoming traffic to another server in the cluster when the software detects that the first server is not responding. To determine which servers are active, the clustered machines periodically exchange broadcast messages with each other. This communication lets each machine know the status of the other members in the cluster. When the status changes, such as when a server fails or leaves the cluster, Convoy invokes a convergence. In Convoy terms, a convergence is when the cluster reestablishes itself so that it can redistribute the load. Convoy invokes a convergence every time you add or remove a server from the cluster.

By default, each server broadcasts a message every second to monitor the status of the cluster. The cluster waits five seconds (five missed messages) before it initiates the convergence. The software takes another five seconds to redistribute the load, so the nominal failover time is 10 seconds. You can adjust these parameters as needed, but the default values work well without making the process too slow or overburdening the network. When I tested the fault tolerance, it worked every time. I could stop the Web service or shut down one of the servers in the cluster, and the remaining machine took over the entire load. With the default settings, however, my failover times were closer to 15 seconds than to 10. During that time, the site will experience a few failed connections, but these losses beat having to reboot or fire up another machine. Overall, I was pleased with the way the cluster performed.
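The failover arithmetic is simple enough to write down. The Python sketch below uses the default values quoted above. The function names and the bookkeeping of "last heard" times are my own illustration and not Convoy's internals.

```python
HEARTBEAT_INTERVAL = 1.0  # seconds between broadcast messages (default)
MISSED_LIMIT = 5          # missed messages before convergence starts
CONVERGENCE_TIME = 5.0    # seconds to redistribute the load


def expected_failover(interval=HEARTBEAT_INTERVAL, missed=MISSED_LIMIT,
                      convergence=CONVERGENCE_TIME):
    """Detection time (interval * missed) plus redistribution time."""
    return interval * missed + convergence


def silent_hosts(last_heard, now, interval=HEARTBEAT_INTERVAL,
                 missed=MISSED_LIMIT):
    """Return the hosts that have been silent for `missed` intervals.

    `last_heard` maps a host name to the time (in seconds) its most
    recent message arrived.
    """
    limit = interval * missed
    return sorted(name for name, heard in last_heard.items()
                  if now - heard >= limit)


print(expected_failover())  # 1.0 * 5 + 5.0 = 10.0 seconds
print(silent_hosts({"web1": 100.0, "web2": 94.5}, now=100.0))
```

Halving the interval to 0.5 seconds would cut the detection portion to 2.5 seconds, at the cost of twice as many broadcast messages on the network.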

However, when I stopped requesting simple HTML documents and started requesting data-driven pages, the picture changed. The Cold Fusion pages on our Web site didn't cause a big bottleneck, but our forums area did. The forums package we use, Allaire Forums, does some fantastic things, but they come at a cost: Allaire Forums is a resource hog. Granted, most users who visit our forums don't go click crazy like my test did, but what better place to see load balancing?

Allaire Forums consists of a lot of Cold Fusion pages that make calls to the SQL Server back end. During this portion of my tests, Convoy stopped balancing the load between the clustered machines. Our SQL Server is an Intergraph InterServe 660 Quad 200MHz Pentium Pro with 512MB of RAM, so I knew that the machine wasn't the problem. The problem began when I simulated 20 users attacking the forums area. The Cold Fusion service that runs the forums choked on one of the machines. This lackluster performance is unfortunate, but even more unfortunate is that the other machine didn't take up the load. When the first machine was pegged at 100 percent CPU, the second machine was just idling at 20 percent. In this scenario, I would rather have seen both machines cruising or both pegged; at least I would have known that the cluster was truly load balancing in all instances. Cold Fusion appears to be the culprit in this test, but Convoy should have been able to cover for it.

Convoy's fault tolerance worked as advertised and at a recovery rate I could more than live with. However, I would like to see the software load balance in every situation. The product is still in beta, and I'm hoping that Valence can address this issue of load balancing certain types of pages, such as the data-driven pages in our forums, before the final release.

Directory Replication
When I started testing Valence Research's clustering solution, my two Web machines didn't have the same version of content. This situation is never ideal in a clustering environment, so I had to remedy it. I could have just dragged the root Web directory from one machine to the other, but this fix wouldn't address how I'd keep both machines mirrored in a working environment. I didn't want to have to remember to put the same file on both machines each time I work on one, so I needed a tool to automate this process of replicating the information.

NT's directory replication feature lets you maintain identical directories and files on different servers and workstations across domains. Maintaining identical data on separate machines is easy because only one master copy of the data exists, and all the computers synchronize their data from that master copy. The master copy is the export server, and all other computers are import servers. I have only two machines (one export server and one import server), although you can have multiple import servers. The export server can export only one directory tree, so I exported the entire Web root directory.

You can configure directory replication either to replicate changes to the import servers whenever you change any file or to wait for a two-minute stabilization period. I stuck with the default two-minute stabilization period.
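The idea behind a stabilization period is to hold off until the files have stopped changing, so that a half-finished batch of edits doesn't get copied. The following Python sketch shows that idea only. It is not how the Directory Replicator service is implemented, and the function name is hypothetical.

```python
STABILIZE_SECONDS = 120  # the two-minute default mentioned above


def ready_to_replicate(change_times, now, stabilize=STABILIZE_SECONDS):
    """Decide whether pending changes have settled.

    `change_times` lists the times (in seconds) of changes not yet
    replicated. Replication is due only when there is at least one
    pending change and the newest one is `stabilize` seconds old.
    """
    if not change_times:
        return False
    return now - max(change_times) >= stabilize


# Files changed at t=0 and t=30; the clock restarts at the last change.
print(ready_to_replicate([0, 30], now=149))  # False
print(ready_to_replicate([0, 30], now=150))  # True
```

With this rule, every new change resets the two-minute clock, which is why a steady stream of edits can delay replication.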

To make everything work, you need to create a special user account for the Directory Replicator service to use. (Everything I've read about directory replication says that you must create this account. However, I wasn't able to get the Directory Replicator account to work without Domain Administrator privileges, so I simply used the Domain Administrator account to enable directory replication.) You can't use the name Replicator for the Directory Replicator account because NT already uses that name for a built-in domain group. To set up the special user account, you need to log on to the domain of the export server as a Domain Administrator. Next, you start the User Manager for Domains and choose Users, New User. When you see the dialog box for the new user, enter the values you see in Table 1. Choose Groups. The new account is already a member of Domain Users, but you need to add it to the Backup Operators and Replicator groups. Click Add to create the new Directory Replicator account, and click Close.

Next, go to Control Panel and select the Services applet. Highlight the Directory Replicator service and click Startup. Set the Startup Type to Automatic. Select Browse to the right of the This Account field to see the Add User dialog box. Highlight the Directory Replicator account you just created, click Add, and click OK. Enter the password information into the two password fields, and click OK to save. You'll see a message confirming that NT has set up your account to use directory replication at login. Start the Directory Replicator service by clicking Start from the Services applet in Control Panel.

Now that you've created the account, you need to set up directory replication on each Web machine. First you configure the export server. Select the Server applet from the Control Panel, choose Replication, and click Export Directories. You probably won't be replicating the default directory, so type in your Web machine's directory that contains the subdirectories and files you want to export. I specified c:\wwwroot, as you see in Screen 2. In the To List box, add the import machines, and click OK. You perform similar steps on each import server, except that you enter the import information instead of the export information.

After you finish setting up your import servers, you can test your setup. Copy a file into the directory you want to replicate on your export server. If you don't see the same file in the same directory on your import server within minutes, something is wrong. As I mentioned before, I had to change the login of the Directory Replicator account to Domain Administrator before I started seeing files replicated to the import server.
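Checking one file by eye is fine for a first test, but for a whole Web root you may prefer to compare the two trees with a script. Here is a minimal Python sketch using only the standard library. It assumes you can reach both directories as paths from one machine, and the share name in the usage note is hypothetical.

```python
import filecmp
import os


def replication_diff(export_dir, import_dir):
    """List files under export_dir that are missing or different in import_dir.

    Returns a sorted list of (status, relative_path) tuples, where status
    is "missing" or "different". An empty list means the import copy
    contains every exported file with identical contents.
    """
    problems = []
    for root, _dirs, files in os.walk(export_dir):
        rel_root = os.path.relpath(root, export_dir)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name))
            source = os.path.join(root, name)
            target = os.path.join(import_dir, rel)
            if not os.path.isfile(target):
                problems.append(("missing", rel))
            elif not filecmp.cmp(source, target, shallow=False):
                # shallow=False compares file contents, not just size and date
                problems.append(("different", rel))
    return sorted(problems)
```

For example, `replication_diff(r"c:\wwwroot", r"\\web2\wwwroot")` would compare my export directory with a share on the import server. Run it a few minutes after a change so that replication has had time to finish.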

When you can confirm that NT is replicating your files to your import servers, you're finished. Now, anytime you make a change to a file in the export directory on your export server, the changes will automatically appear on each of your import servers.