Achieving the highest possible uptime with Exchange Server clustering is hard work. Installing a cluster isn't difficult. However, the hard work begins when you want to achieve a stable production environment that is properly configured, manageable, and ready for the worst situations.

Microsoft Cluster Server (MSCS), most of the hardware used for Windows NT clustering, and Exchange Server 5.5 are now mature products. Although in this article I can't provide all the steps for installing Exchange on an NT cluster, I give you guidelines, flag potential pitfalls, and suggest resources for good advice. If you follow these tips and do your homework, you'll improve the likelihood of a successful and highly available Exchange Server implementation.

The Evolution of NT Clustering
Since Microsoft officially released MSCS with NT Server, Enterprise Edition (NTS/E), the only updates for the product have been bug fixes and scalability improvements. In NT 4.0 Service Pack 4 (SP4) and SP5, Microsoft rewrote the fileshare resource DLL to handle automatic file sharing of directories under a root share.

In 1999, Microsoft launched its Exploring Windows Clustering Technologies Web site, which is a good starting point for NT clustering information. The sidebar "For More Clustering Information," page 10, includes the URL for this Web site and other sources of information about NT clustering. The Web-exclusive sidebar "Cluster Basics" gives an overview of NT clustering. (To access this sidebar, go to http://www.exchangeadmin.com, select the April 2000 issue, and then select this article.)

Planning and Testing
Clustering can significantly reduce the effects of hardware failures. However, to benefit from clustering's improvements, you must plan and test your solution carefully. When you're dealing with Exchange clusters, you must master five sets of technologies: NT Server, MSCS, server hardware, Exchange Server, and SCSI-based storage subsystems. If you're uncomfortable with hardware and storage systems, consider hiring an experienced technician for that part of the configuration. If you miss one point in any of these areas, you're heading for trouble. A good project plan must include at least the following steps:

  • Setting up the cluster and documenting the process (e.g., hardware; planning the sequence of events; organizing drivers, installation checklists, and service packs)
  • Testing cluster functions and documenting troubleshooting procedures (e.g., disk, cabling, private network, public network, switches)
  • Documenting installation of Exchange Server in a cluster
  • Testing and documenting manual failover
  • Testing and documenting failover (including client behavior) when using the Microsoft Windows NT 4.0 Resource Kit kill.exe utility to stop Exchange services
  • Testing and documenting procedures in the case of database corruption
  • Testing and documenting procedures with database restoration of the Information Store (IS) and especially the Directory Store
  • Operating procedures
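
One way to document client-visible behavior during the failover tests in this list is to poll the Exchange alias from a workstation and record how long it stays unreachable. The following Python sketch is illustrative only: the function names, the polling interval, and the idea of probing a TCP port are assumptions, not part of MSCS or Exchange.

```python
import socket
import time


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_windows(samples):
    """Turn [(timestamp, reachable), ...] into a list of (start, end) outages.

    An outage starts at the first failed sample and ends at the next
    successful one. An outage still open at the end of the data is ignored.
    """
    windows = []
    down_since = None
    for stamp, reachable in samples:
        if not reachable and down_since is None:
            down_since = stamp
        elif reachable and down_since is not None:
            windows.append((down_since, stamp))
            down_since = None
    return windows


def watch(host, port, duration, interval=1.0):
    """Poll host:port for `duration` seconds and return the observed outages."""
    samples = []
    deadline = time.time() + duration
    while time.time() < deadline:
        samples.append((time.time(), is_reachable(host, port)))
        time.sleep(interval)
    return outage_windows(samples)
```

For example, you might start `watch("EXCHANGE1", 25, duration=300)` just before a manual failover, where EXCHANGE1 is a hypothetical Exchange alias and port 25 applies only if the IMS is installed. Each returned (start, end) pair gives one interval during which clients couldn't connect.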

Exchange Server 5.5 Clustering
Exchange Server uses an active/passive clustering model; that is, Exchange is active on only one node in a cluster at a time. (Exchange 2000 Server will support a four-node active/active configuration but not data sharing.) If you want to maximize your investment in hardware, you can use the passive node for other purposes (e.g., file sharing); however, in case of failover, each node must be able to handle the load from both file-sharing and Exchange services.

Exchange Server is cluster-aware primarily during the setup process. No custom resource DLL handles Exchange during operations; the generic service DLL that ships with MSCS handles all services. Furthermore, when Exchange initiates outbound connections, it uses the IP address of the physical node, not the IP address related to the Exchange alias. For example, if the Internet Mail Service (IMS) delivers email through a firewall, you must configure the firewall to allow outbound connections from the IP addresses of both physical nodes.

To install Exchange in a cluster, you need NTS/E and Exchange Server 5.5, Enterprise Edition (Exchange 5.5/E). So if you have two computers, you need two NTS/E licenses. According to Microsoft, you also need two Exchange 5.5/E licenses, one for each computer on which you install Exchange.

Hardware
Evaluating your hardware and the technologies to use for clustering can be a lengthy process. The rules for choosing memory, disk sets, and other hardware are the same as for any Exchange Server installation. You can obtain more information about configuring Exchange Server in the TechNet articles "Deployment and Configuration Guide: MS Exchange Server on Compaq ProLiant Servers" (http://www.microsoft.com/technet/exchange/technote/compaqgd.asp) and "Exchange—Deploy" (http://www.microsoft.com/technet/exchange/deploy.asp).

You need separate physical disks for the IS and the transaction logs. Disk I/O is the key performance factor in an Exchange Server installation. Therefore, if you need 70GB for the ISs and you've decided to use RAID 5, then nine 9GB disks or five 18GB disks perform far better than three 36GB disks, because the I/O load is spread across more spindles. (A RAID 0+1 array is a much better performance alternative, if your budget allows the investment.)
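
The arithmetic behind this comparison can be checked with a short sketch. It assumes the usual RAID 5 rule that one disk's worth of capacity in the set goes to parity; the function name is illustrative.

```python
def raid5_usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a RAID 5 set: one disk's worth goes to parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


# The three layouts compared in the text: (number of disks, size in GB).
layouts = [(9, 9), (5, 18), (3, 36)]

for count, size in layouts:
    usable = raid5_usable_gb(count, size)
    print(f"{count} x {size}GB: {usable}GB usable across {count} spindles")
```

All three layouts yield 72GB of usable space, enough for the 70GB requirement, so the difference among them is the number of disks that share the I/O load.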

Furthermore, NT clustering doesn't protect your system from failures in the shared storage subsystem that holds the IS. So take special precautions against failures on the storage subsystem. Technologies providing redundancy to the storage subsystem, such as a mirrored disk controller cache (a cache battery is a must), solutions using a dual connection path to the storage system (i.e., dual cabling and dual controllers), and other similar techniques are worth evaluating.

Choose servers and disk subsystems from the Microsoft Hardware Compatibility List (HCL) for clusters at http://microsoft.com/hcl. Only Microsoft-supported solutions are tested as a complete system, including two computers, a network, and disk subsystems. Make sure that your hardware supplier has access to personnel with clustering experience who can install your server and disk hardware.

Preinstallation Checklist
Before you begin the installation, make sure the following tasks are complete:

  • Check that the hardware has been properly installed. Consider using the Cluster Verification utility shipped in Supplement 4 of the Microsoft Windows NT 4.0 Resource Kit to test the hardware and SCSI configuration.
  • Ensure that you have the latest supported drivers and support software for your hardware.
  • If necessary, update the BIOS and firmware on machines, controllers, and disk systems. Talk with your reseller or hardware vendor to be sure that all hardware components support NT and clustering.
  • Obtain five static IP addresses from your network administrator. The two IP addresses for private communications must be on a separate subnet. The network administrator has the option of reusing these two IP addresses for other NT clusters because no routing exists between the private and the public network.
  • Decide the node and virtual names and register them in DNS. (You'll need at least four node names—two for the physical nodes, one for the cluster alias, and one for the Exchange alias.)
  • Create the cluster service account in the domain; MSCS will give the account the correct privileges.
  • Make sure you have available the media for NTS/E and Exchange 5.5/E and their service packs. (You need at least NT 4.0 SP5 and Exchange Server 5.5 SP2.)
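
The addressing rule in this checklist (private addresses on their own subnet) can be sanity-checked before installation. The sketch below uses Python's standard ipaddress module; the function name, the addresses, and the /24 prefix lengths are made-up examples.

```python
import ipaddress


def private_link_problems(node_addresses, private_net, public_net):
    """Return a list of problems with the private-network addressing."""
    private = ipaddress.ip_network(private_net)
    public = ipaddress.ip_network(public_net)
    problems = []
    if private.overlaps(public):
        problems.append(f"{private} overlaps the public network {public}")
    for text in node_addresses:
        address = ipaddress.ip_address(text)
        if address not in private:
            problems.append(f"{address} is outside {private}")
        if address in public:
            problems.append(f"{address} falls inside the public network {public}")
    if len(set(node_addresses)) != len(node_addresses):
        problems.append("the two nodes must not share a private address")
    return problems


# Hypothetical values for illustration only.
print(private_link_problems(["10.10.10.1", "10.10.10.2"],
                            private_net="10.10.10.0/24",
                            public_net="192.168.10.0/24"))
```

An empty list means the two private addresses sit on a subnet that is separate from the public one.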

Installing NTS/E
Installing NTS/E is straightforward. You just need more patience than usual because of the many reboots required. One key point is to follow the installation order: two NT systems must not have access to a shared-disk system at the same time until you've installed MSCS on at least one of the nodes. Start the NT Server installation as usual, but when you're setting up the network cards, tell the NT setup process to locate both cards. Later, when you review the bindings, select all adapters from the Show Bindings for drop-down list, as Screen 1 shows. Then, locate the network card used for private communications, and disable WINS. Also, rearrange the priority so that NT uses the public network first.

When you're using a crossover cable as a private network between two nodes in a cluster, you must disable autodetect/configure on the Network Adapters tab and set the speed and duplex mode manually (e.g., to 100Mbit and half duplex).

When the NT installation finishes, it might tell you that one or more minor errors occurred, even though no error occurred. When the prompt asks you to install SP3, you can instead install your currently supported service pack. I recommend SP5 or later because of the many corrections to MSCS in SP4 and SP5.

Configuring the Disk System
Install any vendor-specific utilities for supporting the external disk system in a cluster (e.g., Compaq's StorageWorks Command Console) and configure the disks as appropriate. Then format (in NTFS) your shared disks as extended partitions and manually assign all your drive letters.

You must format your shared disks as extended partitions because of how NT 4.0 decides which disk is Disk 0. NT 4.0 assigns Disk 0 to the first SCSI controller on the primary PCI bus with the lowest PCI bus number. NT assigns the C drive to a disk that is either active or formatted as a primary partition. This configuration is usually fine when you initially install the system. However, when you later perform a total disaster recovery of a server with externally attached disks formatted as primary partitions, NT might insist on naming one of these external drives as the C drive. See the white paper "PCI Bus Numbering in a Microsoft Windows NT Environment" (http://www.compaq.com/support/techpubs/whitepapers/ecg0240298.html) for further explanation.

Configuring NTS/E
In Control Panel, System, Startup, configure the nodes in the cluster to have different system startup times (e.g., 5 seconds and 35 seconds). This configuration avoids conflicts over which server owns the quorum log, which can result in lost delayed-write errors in NT Event Viewer. The quorum log is the log for the quorum disk, a shared disk that manages the cluster.

If you have more than 256MB of RAM in your system, you need to change the size of the pagefile to at least the size of your physical memory, plus 12MB. Pagefile size defaults to about 265MB, even when you install NT on a server with 1GB of RAM. Be sure that you place the pagefile on the local hard disk.
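
As a quick worked example of this sizing rule, the following sketch computes the minimum pagefile size for two sample memory configurations. The function name and the sample RAM sizes are illustrative.

```python
def minimum_pagefile_mb(physical_ram_mb):
    """Apply the rule above: at least physical RAM plus 12MB."""
    return physical_ram_mb + 12


for ram_mb in (512, 1024):
    print(f"{ram_mb}MB RAM -> pagefile of at least {minimum_pagefile_mb(ram_mb)}MB")
```

For a server with 1GB (1024MB) of RAM, the rule gives a pagefile of at least 1036MB, roughly four times the default size.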

Installing MSCS
To install MSCS, start NT Enterprise Edition Installer from Programs, Administrative Tools and select Install Microsoft Cluster Server. Form a new cluster, and type the name of the cluster alias. Type the cluster service account, and add the service account to the local administrator group. Select the drive for permanent cluster files, which is another expression for quorum disk.

Choose your network adapters for external (public) and internal (private) use, and prioritize the use for internal communications (i.e., choose private as primary and public as secondary). Finally, enter the IP address for the cluster alias. When the installation finishes, reinstall the NT service packs.

After the server restarts, log on and use the Cluster Administrator (from Programs, Administrative Tools or by typing cluadmin.exe at the command line) to check that all resources are available. Check the event log for any errors.

If everything is working in your one-node cluster configuration, you can install node 2. Leave node 1 running during this phase. Install NT as described earlier, but don't use NT Disk Administrator or attempt to access the shared disk drives at any time.

When the NT installation finishes, run Enterprise Edition Installer again, but this time choose Join an existing cluster. Type only the name of the cluster alias and the service account password; the program will obtain the omitted information from node 1. Install the service packs, and reboot.

Once again, check your configuration from Cluster Administrator, and check the event log for errors. Move all resources to node 1, and restart node 2. When node 2 is back online, check the Event Viewer on node 1 for errors. Timeouts or errors related to disk controllers signal a hardware-related problem (e.g., SCSI termination or SCSI resets). In addition, check that you can ping the cluster alias from the network. Also change the quorum log size to 4096KB by right-clicking the Exchange cluster alias and selecting Properties. On the Quorum tab, enter 4096 in the Reset quorum log at field, as Screen 2 shows.

Configuring the Resources for Exchange
Rename the default Disk group 1 (e.g., to Exchange), and move the disks you're going to use for Exchange to this group. Don't use the cluster group for anything except MSCS-related services (i.e., cluster IP address, cluster name, time service, and perhaps a script to synchronize time with the domain). Right-click the Exchange group, and create the IP address resource and network name resource for Exchange in the group. Make the network name resource dependent on the IP address. Bring the Exchange group online, and test that clients can reach the network name. Also, try to do a failover of the group and test again.

When failing over IP addresses, MSCS issues an Internet Engineering Task Force (IETF) standard Address Resolution Protocol (ARP) request, which updates routers and clients on the local subnet. If you have problems reaching the Exchange alias after a failover, check that all switches, routers, and clients are fully compliant with IETF Request for Comments (RFC) 826. Also, check for Virtual LAN (VLAN) switches that cache media access control (MAC) addresses but fail to update the MAC address when failing over an IP address.
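
To see what switches and clients must process after a failover, it can help to look at the frame layout. The sketch below packs an RFC 826 ARP request in its gratuitous form (sender and target protocol address both set to the address being announced) with Python's struct module. That MSCS sends exactly this form is an assumption here, and the MAC and IP values in the usage note are made up. The code only builds the bytes; it doesn't send anything on the network.

```python
import socket
import struct


def gratuitous_arp_frame(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request for ip."""
    sender_mac = bytes.fromhex(mac.replace(":", ""))
    sender_ip = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + sender_mac + struct.pack("!H", 0x0806)

    # ARP payload laid out as RFC 826 describes for Ethernet and IPv4.
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode: request
        sender_mac,   # sender hardware address (node that now owns the IP)
        sender_ip,    # sender protocol address
        b"\x00" * 6,  # target hardware address (unknown in a request)
        sender_ip,    # target protocol address equals the sender's
    )
    return ethernet + arp
```

Calling `gratuitous_arp_frame("00:11:22:33:44:55", "192.168.10.21")` returns a 42-byte frame: a 14-byte Ethernet header followed by a 28-byte ARP payload. A device that caches the old MAC address and ignores the sender fields of such a broadcast will keep sending traffic to the failed node.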

Installing Exchange Server
Installing Exchange on a cluster is similar to installing Exchange on a standard server. Start the installation from the node that owns the Exchange group. The setup process detects that it's installing on a cluster, and you must provide the process with the name of the cluster group that Exchange will use. The rest of the process follows the standard Exchange installation procedure. If Exchange doesn't recognize your cluster resources, see the Microsoft article "XADM: Setup Does Not Detect Cluster Resources Properly" (http://support.microsoft.com/support/kb/articles/q184/8/80.asp) for troubleshooting suggestions.

Next, install Exchange service packs and proceed to the inactive node. Start the Exchange installation, select Update node, install service packs, and you've finished the installation. Now, you must start testing and documenting failure handling, disaster recovery, and operating procedures.

Migrating from a Standalone Server to a Cluster
Although you can upgrade NT 4.0 to NTS/E, you can't upgrade an Exchange server to a clustered Exchange server. However, you can install the Exchange cluster with a new Exchange server name into the same site as the existing server. Move the users from the old server to the new server, and then shut down the old server. (If this server was the first server you installed in the site, you must move the site folders before shutting the server down permanently. The Microsoft article "XADM: How to Remove the First Exchange Server in a Site" at http://support.microsoft.com/support/kb/articles/q152/9/59.asp explains the process.)

To ensure that your Outlook users know which server to use when they log on for the first time after the upgrade, create a network name alias that has the old network name but is bound to the IP address of the Exchange alias. When the clients connect to the old server name, they actually connect to the new Exchange server and automatically receive the information about the new server that holds their mailbox.

Troubleshooting
MSCS doesn't log much information in the Event Viewer. Exchange Server can fail over to another node without MSCS logging a single event that explains what happened and why the resources failed. However, you'll see Exchange Server startup messages in the Application log.

You can enable cluster logging by adding two variables to the NT system environment variables list. These variables are ClusterLog, which you give a value of path\cluster.log (create this path before enabling this variable), and ClusterLogLevel, which you give a value of 0 to 3, where 0 means log nothing and 3 means log everything.
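
The two values can be summarized in a small sketch that validates the log level and composes the log path. The function name and the C:\ClusterLogs directory are hypothetical; the code only prints the settings and doesn't change the system environment, which you still set through Control Panel.

```python
def cluster_log_settings(log_dir, level):
    """Return the two system environment variables that enable cluster logging."""
    if level not in range(0, 4):
        raise ValueError("ClusterLogLevel must be 0 (nothing) through 3 (everything)")
    return {
        "ClusterLog": log_dir.rstrip("\\") + "\\cluster.log",
        "ClusterLogLevel": str(level),
    }


# Hypothetical directory; create it before you enable the variable.
settings = cluster_log_settings("C:\\ClusterLogs", 3)
for name, value in settings.items():
    print(f"{name}={value}")
```

With these sample inputs, the sketch prints ClusterLog=C:\ClusterLogs\cluster.log and ClusterLogLevel=3.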

In addition, you need to take special actions in certain circumstances (e.g., when the quorum disk has been corrupted or when you have to run CHKDSK on a shared disk). I recommend that you read the "MSCS Troubleshooting and Maintenance" white paper carefully for solutions to these kinds of problems.

Windows 2000 Enhancements
Microsoft has integrated the cluster setup with Windows 2000 (Win2K) setup, thereby simplifying the setup and notably reducing the need for reboots. In addition, Win2K incorporates some long-awaited features, such as better logging, support for Dfs root shares, DHCP/WINS failover, and network awareness. For example, NT 4.0 MSCS doesn't detect a failure in link state on the network card connected to the public network, but Win2K does. Win2K also introduces the Cluster Automation Server (CAS), which provides a set of COM management objects that you can use in scripts. You can also install and use a less feature-rich version of CAS on NT 4.0. Microsoft article "INFO: Where to Obtain Cluster Automation Server" (http://support.microsoft.com/support/kb/articles/q245/6/56.asp) explains how to obtain CAS.

Achieving 99.9 Percent Uptime
As I've emphasized before, planning and testing are essential parts of a cluster project. Therefore, you must perform lab work to help you guarantee a highly available Exchange cluster. Try to test every possible error and situation that you can think of before going to production, and document how you fix the errors that you encounter.

The reliability of an Exchange cluster depends on the health of your shared disks. No amount of clustering will protect Exchange against a failure of the disks that hold the IS, Directory Store, or transaction logs. Take careful steps to protect these disks against failure.

An Exchange cluster is only part of the strategy you need to adopt to maintain high uptime. Other important parts include installing air-conditioning, having a UPS, ensuring physical security, educating systems administrators, and documenting operating procedures.