Clustering seems to be a great idea—it promises superior availability and reliability for applications such as Exchange Server. However, I often hear administrators saying that they're "fighting the cluster monster." Does Exchange clustering truly work? To answer that question, we need to dig into the mechanics of clustering and better understand what it can and can't do.

First, it helps to understand some basic Windows clustering principles. Each machine, or node, in a cluster can be in one of two states, active or passive, with respect to the cluster's application. For example, a two-node Exchange configuration can have either one active node and one passive node or two active nodes. Passive nodes stand ready to accept work from active nodes when that work fails over. The whole point of clustering is to be able to fail over users from one node to another without the users noticing. To that end, the Exchange and Outlook teams have devoted a great deal of effort to making Exchange fully able to use Windows' clustering features.
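To make the active/passive idea concrete, here is a toy Python model of failover. It is only an illustration of the concept, not the Windows Cluster service or any Exchange API. The `Node` class, the `fail_over` function, and the node and workload names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Workloads (for example, Exchange virtual servers) this node currently hosts.
    workloads: list = field(default_factory=list)

    @property
    def state(self):
        # In this toy model, a node is "active" if it hosts any workload.
        return "active" if self.workloads else "passive"


def fail_over(failed, nodes):
    """Move every workload from the failed node to the least-loaded survivor.

    A passive node hosts nothing, so it is always preferred when one exists.
    """
    survivors = [n for n in nodes if n is not failed]
    if not survivors:
        raise RuntimeError("no surviving node to accept the workload")
    target = min(survivors, key=lambda n: len(n.workloads))
    target.workloads.extend(failed.workloads)
    failed.workloads.clear()
    return target


# Hypothetical two-node active/passive cluster.
node_a = Node("NODE-A", ["mailbox-store-1"])
node_b = Node("NODE-B")

new_owner = fail_over(node_a, [node_a, node_b])
print(new_owner.name, new_owner.state)
```

After the call, the formerly passive node hosts the workload and the failed node hosts nothing, which is the handoff users are not supposed to notice.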

The nodes in a cluster share access to a set of storage devices through shared SCSI or Fibre Channel. You can divide the storage into logical volumes, each of which can be owned by one node at a time. The cluster uses a special volume called the quorum disk to log cluster configuration information and changes. The Windows cluster software can replay these changes when an offline node comes back online. The nodes in an Exchange cluster don't share data—two nodes can't write to the same mailbox database at the same time. Therefore, you must configure storage to give each node independent access to the Exchange stores and log files.
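The two storage rules above (one owner per volume at a time, and a quorum log that a returning node replays) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how the Windows cluster software is actually implemented. The drive letters, node names, and the `assign` and `catch_up` helpers are hypothetical.

```python
# Shared cluster state: who owns each volume, plus an ordered change log
# standing in for the quorum disk.
volume_owner = {}
quorum_log = []


def assign(volume, node):
    """Give a volume to exactly one node and record the change."""
    volume_owner[volume] = node  # replaces any previous owner
    quorum_log.append((volume, node))


def catch_up(local_view, last_seen):
    """Replay the logged changes a node missed while it was offline."""
    for volume, owner in quorum_log[last_seen:]:
        local_view[volume] = owner
    return len(quorum_log)  # new position in the log


assign("E:", "NODE-A")
assign("F:", "NODE-B")

# NODE-B's view of the configuration just before it goes offline.
view_b = dict(volume_owner)
last_seen = len(quorum_log)

# While NODE-B is down, its volume moves to NODE-A.
assign("F:", "NODE-A")

# NODE-B returns and replays what it missed.
last_seen = catch_up(view_b, last_seen)
print(view_b)
```

Because ownership is a single value per volume, two nodes can never hold the same mailbox database at once in this model, which mirrors the no-shared-data rule described above.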

Administrators often have several complaints against clustering in general and against clustering Exchange in particular. The first is that clustering is expensive. No argument there: A low-end two-node cluster will set you back at least $15,000 for the two nodes and a storage unit (plus more for disks). You also have fewer hardware choices: You're restricted to systems listed in Microsoft's cluster Hardware Compatibility List (HCL). Don't even think about scrimping by clustering (possibly less expensive) hardware that isn't on this list. Although other systems might work, Microsoft won't support them and you'll be sorry in the end. (You can find the HCL at the URL below.)

The second objection is that clustering can actually increase downtime instead of reducing it. This complaint is also true, but its truth stems from a basic fact that has nothing to do with clustering: The most common point of failure is the administrator! Making a dumb mistake on one system is one thing. Making the same dumb mistake on a complex multinode cluster that supports thousands of users certainly increases the odds that the mistake will cause some damage. Everyone who has administrative or physical access to your Exchange clusters must understand how clusters work and how they differ from nonclustered systems. Although this necessity might require extra training, the increased uptime you'll gain from a properly administered cluster will more than repay the extra time and cost.

Aside from these general complaints, Exchange 2000 Server specifically comes under fire for an architectural decision that limits Exchange active/active clusters to a maximum of about 1900 Messaging API (MAPI) users per node. Active/passive clusters, however, have no such limitation (nor do systems that don't host MAPI clients), so Microsoft recommends using active/passive clusters. The anticlustering crowd asks how you can justify spending double the money for a two-node cluster that provides only one active node. The answer is simple: Clustering still provides a terrific way to perform rolling upgrades or maintenance, planned or unplanned, without interrupting users' work. Properly designed and maintained, Exchange clusters will indeed deliver the increased uptime that Microsoft promises. And although the recommended configuration for an Exchange Server 2003 (formerly code-named Titanium) cluster is still an active/passive "N+1" or "hot-standby" setup, an Exchange 2003 cluster (running on Windows Server 2003) can support as many as eight nodes, only one of which must be reserved as a failover target.
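The "double the money" objection weakens as the cluster grows, and a little arithmetic shows why. The sketch below computes the share of nodes held in reserve, assuming every node costs roughly the same. The `standby_overhead` function is a hypothetical helper written for this illustration.

```python
def standby_overhead(total_nodes, passive_nodes=1):
    """Fraction of the cluster's nodes reserved as failover targets."""
    if not 0 < passive_nodes < total_nodes:
        raise ValueError("need at least one active and one passive node")
    return passive_nodes / total_nodes


# Two-node active/passive cluster: half the nodes sit idle.
print(standby_overhead(2))

# Eight-node N+1 cluster: one node in eight sits idle.
print(standby_overhead(8))
```

With two nodes, half the hardware is standby. With eight nodes and a single failover target, the standby share drops to one-eighth.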

Next week, I'll delve into some design principles that you need to know to build a cost-effective Exchange cluster. In the meantime, I'm going to battle the real monster: the temptation to ignore my column deadlines in favor of the Xbox in my workroom.

Microsoft HCL