Improve the reliability of your messaging system

I have good news and bad news about Microsoft Exchange 2000 Server's clustering capabilities. The good news is that clustering in Exchange 2000 works better than it does in Exchange Server 5.5, and it's less expensive to implement. The bad news is that you still need to be aware of the technology's limitations. To decide whether clustering makes sense for your environment, you need to know what clustering is.

Understanding Exchange's clustering capabilities will help you decide whether clustering meets your business requirements for increased availability and maintainability.

Clustering Basics
Clustering has a distinct lingo and cool buzzwords. A word you've probably heard is failover. A failover occurs when a service running on a clustered node fails. The essence of the failover operation is that the failed service automatically restarts on a functional node in the cluster—without requiring you to do anything (and ideally without users noticing). Failback is the opposite of failover: When the failed node returns to service, the services that failed over to another node go back to the node they came from.
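To make the two terms concrete, here is a minimal sketch of failover and failback as a toy Python model. It is not how the Windows cluster service is implemented; the `Cluster` class, its methods, and the node and service names are all hypothetical and exist only for illustration.

```python
# Toy model of failover and failback. This is an illustration only and
# does not call any real clustering API; all names are hypothetical.
class Cluster:
    def __init__(self, nodes):
        self.up = {node: True for node in nodes}  # node -> is it functional?
        self.home = {}    # service -> node it normally runs on
        self.owner = {}   # service -> node currently running it

    def add_service(self, service, node):
        self.home[service] = node
        self.owner[service] = node

    def fail_node(self, node):
        """Failover: restart the failed node's services on a functional node."""
        self.up[node] = False
        survivors = [n for n, ok in self.up.items() if ok]
        if not survivors:
            raise RuntimeError("no functional node left in the cluster")
        for service, owner in list(self.owner.items()):
            if owner == node:
                self.owner[service] = survivors[0]

    def restore_node(self, node):
        """Failback: move services back to the node they came from."""
        self.up[node] = True
        for service, home in list(self.home.items()):
            if home == node:
                self.owner[service] = node


cluster = Cluster(["NODE1", "NODE2"])
cluster.add_service("Information Store", "NODE1")
cluster.fail_node("NODE1")      # the service now runs on NODE2
cluster.restore_node("NODE1")   # the service returns to NODE1
```

The sketch always picks the first surviving node as the failover target; a real cluster chooses the target according to its own configuration.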

Other key terms to understand are the words that describe which nodes in a cluster are doing the work. Ideally, every node in the cluster is doing something; in Exchange, you want every node to simultaneously handle Exchange clients. Microsoft refers to such clusters as active/active—two nodes actively handling different clients. Active/active clustering stands in contrast with the Exchange 5.5 clustering model of one node talking to clients and the other node waiting for the first node to fail. This model is called active/passive clustering. If you have more than two nodes in a cluster and one of them is quietly awaiting a failure on another node, you have an N+1 node cluster.
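The naming rules in the previous paragraph can be written down as a small function. This is only a sketch of the terminology as this article uses it; the function name `cluster_model` and the "other" fallback for layouts the article doesn't name are my own assumptions.

```python
def cluster_model(active_nodes, passive_nodes):
    """Name a cluster layout using the terms defined in this article."""
    total = active_nodes + passive_nodes
    if total < 2 or active_nodes < 1:
        raise ValueError("a cluster needs at least two nodes, one of them active")
    if passive_nodes == 0:
        return "active/active"      # every node handles clients
    if total == 2:
        return "active/passive"     # one node works, one waits
    if passive_nodes == 1:
        return "N+1"                # more than two nodes, one waiting
    return "other"                  # layouts this article does not name


print(cluster_model(2, 0))
print(cluster_model(1, 1))
print(cluster_model(3, 1))
```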

Consider how users see a cluster—as a separate machine on the network. The cluster appears this way because clusters are built out of resources. A resource can be physical (e.g., a disk) or virtual (e.g., an IP address). If you throw together the right set of resources (e.g., an IP address, a NetBIOS name, and some Exchange services), you create an Exchange virtual server, which is an instance of Exchange that appears as a separate physical machine—even though it's not. Users can connect to this virtual server without regard to the server's physical location.
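One way to picture a virtual server is as a bundle of resources plus a note about which physical node currently hosts it. The sketch below assumes hypothetical names throughout (`VirtualServer`, `client_target`, `EXVS1`, the IP address, the node names); it models the idea and is not an Exchange or Windows API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VirtualServer:
    network_name: str      # the NetBIOS name that clients connect to
    ip_address: str        # a virtual resource
    services: List[str]    # Exchange services in the group
    hosted_on: str         # physical node currently running the group


def client_target(vs: VirtualServer) -> str:
    """What a client uses to connect; note that hosted_on plays no part."""
    return f"{vs.network_name} ({vs.ip_address})"


vs = VirtualServer("EXVS1", "10.0.0.50",
                   ["System Attendant", "Information Store"], "NODE1")
before = client_target(vs)
vs.hosted_on = "NODE2"     # the virtual server moves to another box
after = client_target(vs)  # the client-facing identity is unchanged
```

Because the client only ever sees the network name and IP address, moving the group to another node changes nothing from the user's point of view.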

How many nodes make a cluster? Most Exchange administrators believe that clusters always contain two nodes. However, clusters on other OSs—and even on Windows, with non-Microsoft cluster software—can contain 16, 32, 64, or more nodes. Windows 2000 Server doesn't support clustering. Win2K Advanced Server supports two-node clusters, and Win2K Datacenter Server supports four-node clusters. Count on larger clusters from Microsoft in the future. Of course, Exchange supports as many nodes as the underlying clustering software does—as many as four if you run Exchange 2000 Service Pack 1 (SP1) or later on Datacenter.
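The node limits above fit in a small lookup table, which can be handy when you sanity-check a design. The sketch below encodes only the figures given in this article; the names `MAX_CLUSTER_NODES` and `can_build_cluster` are hypothetical, and the function doesn't check the Exchange service pack level that the four-node case also requires.

```python
# Maximum cluster size per Windows 2000 edition, as stated in this article.
MAX_CLUSTER_NODES = {
    "Windows 2000 Server": 0,             # no clustering support
    "Windows 2000 Advanced Server": 2,
    "Windows 2000 Datacenter Server": 4,  # Exchange needs SP1 or later for four
}


def can_build_cluster(edition, nodes):
    """Return True if the edition supports a cluster with this many nodes."""
    limit = MAX_CLUSTER_NODES[edition]
    return 2 <= nodes <= limit


print(can_build_cluster("Windows 2000 Advanced Server", 2))
print(can_build_cluster("Windows 2000 Advanced Server", 4))
```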

Clustering and Exchange 2000
How does Exchange 2000 use clustering? The basic unit of failover for Exchange is the storage group (SG). When a node fails, all its SGs fail over to another node. This failover mechanism represents an interesting change from Exchange 5.5 clustering, in which the failover unit is a service, such as the Information Store (IS) or Message Transfer Agent (MTA). Exchange 2000's use of an SG as the failover unit simplifies the failover process: The IS on the receiving node simply needs to mount the SG and its databases—instead of requiring you to start the store and wait for all the logs to play back.

Two DLLs form the basis of Exchange 2000's cluster support. Excluadm.dll ties Exchange to the Windows cluster manager, and exres.dll ties the Exchange services and resources to the cluster service's resource manager. Of course, much more is going on beneath the surface. Each cluster-ready Exchange 2000 component must use the proper APIs and cluster interfaces. Notice my use of the term "cluster-ready." Not all Exchange 2000 services can benefit from clustering. The System Attendant, IS, Routing service, and SMTP service are all cluster-ready. The MTA is cluster-ready but only in active/passive mode; if the MTA fails on one node, you must restart it from scratch on the other node. Services that you can't cluster include the Network News Transfer Protocol (NNTP) server, the Instant Messaging service, the Active Directory Connector (ADC), the chat service, and the Key Management Service (KMS). When you're designing your clustering strategy, keep in mind that these services won't move to another node, so a failover can still leave you with lost functionality or capacity.
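When you plan a deployment, it can help to check your intended service list against the cluster-readiness information above. The following sketch records only what this article states; the dictionary, the status labels, and the `unclusterable` function are hypothetical names of my own, not part of any Exchange tool.

```python
# Cluster support for Exchange 2000 services, as described in this article.
CLUSTER_SUPPORT = {
    "System Attendant": "cluster-ready",
    "Information Store": "cluster-ready",
    "Routing service": "cluster-ready",
    "SMTP service": "cluster-ready",
    "MTA": "active/passive only",
    "NNTP": "not clusterable",
    "Instant Messaging": "not clusterable",
    "Active Directory Connector": "not clusterable",
    "Chat": "not clusterable",
    "Key Management Service": "not clusterable",
}


def unclusterable(planned_services):
    """Return the planned services that cannot run in a cluster."""
    return [s for s in planned_services
            if CLUSTER_SUPPORT[s] == "not clusterable"]


plan = ["Information Store", "SMTP service", "NNTP", "Key Management Service"]
print(unclusterable(plan))
```

Any service the function returns needs a home outside the cluster, or a plan for doing without it.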

Practical Considerations
The most obvious benefit of clustering is that it can provide better service by minimizing the effect of failures. Users connect to a virtual server, through a Messaging API (MAPI) profile or an Internet protocol client, rather than to a particular physical machine. When the underlying physical server goes offline, the client reconnects to the virtual server, now running on another box, and keeps working.

A second, less obvious benefit of clustering is that it lets you perform maintenance whenever you want. Consider the process of installing an Exchange service pack or updating your antivirus software. Without clustering, you must take down a production server, which means you need to perform the upgrade on Christmas Day (or another day when users won't scream about inaccessible email) or hurry through the process and hope that nothing goes wrong. By using Exchange clustering, you simply fail the Exchange virtual server over to another node and go about your business. Users continue to work as usual. After you finish your maintenance, you fail the virtual server back to its original hardware.

Clustering has a few limitations that you need to be aware of as you plan. Originally, Microsoft didn't specify any firm limits for the number of concurrent users that active/active clusters can support, so administrators put as many users on each server as would fit. But if each node of a two-node cluster is already serving the 2,000 concurrent users it can handle, the surviving node ends up with 4,000 concurrent users when a failover occurs, which is not a recipe for continued Exchange server availability. To help solve this problem (which is exacerbated by some internal Exchange architecture considerations), Microsoft now recommends that you use N+1 clusters (i.e., active/passive clusters for two-node setups) with a maximum of 1,500 concurrent users per node. Using an N+1 design will also prepare you for future releases of Exchange components that might include improved clustering features.
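The arithmetic behind this recommendation is easy to check for your own design. The sketch below computes the heaviest load any single node could carry after one node fails. It assumes, for simplicity, that all of a failed node's users move together to the least-loaded surviving node; the function name `worst_case_load` is hypothetical.

```python
def worst_case_load(loads):
    """Heaviest per-node load after any single node fails.

    loads: concurrent users on each node (0 for a passive node).
    Assumption: a failed node's users all move to the least-loaded survivor.
    """
    if len(loads) < 2:
        raise ValueError("a cluster needs at least two nodes")
    worst = 0
    for failed, moved in enumerate(loads):
        survivors = [load for i, load in enumerate(loads) if i != failed]
        target = survivors.index(min(survivors))
        survivors[target] += moved
        worst = max(worst, max(survivors))
    return worst


print(worst_case_load([2000, 2000]))           # two-node active/active
print(worst_case_load([1500, 0]))              # two-node active/passive
print(worst_case_load([1500, 1500, 1500, 0]))  # four-node N+1
```

With the figures from this article, the active/active pair peaks at 4000 users on one node, whereas both N+1 layouts never exceed 1500.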

Clustering isn't a panacea. The primary cause of cluster failures isn't hardware or software—it's people. Clustering won't solve poor operational practices, such as failing to keep good backups, and it won't protect you from failures in your infrastructure, such as loss of power or Internet connectivity. But if you understand the underlying technologies and clustering's limitations, clustering can provide a more reliable Exchange experience for you and your users.