It seems like every time Microsoft releases a new product, we all have to learn a long list of new acronyms. This is certainly true with Exchange Server 2007, which brings a slew of new features, some with jawbreaking names, such as WebReady Document Viewing, that cry out for acronyms. One new Exchange acronym that's drawing a lot of interest is CCR, the cluster continuous replication technology that lets clustered mailbox servers operate with no shared storage subsystem. It turns out that CCR is based on another, lesser-known technology: majority node set, or MNS, clustering.
If you've worked with clusters before, you know how important the integrity of the quorum resource is. The quorum is essentially a configuration database for the cluster. Each node in the cluster needs the ability to take ownership of the quorum and thus control what configuration data is changed and when. In a conventional Microsoft Cluster Server (MSCS) cluster, there's one quorum, owned by one node at a time. If the quorum resource is lost, or if a node can't access it, a variety of bad things can happen, including split-brain syndrome, where each of the remaining cluster nodes thinks it is the quorum owner.
MNS works differently from standard clustering because a copy of the quorum database is kept on each individual node. A change to the quorum is considered permanent only if it can be verified as committed on a majority (more than half) of the MNS nodes. For example, in a four-node cluster, a change to the quorum will only be accepted if three of the nodes verify that the change was made to their local quorums.
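If it helps to see the arithmetic, here's a minimal sketch of the majority rule in Python. The function names are mine, chosen for illustration; they aren't part of Windows clustering or Exchange, and the real cluster service does a good deal more than count.

```python
def majority(total_voters: int) -> int:
    """Smallest number of voters that is strictly more than half."""
    return total_voters // 2 + 1


def change_is_committed(acks: int, total_voters: int) -> bool:
    """A quorum change counts only if a majority of voters confirmed it."""
    return acks >= majority(total_voters)


if __name__ == "__main__":
    # For each cluster size, show the majority needed and how many
    # voters can be lost while a majority is still reachable.
    for nodes in (2, 3, 4, 5):
        needed = majority(nodes)
        print(f"{nodes} voters: need {needed}, can lose {nodes - needed}")
```

Note that a four-node cluster tolerates the loss of only one node, the same as a three-node cluster, and a two-node cluster can't lose any.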
CCR is based on MNS, but its implementation is a bit different from standard MNS clusters. MNS requires more than two voting members: with only two, the majority is two, so losing either node would cost the cluster its quorum. It turns out that you can implement CCR in two ways: using a three-node MNS cluster, or using two nodes and a third, uninvolved machine that acts as a kind of auxiliary quorum member. This machine is said to hold the file share witness (FSW) role (another acronym to learn!). The FSW is a new feature introduced as a hotfix after Windows Server 2003 SP1; it essentially lets the quorum resource be copied to a computer that isn't part of a cluster, such as a Hub Transport server in the same site as a CCR cluster. The cluster nodes can update, and read from, the FSW and use it as a third "vote" for getting and setting properties of the cluster.
What does this mean for CCR implementation? MNS is a commonly used technology for larger Exchange Server clusters, but CCR is limited to one active and one passive node. You can certainly build an MNS cluster using one active and two passive nodes, but there's not really much point in doing so for Exchange CCR (although Microsoft supports doing so). The third node will essentially always remain passive, wasting a perfectly good piece of hardware.
Instead, plan on using an FSW on another machine. The FSW role is low-impact and can be configured on any computer, not just another Exchange server. If you're planning a geographically distributed CCR cluster, Microsoft recommends that you put the FSW in the same physical site as the node that will ordinarily be active. By doing so, you prevent a network failure between the two sites from shutting down the cluster. However, this configuration means that a failure of the primary site will require manual failover, because the passive node won't be able to contact the FSW to reconstitute the cluster. There's a solution, which involves using a third site to host the FSW. I'll talk more about that next week.
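To make the placement trade-off concrete, here's a rough Python sketch that counts the votes each node can see under the failures described above. It's a toy model under my own assumptions: three voters (two nodes plus the FSW), each voter either reachable or not, and hypothetical site names "primary", "secondary", and "third". It doesn't model how the cluster service actually arbitrates.

```python
# Where each voter lives; both layouts use the same two nodes plus an FSW.
FSW_IN_PRIMARY = {"node_a": "primary", "node_b": "secondary", "fsw": "primary"}
FSW_IN_THIRD = {"node_a": "primary", "node_b": "secondary", "fsw": "third"}


def votes_visible_to(node, placement, failed_sites=frozenset(), cut_links=frozenset()):
    """Count the voters that `node` can reach, including itself."""
    my_site = placement[node]
    if my_site in failed_sites:
        return 0  # the node itself is down
    count = 0
    for site in placement.values():
        if site in failed_sites:
            continue  # that voter is down
        if site != my_site and frozenset((my_site, site)) in cut_links:
            continue  # that voter is up but unreachable from here
        count += 1
    return count


def can_run(node, placement, **failures):
    """True if `node` sees a majority of all voters."""
    return votes_visible_to(node, placement, **failures) >= len(placement) // 2 + 1


if __name__ == "__main__":
    wan_cut = {"cut_links": {frozenset(("primary", "secondary"))}}
    site_down = {"failed_sites": {"primary"}}
    print("WAN cut, FSW in primary, node_a:", can_run("node_a", FSW_IN_PRIMARY, **wan_cut))
    print("WAN cut, FSW in primary, node_b:", can_run("node_b", FSW_IN_PRIMARY, **wan_cut))
    print("Primary down, FSW in primary, node_b:", can_run("node_b", FSW_IN_PRIMARY, **site_down))
    print("Primary down, FSW in third, node_b:", can_run("node_b", FSW_IN_THIRD, **site_down))
```

In this model, a WAN cut leaves the active node with two of three votes, so it keeps running, while losing the primary site leaves the passive node with one vote unless the FSW sits in a third site.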


CCR (an Exchange feature) is great because the data is duplicated in near real time. File Share MNS (a Windows feature) is also great because you can eliminate the cost of the shared storage infrastructure.
However, I think the Exchange team made a big mistake tying CCR to MNS. As long as you have a valid Windows cluster that will fail over (start the services, transfer the IP, etc.) if the Exchange node fails, why should you care what kind of cluster it is?
Does Internet Explorer care whether I'm connected to the Internet via DSL or T-1? As long as the packets flow, of course not. Well-designed software counts on the layers below without making assumptions.
I have a cluster for an SMB client that's a traditional shared-storage cluster, and I want to use it for Exchange 2007 with CCR (the Exchange data would be stored on large local disks on each node; the shared storage is used for a fault-tolerant DFS share for a large file library). The large file library takes very little CPU/RAM load, so this is a viable combination.
However, the Exchange folks have made CCR a "brittle" feature in that it makes unwarranted assumptions about what type of cluster is in use, so I can't install CCR in my non-MNS cluster configuration. (Actually, with some trickery I installed it and it works fine, but I obviously don't want to put it into production unless it's supported.)
Why oh why are the Exchange folks (via the documentation and the BPA) forcing me to use the inferior SCC fault tolerance just because my cluster has some shared storage? I have the node-specific storage and functional Windows cluster that CCR needs, but they apparently want me to rip out the shared storage for some reason before they'll let me use CCR. Sigh.
Since it actually works, all they have to do to allow CCR fault tolerance on a shared-storage cluster is eliminate the pre-setup check. Am I the only person who cares about this? Any hope it will be fixed in a service pack?
mikejng June 07, 2007