We probably all know that Microsoft Exchange Server 2010 improved the high availability story for messaging environments through the introduction of the database availability group (DAG). This feature provides a simple, out-of-the-box method for Exchange admins to set up multiple copies of Mailbox server databases with replication and automatic failover. Sounds pretty great, and works great, too -- provided you set everything up correctly.
Exchange Server MVP Jim McBee presented a web seminar this week on Exchange high availability titled "Exchange 2010—99.999%" in which he discussed many of the common problems and mistakes he's encountered with setting up DAGs in Exchange 2010. I thought it would be useful to take a look at some of these problems and why McBee sees them as major stumbling blocks for high availability in Exchange environments.
Lack of understanding of quorum. Keep in mind that Exchange DAGs are built on Windows Failover Clustering and rely on its quorum model to decide whether the cluster, and therefore the DAG's databases, can stay online. As McBee said, "A majority of the quorum has to always be up and running. Otherwise, the cluster service cannot permit the databases to continue to operate." Make sure you understand quorum and the voting procedure so you won't be surprised when failures occur. You can find a good description of the DAG quorum model in the Microsoft article "Understanding Database Availability Groups."
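To make the voting arithmetic concrete, here is a minimal Python sketch of the majority rule, not anything Exchange itself runs. It assumes the usual DAG arrangement in which a DAG with an even number of members gives the file share witness a vote and a DAG with an odd number of members does not. The function names are hypothetical and exist only for illustration.

```python
def total_votes(dag_members: int) -> int:
    """Count the voters: every DAG member, plus the file share witness
    when the member count is even (assumed quorum arrangement)."""
    if dag_members < 1:
        raise ValueError("a DAG needs at least one member")
    witness_vote = 1 if dag_members % 2 == 0 else 0
    return dag_members + witness_vote


def votes_needed(dag_members: int) -> int:
    """A majority is strictly more than half of all possible votes."""
    return total_votes(dag_members) // 2 + 1


def has_quorum(dag_members: int, members_up: int, witness_up: bool = True) -> bool:
    """Return True if the surviving voters still form a majority."""
    if not 0 <= members_up <= dag_members:
        raise ValueError("members_up must be between 0 and dag_members")
    uses_witness = dag_members % 2 == 0
    votes = members_up + (1 if uses_witness and witness_up else 0)
    return votes >= votes_needed(dag_members)


# A four-member DAG has five voters (four members plus the witness),
# so it needs three votes to stay up.
four_member_two_down = has_quorum(4, members_up=2, witness_up=True)
four_member_two_down_no_witness = has_quorum(4, members_up=2, witness_up=False)
```

In this simplified model, a four-member DAG that loses two members keeps quorum only while the witness is reachable, which is the kind of surprise the paragraph above warns about.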
Not understanding alternate file share witness. You can designate an alternate file share witness through Exchange Management Shell (EMS). However, it doesn't automatically take over if the primary file share witness becomes unavailable -- switching to the alternate witness is also a manual step that you perform through EMS when it's needed.
Not load balancing effectively. Load balancing is an important aspect of Exchange 2010 deployments so that you maintain effective Client Access server connectivity for users. Beware of single points of failure in your architecture, which could render your DAGs moot. Ken St. Cyr takes an in-depth look at load balancing in "Exchange Server's Client Access: Load-Balancing Your Servers."
Treating a DAG like an active/passive cluster. This one overlaps with load balancing, and the message is to spread active databases across servers to distribute the load. The temptation, perhaps, is to place all active databases on the same server, but there's no need to do so in a DAG configuration. McBee recommends learning about activation preferences: "You can configure the activation preferences of the database so that it's always got a preferred node or a preferred DAG member that it activates on, but you want to make sure that you distribute the load. This maximizes the use of your hardware, and it ensures if either node fails, that only half of the users have to fail over to the alternate server."
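The effect McBee describes can be sketched in a few lines of Python. This is a planning illustration only, under two stated assumptions: every server holds a copy of every database, and a database always activates on its most-preferred surviving server. Real failover decisions in Exchange weigh more than activation preference, such as copy health. The server and database names, and both function names, are hypothetical.

```python
from collections import Counter


def plan_activation_preferences(databases, servers):
    """Rotate the preference order so that first-choice servers are
    spread evenly across the DAG members (round-robin)."""
    if not servers:
        raise ValueError("at least one server is required")
    plan = {}
    for index, database in enumerate(databases):
        start = index % len(servers)
        # Preference 1 is servers[start]; the rest follow in rotation.
        plan[database] = servers[start:] + servers[:start]
    return plan


def active_copies(plan, failed=frozenset()):
    """Count active databases per server, assuming each database mounts
    on its most-preferred server that has not failed."""
    load = Counter()
    for preference_order in plan.values():
        for server in preference_order:
            if server not in failed:
                load[server] += 1
                break
    return load


databases = ["DB1", "DB2", "DB3", "DB4"]
servers = ["MBX1", "MBX2"]

plan = plan_activation_preferences(databases, servers)
normal_load = active_copies(plan)
load_after_failure = active_copies(plan, failed={"MBX1"})
```

With two servers and four databases, each server is first choice for two databases, so losing either server moves only half of the databases rather than all of them.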
Documentation missing. I find this mistake to be perfectly understandable -- but no more acceptable for being so. Exchange admins, and IT pros generally, are very hands-on people, I'm sure. You want to do the work; you want to build a working environment; you want to troubleshoot problems. But when you've done all that, you don't want to sit down and write an instruction manual about what you've done. Nonetheless, this simple step can easily save many hours of labor for you or others later on. As McBee said, "You need good documentation on what the processes are for managing these [systems], for what you would need to do if you ever have to move databases over so you can perform maintenance on a server or take it offline. You need to know exactly what would be necessary in order to take one server offline or to bring it back online."
Failing to complete the upgrade process. This problem might be of particular concern if you're upgrading from Exchange 2003 to Exchange 2010 because there are so many architectural and procedural changes. The advice here is don't stop just because you've got all your users on Exchange 2010 mailboxes. Make sure you address legacy issues such as address lists, the Offline Address Book (OAB), public folder replication, Exchange email address policies, and the like.
Failure to complete a sufficient pilot. I suspect your IT department will struggle against timelines set by others in the business, as well as budgetary realities, and be pressured into moving quickly through the test phase into implementation. However, it's much easier to fix a problem during a pilot than it will be after you've rolled something out to your whole user base in production. "For an organization that's got a couple thousand mailboxes, I would recommend that the pilot lasts four to six weeks," McBee said. "And during all pilot tests, you test all functions of Exchange 2010, including moving databases between servers; including ensuring that the Client Access array is functioning properly if one of the Client Access servers or Hub Transport servers is offline, and ensuring that databases fail over automatically. And possibly most importantly, ensuring that your backup and your recovery procedures are working properly prior to putting production users on the system."
These are just a few of the gotchas with Exchange high availability that McBee talked about. You can view the web seminar on demand to learn about the other mistakes he's found in Exchange DAG deployments, along with plenty of other good advice about meeting your high availability goals in your Exchange environment. And you'll find more about setting up DAGs in your environment in Tony Redmond's "Exchange 2010: High Availability with DAGs" and Paul Robichaux's "Deploying Database Availability Groups in Exchange Server 2010."
So far, it doesn't look like Exchange Server 2013 will change the DAG story in any major way. Of course, the underlying Windows Failover Clustering in Windows Server 2012 can now support as many as 64 nodes, rather than the 16 nodes of Windows Server 2008, which is what set the limit of 16 members in a DAG for Exchange 2010. The question I have is, how many organizations would run 64 members in a DAG if it were supported? How many are even running 16 now? Good thoughts for discussion.