Trauma for Exchange 2013 servers when Managed Availability goes bad

The notion of a self-healing system is compelling. After all, wouldn't you like computer systems to take care of most of the mundane day-to-day tasks that can bog administrators down? And so we have Exchange 2013 Managed Availability, which is designed to monitor, assess, and rectify problems as they arise. Everything goes swimmingly until Managed Availability goes bad - then it becomes mangled availability because it can do some weird things, like causing BSODs for DAG member servers deployed in a multi-domain forest.

Managed Availability is one of the more interesting and perhaps compelling new features introduced in Exchange 2013. The idea is simple: to incorporate the ability to monitor, detect, and fix common problems that occur in a messaging system within the product so that it can, in a sense, take care of itself.

All new technology, even that which is extensively tested by being deployed as a fundamental part of the management framework used for Exchange Online, is prone to teething problems. Not all of the lessons extracted from the datacenter can be applied to the on-premises world and not all on-premises configurations can be replicated or tested within a massively scalable multi-tenant datacenter deployment as used by Office 365.

And so we come to some problems with Database Availability Groups reported by on-premises customers after the deployment of Exchange 2013 RTM CU2. CU2 was released on July 11, 2013 and then re-released (V2) on July 29 to fix a public folder permissions bug, before running into the MS13-061 security update fiasco on August 14. Those who have downloaded and installed the various kits and patches released by Microsoft might be forgiven if their faith in Microsoft’s testing processes has wavered just a tad. Microsoft responded by announcing that they would delay the release of Exchange 2013 RTM CU3 “to ensure that we have enough run time testing”.

But then the wheels seemed to come off the wagon when reports of DAG member servers experiencing regular BSODs started to circulate. To be fair, the problem had been reported well before CU2 was available but the pace of problems accelerated following the release of CU2 and CU2 V2. The problem only appears when Exchange is deployed inside multi-domain Active Directory forests. It's not clear if the problem occurs for standalone servers because no public reports have been filed to indicate that this might be so. I only run multi-role DAG member servers inside a single-domain forest myself, so I have not seen the issue.

Microsoft’s Scott Schnoll responded in the thread with a detailed description of how to disable the ActiveDirectoryConnectivityConfigDCServerReboot responder, a component of Managed Availability that handles problems that might occur in the connection between Exchange and the domain controller that a server uses to retrieve configuration information. Exchange stores a lot of information about the organization, servers, and all manner of settings in the Microsoft Exchange container under Services in the Active Directory configuration naming context, a setup that has served Exchange well since it was first used in Exchange 2000. Active Directory is not the problem. As SCOM reports:

The AD Health Set has detected a problem with <2013 Server> at 8/22/2013 7:17:10 PM. The Health Manager is reporting that ActiveDirectoryConnectivityConfigDCProbe/Server Failed with Error message: Received a referral to <contoso.com> when requesting <abc.contoso.com> from <dc1.contoso.com>. 

This information tells us that the Health Manager service has detected that the ActiveDirectoryConnectivityConfigDCProbe probe failed when it attempted to read configuration information from Active Directory. In this case, it seems like the probe failed for no good reason, leaving Exchange with an apparent problem to resolve. Not being able to retrieve accurate configuration data is a catastrophic problem for Exchange because it can lead to messages being routed to the wrong place and myriad other problems. Managed Availability therefore attempted to rectify the problem by invoking whatever actions are defined to cure such a situation and rebooted the server (in case of doubt, a nice server reboot clears everything out and starts afresh). Hence the BSODs.
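If you want to see what Managed Availability thinks of the AD health set on one of your own servers, something like the following sketch might help. It assumes you run it in the Exchange Management Shell on Exchange 2013; the server name EXSRV1 is a made-up placeholder, and the health set name "AD" is taken from the SCOM alert quoted above.

```powershell
# Hypothetical server name (an assumption for illustration); use one of your DAG members.
$Server = "EXSRV1"

# Show the state of the monitors in the AD health set on that server.
Get-ServerHealth -Identity $Server -HealthSet "AD" |
    Format-Table Name, TargetResource, AlertValue -AutoSize

# List the probes, monitors, and responders that make up the AD health set,
# so that you can see which responders are defined for it.
Get-MonitoringItemIdentity -Identity "AD" -Server $Server |
    Format-Table Name, ItemType, TargetResource -AutoSize
```

A monitor whose AlertValue is Unhealthy is the kind of condition that leads Managed Availability to invoke a responder.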

The fix is to use the Add-GlobalMonitoringOverride cmdlet to create an override that disables the ActiveDirectoryConnectivityConfigDCServerReboot responder. This command does the trick for Exchange 2013 RTM CU2 by specifying that the override only applies to build number 15.0.712.24.

Add-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.712.24"
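Because the override is tied to a specific build, it is worth checking which build your servers actually run and confirming that the override exists. The sketch below assumes the Exchange Management Shell on Exchange 2013 and the same identity used in the command above. The removal step is shown with -WhatIf so that it only reports what it would do; treat it as something to run for real only after you have installed a build that contains Microsoft's fix.

```powershell
# Check the build of each server; the override above only applies to 15.0.712.24 (CU2).
Get-ExchangeServer | Format-Table Name, AdminDisplayVersion -AutoSize

# List the global monitoring overrides currently defined in the organization.
Get-GlobalMonitoringOverride | Format-List

# Preview removing the override (drop -WhatIf to actually remove it once a fixed build is in place).
Remove-GlobalMonitoringOverride -Identity "Exchange\ActiveDirectoryConnectivityConfigDCServerReboot" -ItemType Responder -PropertyName Enabled -WhatIf
```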

Apparently Microsoft is working on a more permanent fix for the problem. Who knows... we might see it in Exchange 2013 RTM CU3.

[Update 19 Sept: The bug is formally described in KB2883203]

Some might ask why Microsoft’s commitment to deploy and use code in Office 365 before releasing it to customers didn’t catch a problem like this. The answer is simple. Customers run an on-premises environment where an Active Directory forest supports a single Exchange organization. Office 365 does not. Therefore the code path that caused the ActiveDirectoryConnectivityConfigDCProbe probe to misbehave in the way that it did was never exercised by Office 365. The question then is why Microsoft’s dogfood on-premises deployment didn’t catch the problem. We don’t have a good answer to that question right now.

Doubts persist as to the quality controls that surround how Microsoft releases new builds of Exchange to its paying on-premises customers. That is both sad and regrettable. Until Microsoft gets its quality under control, you should play safe and a) test any new code that is released to make sure that you have a good chance to detect any lurking problems and b) wait at least six weeks before deploying any new version of Exchange 2013 into production. Give someone else the chance to be the hero running software on the bleeding edge.

Follow Tony @12Knocksinna

Discuss this Blog Entry 19

on Aug 27, 2013

The whole 2013 saga unfortunately shows how immature this product is. Actually, immature may not be the perfect description. It looks like the actual architecture is flawed, making Exchange fragile. Unfortunately, due to the way the product was "designed", frankly I don't see an easy (if any) way out, so as posted before, be prepared for major outages if you have 2013.
Unless you have no choice, stay with 2010 - even for new deployments. Don't be a test platform.

on Aug 27, 2013

@Keruzam
I disagree with you. Exchange 2013 On-Premises is a great product, maybe the best Exchange so far, with new features. We are only at RTM now; we should give it a chance once SP1 arrives. You should always test in the lab first and then go to production, and that goes for Exchange 2013 RTM CU2 too. We have Exchange 2013 RTM CU2 On-Premises in production now and we are satisfied :-)

on Aug 27, 2013

Patel

Great - it all depends on how much time and money you are willing to invest and what you need.
Given the track record of 2013 from release to date, you will find the camp divided on how great the product is. IMHO production should be running on 2010 and 2013 should be in a test environment until SP1.

on Aug 27, 2013

... plus if you have Office365 you can use Office for Mobile (iOS and Android) - both very nice apps.

on Aug 27, 2013

First of all, let me say that I don't think Exchange 2013 is flawed. It is as good as any software is - which means that some bugs exist, are becoming better known, and can be accommodated in any sensible deployment plan. So I think that you can deploy Exchange 2013 in perfect confidence provided that you put sufficient effort into the planning and deployment of your organization. That being said, I also think that Office 365 is a natural choice for many companies who a) don't specialize in email as an IT discipline and don't have the capacity, capability, or desire to upgrade their skills or b) have the required funding to finance the migration project. Many of these companies are in the small to medium (less than 5,000 seats) category. As always in life, you have to weigh up the good with the bad to make the best possible decision for your company. Exchange 2013 on-premises is certainly viable, as is Office 365.

on Aug 28, 2013

Tony

"Fragile" ... a bit of an "air-gap" from other services and we would not have BSODs ... Frankly BSOD was not really due to a "flaw" in Exchange (I think) but due to "misunderstanding" between AD and Exchange. Now for a product to disable itself due to such an event - "fragile" as description is not out of place.
Exchange is relying on other services a bit too much, and as such it is exposed to more than its own problems. A good exercise would be to create a "standalone" Exchange server and go from there. Impossible ??? Actually very hard but not impossible.

on Aug 29, 2013

@Keruzam, first of all the problem only happens for multi-domain forests. Second, you have to expect that issues will occur when new technology is introduced. This is the lesson learned from the past and it will be the same in the future. Third, the situation underlines once again the importance of testing new releases inside environments that mimic your production environment before deployment. I disagree that Exchange relies too much on other services. It has relied on Active Directory since 1999 and that connection has been pretty solid. This is a blip. Regrettable and shouldn't have happened, but just a blip.

And BTW, do you feel that you have to comment on every single one of my blog posts? I must really be connecting with you...

on Aug 29, 2013

The real reason for my replies is my hope that Microsoft is secretly reading my posts and will use my wisdom to improve their product. Seriously I find your blog informative and entertaining, as far as posting I have nothing to do on the train.

And I agree MA is a "killer" feature as currently implemented; luckily it only impacts multi-domain forests.
I am certain that only a limited number of deployments have this config.

on Aug 27, 2013

Enterprise customers usually start at around 1,000 seats and go up. I know several Enterprise customers that have deployed Exchange 2013 RTM CU2 On-Premises and they are happy. Remember that they also have test environments, so everything is tested first before production deployment.

on Aug 27, 2013

The number that people regard as "enterprise" varies from country to country. Small countries regard smaller numbers as enterprise accounts. Larger countries such as the US put it in the 5,000-10,000 seat range. Your mileage might vary...

on Aug 27, 2013

Bottom line is Exchange 2013 On-Premises is a great product for Enterprise customers world-wide today. We will see many Enterprises moving to Exchange 2013 SP1 On-Premises once we have SP1 ;-)

on Aug 28, 2013

Patel

If it's not too much to ask ... can you name a few killer features in 2013 vs. 2010?
As far as Office365 with all new and extra perks you will get (SkyDrive OCR), no escape really.

on Aug 28, 2013

I think Managed Availability is a killer feature in Exchange 2013. It has had some obvious issues but the potential is absolutely there. I also think that EAC offers a lot of value and that modern public folders is a solution for what has been a huge problem for Exchange in the past. Then there's the namespace simplification, DLP, DAG improvements, and so on...

on Aug 29, 2013

It is a great product undoubtedly.

on Sep 4, 2013

Tony,

This may sound strange, but I have to ask. We are currently on Exchange 2010 with Exchange 2013 CU2 coexisting. We ran into the BSOD issue and ran the script that Microsoft posted here:

http://social.technet.microsoft.com/Forums/exchange/en-US/44d1cd98-cba1-4ed0-b0e7-8aa76ee3eabc/bsod-after-creating-dag.

This may sound really strange, but I have to ask.... should we roll back to CU1 or should we wait until SP1 comes out? Do you know if there even will be a SP1? I'm a big Exchange fan and trying to get this into production as fast as possible.

on Sep 4, 2013

@illtill, I would persist with CU2 and upgrade to CU3 when it is available (reasonably soon). One hiccup is no reason to lose out on the other benefits in CU2.

on Sep 5, 2013

Tony,

Thanks for the advice and quick reply. I appreciate it.

on Sep 6, 2013

Just deployed CU2 this week, we did not see bugcheck with CU1 but we do with CU2, implemented the workaround in this article for now....product is definitely buggy!

on Sep 7, 2013

@steve710, isn't all software buggy? The trick is to know what parts are fragile and to then take proactive steps to offset any potential problems. The value of the community is that we discover these problems and share information so quickly, so even though you experienced the BSOD, you knew what to do about it.


What's Tony Redmond's Exchange Unwashed Blog?

On-premises and cloud-based Microsoft Exchange Server and all the associated technology that runs alongside Microsoft's enterprise messaging server.

Contributors

Tony Redmond

Tony Redmond is a senior contributing editor for Windows IT Pro and the author of Microsoft Exchange Server 2010 Inside Out (Microsoft Press) and Microsoft Exchange Server 2013 Inside Out: Mailbox...