The Exchange Online outage that affected some users in the Asia-Pacific (APAC) region on March 16 (March 15 in the U.S.) must be a candidate for the most under-reported cloud outage so far this year. Of course, I totally acknowledge that I jinxed things by publishing an article the same day to commend Microsoft on six months of solid operation. I apologize to my friends in Redmond and promise to rub a rabbit’s foot before commenting on Office 365 operations in the future.
Getting back to the outage, it seems that the problem was DNS-related. A problem occurred in Microsoft’s Singapore datacenter that should have resulted in a smooth failover to servers in the Hong Kong datacenter. However, some DNS glitches got in the way, and subscribers in Australia and New Zealand reported that they couldn’t access their Exchange Online mailboxes – or indeed that their accounts no longer had a mailbox!
Tweets (once again the most reliable tracking mechanism for cloud outages) show that Microsoft reported that they were investigating the outage at 2:15PM (March 15, PST) and then that services had been restored at 3:36PM. Thus, the outage seems to have lasted about 80 minutes. Not a total disaster, and certainly better than many on-premises IT teams might have managed when transitioning service between two datacenters, but still not helpful to Office 365’s SLA performance when compared against its arch-rival Google.
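To put the numbers in context, here is a rough back-of-the-envelope sketch of the arithmetic. It assumes the two tweet times bracket the outage exactly, that affected users were down for the whole window, and that nothing else went wrong in March. The 99.9% figure is an assumed monthly target used purely for illustration; it is not taken from anything Microsoft said about this incident.

```python
from datetime import datetime

# Times as reported in the tweets (March 15, PST).
FMT = "%I:%M%p"
investigating = datetime.strptime("2:15PM", FMT)
restored = datetime.strptime("3:36PM", FMT)

# Length of the outage in minutes.
outage_minutes = (restored - investigating).total_seconds() / 60

# March has 31 days, so this is the total number of minutes in the month.
minutes_in_march = 31 * 24 * 60

# Availability for an affected user, assuming no other downtime in March.
availability = 100 * (1 - outage_minutes / minutes_in_march)

# ASSUMPTION: a 99.9% monthly target, used only to illustrate the budget.
ASSUMED_SLA = 99.9
allowed_minutes = minutes_in_march * (1 - ASSUMED_SLA / 100)

print(f"Outage: {outage_minutes:.0f} minutes")
print(f"March availability for affected users: {availability:.3f}%")
print(f"Downtime allowed at {ASSUMED_SLA}%: {allowed_minutes:.1f} minutes")
```

On these assumptions, a single incident of this length is enough to use up more than a month's worth of downtime under a 99.9% target for the users who were affected.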
Similar issues have caused problems for Office 365 before, as DNS updates were the root cause of the September 8, 2011 outage. Anyone with knowledge of Exchange 2010 must suspect that the DNS problem here lay in switching the DNS records that point to the Client Access Server (CAS) array used by APAC Exchange Online users so that they pick up servers in Hong Kong rather than Singapore. Microsoft hasn’t shared details of the outage with me, nor have they released any details of their investigation, but the OWA error above tends to indicate that some problem occurred between CAS and mailbox servers.
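For anyone who wants to see what "the DNS records now point at the other datacenter" means in practice, here is a minimal sketch of the kind of check an administrator might run after a failover. Everything specific in it is hypothetical: the host name, the addresses (taken from ranges reserved for documentation), and the helper names are mine, and none of it reflects how Microsoft actually names or monitors its CAS arrays. The demo uses a stand-in resolver so that it runs without touching the network.

```python
import socket

def resolved_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the set of IP addresses that hostname currently resolves to."""
    # Each getaddrinfo result is (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr.
    return {info[4][0] for info in resolver(hostname, None)}

def points_at(hostname, expected, resolver=socket.getaddrinfo):
    """True if every resolved address belongs to the expected datacenter."""
    addresses = resolved_addresses(hostname, resolver)
    return bool(addresses) and addresses <= set(expected)

# HYPOTHETICAL addresses for the two datacenters (documentation ranges).
SINGAPORE = {"192.0.2.10", "192.0.2.11"}
HONG_KONG = {"198.51.100.10", "198.51.100.11"}

def fake_resolver(hostname, port):
    """Stand-in for DNS: pretends the record already points at Hong Kong."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("198.51.100.10", 0))]

# HYPOTHETICAL CAS array name, for illustration only.
CAS_ARRAY = "cas-array.example.com"

after_failover = points_at(CAS_ARRAY, HONG_KONG, resolver=fake_resolver)
still_singapore = points_at(CAS_ARRAY, SINGAPORE, resolver=fake_resolver)
print("Points at Hong Kong:", after_failover)
print("Points at Singapore:", still_singapore)
```

Dropping the `resolver` argument makes the same functions query real DNS. Bear in mind that clients may keep using cached answers until the record's TTL expires, which is one reason why a DNS-based switch between datacenters is rarely instantaneous.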
All of this goes to prove that hiccups can occur in even the best-run service. It also demonstrates that Microsoft has some work to do to make datacenter transitions for Exchange 2010 more seamless and automatic than they currently are. The good news is that the pain that Microsoft feels in their online service will result in the dedication of engineering resources to make the problem go away (or at least lessen), to the eventual benefit of users of both on-premises and cloud deployments.
It’s curious that the U.S. media, normally ultra-interested in all things Microsoft, didn’t make more of this outage. Perhaps outages don’t count when they only affect users at the other end of the world?
Follow Tony’s ramblings via Twitter.