Details are slowly emerging about the DNS problem that caused the massive cloud failure for all of Microsoft’s hosted services on September 8. According to a statement given to Arstechnica.com, Microsoft said:
“On Thursday, September 8th at approximately 8 p.m. PDT, Microsoft became aware of a Domain Name Service (DNS) problem causing service degradation for multiple cloud-based services. A tool that helps balance network traffic was being updated, and for a currently unknown reason, the update did not work correctly. As a result, the configuration was corrupted, which caused service disruption. Service restoration began at approximately 10:30 p.m. PDT, with full service restoration completed at approximately 11:30 p.m. PDT. We are continuing to review the incident.”
Hmmm… this isn’t good. DNS is hardly an unknown corner of the Internet and has been around for nearly 30 years, so it’s surprising that Microsoft has problems managing it. CIOs considering a transition from on-premises IT to cloud services might well ask why the tools used to manage such essential pieces of the infrastructure are not more robust, and better tested to ensure that a malfunction cannot have a huge impact on services.
But then again, any IT department can have a bad day. It’s just that millions suffer when the folks running the cloud services offered by Microsoft, Google, IBM, Yahoo, Amazon, or anyone else have a bad day. The conclusion has to be that the procedures used to manage these services must be automated and robust enough to remove the potential for error insofar as is possible.
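The "automated and robust" discipline boils down to never promoting a configuration that hasn’t been validated, and keeping the last known-good state to fall back on. A minimal sketch of that pattern (the function and validator names here are hypothetical illustrations, not Microsoft’s actual tooling):

```python
# Hypothetical validate-then-apply pattern: an update is only promoted if the
# merged configuration passes validation; otherwise the known-good state wins.

def apply_config_update(current: dict, update: dict, validate) -> dict:
    """Return the new configuration if valid, else the last known-good one."""
    candidate = {**current, **update}
    if validate(candidate):
        return candidate   # promote the validated configuration
    return current         # roll back: keep the known-good configuration

# Illustrative validator: every traffic-balancing entry must name a target pool.
def validate(config: dict) -> bool:
    return all(isinstance(pool, str) and pool for pool in config.values())

good = {"www": "pool-a", "mail": "pool-b"}
bad_update = {"mail": ""}  # a corrupted entry, as in the reported outage

result = apply_config_update(good, bad_update, validate)
# The corrupted update is rejected, so `result` is still the good configuration.
```

The point is not the ten lines of Python but the shape: validation happens before the change takes effect, and a failed validation leaves service untouched rather than degraded.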
I’m sure that Microsoft fully intended that this should have been the case for the June 28 launch of Office 365.
Last April, I published some notes from a keynote given at “The Experts Conference” (TEC) by Kevin Allison, the General Manager responsible for the development and support of Exchange on both the on-premises and cloud platforms. At that time, Kevin commented that Microsoft faced logistical problems in building out its cloud datacenters with the servers and other components necessary to support the hundreds of thousands of users moving into the cloud. Microsoft also suffered from a shortage of experienced personnel to help customers make the transition, and I assume that this shortage extends into the datacenter as well. And then there’s the stress caused by the movement of tons of data from on-premises to cloud servers, leading to a constant rebalancing of load across available servers within the datacenters. These comments related to Exchange Online, but I’d hazard a guess that they might also apply to the whole of Office 365 and perhaps even to Azure, Hotmail, and whatever other cloud services share common pieces of the overall Microsoft cloud datacenter infrastructure.
Cynics who advise people never to deploy a new version of a Microsoft operating system or server application until the first service pack is available are probably chuckling merrily as they point out, to anyone who will listen, that cloud services are currently running the equivalent of a V1.0 infrastructure, so why should anyone be surprised at the problems? All will be well, they say, once Service Pack 1 for Office 365 and a few hot fixes appear and have been installed.
I hope the cynics aren’t right and that Microsoft demonstrates that Office 365 is capable of running for months without a hitch. Otherwise it will be dipping into the corporate coffers to issue service credits on a frequent basis. As I observed previously, the August 17 outage, which affected only North American users of Exchange Online, might have cost Microsoft anything from $1.25 million upwards in service credit refunds. Given that the September 8 outage affected users worldwide, the 25% refund required by the Office 365 Service Level Agreement (SLA) will cost Microsoft at least double the previous amount, so that’s a minimum of $3.75 million in lost revenue for Office 365 in just three weeks. I don’t know whether Microsoft has offered anything to Azure customers; those who use Hotmail, SkyDrive, and the like are, of course, experiencing one of the downsides of using a free service.
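For those checking the arithmetic, the figures work out as follows. The $1.25 million estimate for the August outage and the doubling for a worldwide outage are the assumptions stated above; this simply adds them up:

```python
# Back-of-the-envelope check of the service-credit figures quoted above.
# Assumptions from the article: ~$1.25M in credits for the August 17 outage
# (North America only), and at least double that for the worldwide
# September 8 outage under the 25% SLA credit.
august_credit = 1.25e6
september_credit = 2 * august_credit   # worldwide outage: at least double
total = august_credit + september_credit

print(f"Minimum credits over three weeks: ${total / 1e6:.2f} million")
```

Run as written, this prints a minimum of $3.75 million, matching the figure in the text; the real number could be higher, since both inputs are floors rather than exact amounts.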
To Microsoft’s credit, they communicated with customers quickly to offer the refund. Curiously, they say that processing a refund might take up to 90 days. Don’t they have a cloud service they can use to process the data? On second thoughts, maybe some well-managed on-premises SQL servers will do the trick.
Follow Tony’s ramblings via Twitter.