Now that calm and normal operations have resumed following the 190-minute outage on August 17, it's an appropriate time to reflect on the support experience enjoyed by companies and users alike. Although CIOs might appreciate Microsoft’s rapid offer of a 25% rebate on monthly fees in compensation, they will have been less impressed by the performance of the support ecosystem for Exchange Online. The rebate is actually the minimum credit called for by the Office 365 Service Level Agreement (SLA) and can be requested by customers once the service level delivered by Microsoft dips below 99.9%. In this case, Microsoft did the right thing by immediately accepting that the problem was theirs and extending the 25% credit without forcing customers to go through the bureaucracy of submitting a claim detailing the incident, the number of affected users, and so on.
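To see why the 25% credit kicked in automatically, it helps to run the numbers behind that 99.9% figure. The sketch below is my own back-of-the-envelope arithmetic, not Microsoft's actual SLA formula: it assumes a 30-day month and simply converts the uptime percentage into a downtime budget.

```python
# Rough SLA arithmetic for the outage described above.
# Assumptions (mine, not from the SLA text): a 30-day month,
# and availability measured simply as uptime minutes / total minutes.

MINUTES_IN_MONTH = 30 * 24 * 60  # 43,200 minutes in an assumed 30-day month

def allowed_downtime_minutes(sla_percent: float) -> float:
    """Downtime budget implied by an uptime SLA over one month."""
    return MINUTES_IN_MONTH * (100.0 - sla_percent) / 100.0

def sla_breached(outage_minutes: float, sla_percent: float = 99.9) -> bool:
    """True if the outage alone exceeds the month's downtime budget."""
    return outage_minutes > allowed_downtime_minutes(sla_percent)

budget = allowed_downtime_minutes(99.9)
print(round(budget, 1))        # about 43.2 minutes of slack per month
print(sla_breached(190))       # True: a 190-minute outage blows the budget
```

Under these assumptions, a 99.9% SLA allows roughly 43 minutes of downtime per month, so a single 190-minute outage breaches it by a wide margin; there was never any doubt that the credit was owed.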
Getting back to support, the first thing to realize is that support is a difficult, difficult job. Don’t let anyone tell you otherwise: whoever thinks that support is easy has clearly never worked in the role. The delivery of good support depends on good people backed up by tools that help the support team detect, analyze, and rectify problems as quickly as possible. Automation and knowledge bases play a huge part too, as many support issues have been seen before and can be dealt with by automated processes.
Aside from the heavy lifting that goes on as support teams determine the root cause of a problem and how best to fix it, the way a team communicates with users marks out really good support from the merely efficient. After all, if users know that the support team is working an issue and is confident of fixing it soon, they will be much happier than if fragmented snippets of misleading information are released on an ad hoc basis. The actual problem might well be fixed in the same time, but the user experience is so much better when users are kept as up to date as possible. Cynics will observe that users will only be truly happy once service is restored. There’s a fair amount of truth in that stance, but I like to think that humans react well to focused, precise, and accurate communication when systems that they depend on fail and they’re waiting for support to fix the problem so that they can get back to work.
Regrettably, the Office 365 support story wasn’t so good on August 17. First of all, Microsoft’s own records indicate that the initial network glitch was first seen around 11:30AM PDT, but the Service Dashboard continued to show that everything was functioning smoothly until well past noon. This might have been because the network fault didn’t stop communications within Microsoft’s datacenter, but users lost connections with both Outlook and Outlook Web App (OWA) well before Microsoft acknowledged that a problem existed by posting incident EX440 for Exchange Online. The party line was that Microsoft was “investigating”, which might have been a reasonable response had the incident been posted earlier rather than an hour after connections had started failing. The impression given was of a pile of headless chickens running around looking for the root cause with little success. This is unfair, as I am sure that a great deal of hard work was going on, but it’s the external-facing impression that counts when you discuss how good or bad the support experience is for customers.
Inevitably, phone support lines buckled under the demand generated by thousands of users who wanted to know why they couldn’t get to their email. Those who could get through didn’t find out much more about the incident, possibly because the agents manning the phone lines had only the same information that was reported on the Service Dashboard. In other words: sit tight while we investigate. Some users reported that they couldn’t even get through to view the Service Dashboard and so were utterly in the dark, without even the calming notion that someone was indeed investigating.
The tweets that the Office 365 team fired out weren’t too helpful either and were swamped by a flood of customer tweets complaining about the outage, Office 365, the Service Dashboard, and life in general. All parts of Microsoft appeared to be handicapped by a general lack of information about what might be happening until 2:21PM PDT, when the Service Dashboard reported that the connectivity problem was with the North American datacenter and that Microsoft was working to resolve the issue as soon as possible. Although it was good to know that people were working the issue, users who were by then some 171 minutes into the problem probably expected a more definitive indication of what had happened and when service might be resumed.
To be fair to Microsoft, their efforts eventually restored service and users began slowly to reconnect from 2:40PM PDT onwards. The Service Dashboard continued to have connectivity problems of its own and frustrated those responsible for administering Office 365 for their companies. Users couldn’t care less, though, because they could connect to send and receive email once more.
I’m sure that Microsoft has conducted a full post-mortem on this incident to learn how they can improve in every phase of the support cycle. As they work through what occurred and who did what to restore service, I hope that they pay attention to customer communications and make sure that components such as the Service Dashboard don’t fail. Office 365 is only at the start of its ramp-up, and I suspect that millions of mailboxes will move into the cloud over the next year, with the bulk of them coming from companies that run Exchange 2003 on-premises or BPOS today. It would be nice to know that Microsoft will expand the capabilities of its support ecosystem in line with the growth in mailboxes.
On the upside, the SharePoint Online and Lync Online components of Office 365 did not suffer during the recent outage. Could this be because Exchange Online is the only application that anyone ever uses in Office 365? Nah, that couldn’t be the case…