Complex solutions to IT problems are a wonderful rabbit hole in which system designers can hide away while they hatch their plans for grand deployments. However, most of the time the complexity adds very little if any value to an application like email, which by now is a utility application for many. So the question has to be asked, why do we all tend to implement more complexity in our solutions than we really need to? And can we stop? Microsoft seems to be doing a good job of eliminating complexity in Office 365. Perhaps this is something we should take on board?
One of the subtle but persistent topics taken up by many Microsoft speakers at the recent Microsoft Exchange Conference (MEC) was the need to eliminate complexity in implementation. Clearly a feeling exists within the denizens of Redmond that customers get their Exchange deployments wrong too often because they insist on over-complicating matters. There’s probably a fair degree of truth here as IT designers, administrators, and consultants do love to engineer complexity into solutions, which might be the reason why Microsoft has laid out its preferred architecture for Exchange 2013 in a well-reasoned EHLO post. The key message to take away is that simplicity matters. A lot.
Don’t get me wrong. Complexity has its place and there are times when a complex well-thought-out solution is absolutely the right answer. In fact, it can be a joy when the many moving parts of such a solution mesh together into a seamless whole to deliver the desired outcome.
But email is email. And at this point in its evolution, email is a utility application. As such, there’s a reasonable argument to be made that email servers should be designed as simplified components that slot together simply to provide the utility. In fact, the theme of simplification is writ large in developments such as a reduction in server roles, a reduction in client protocols (HTTP's predominance has been emphasized by MAPI over HTTP), and the introduction of the simplified DAG.
Remarks like “complexity breeds failure” (Ross Smith IV) and “humans are the biggest threat to recovery times” (Greg Thiel) were scattered across MEC sessions. There’s no doubt that Microsoft takes its own advice in the way that Office 365 is managed. I imagine that quite a few of the comments made at MEC flowed directly from experience of Office 365 operations. However, few companies have the people, development capability, investment, or automation skills necessary to deliver Exchange in the way that is done in Office 365, so the impact of some of these remarks might have been lost in a feeling that “ah, it’s OK for Microsoft to say that, but it could never work for us…”
Although Office 365 is in a category of its own, we shouldn’t ignore the learning that flows from its operations. I’ve already explored some of the other Office 365-influenced advice given at MEC, such as reducing the number of NICs on DAG member servers and increasing the use of lagged database copies, and I believe that cloud experience will continue to exert an influence over the architectural advice that you hear from Microsoft.
To counter the argument, it’s absolutely true that no customer deployment resembles Office 365 and that you cannot simply take anything that applies to Office 365 and accept it as the “way to do things.” For instance, Office 365 uses single-role Exchange 2013 servers whereas it makes much better sense for customers to run multi-role servers because of the better resilience and hardware utilization that can be achieved. And in fact, the "preferred architecture" explicitly calls out that it is based on the principle that "all servers are physical, multi-role servers." Keen observers will note the lack of fondness for virtualization here, the logic being that hypervisors add another layer of complexity for configuration, management, and operations.
Another difference that exists in Office 365 is the method used to apply patches and software upgrades. Servers are stripped literally to bare metal and then rebuilt from scratch, something that most IT shops couldn’t do because of the lack of automation. However, this is an imperative for Office 365 because of its scale and the number of servers to be managed. Imagine applying a hot fix to 100,000 servers. Just imagine.
Getting back to the point about simplification, the case for it as a guiding principle for architects was eloquently made by Boris Lokhvitsky, who used his background as a math professor to point out that we can all chase our tails by attempting to use complex solutions to achieve high availability. The point was simple. Many individual components contribute to Exchange. If you want to achieve 99% uptime, then you have to consider how each component contributes. If your storage delivers 99% and the network the same, then you have roughly 98% overall availability (0.99 × 0.99 ≈ 0.98), because both must be working for the service to be up. If you add server uptime to the mix and then add software, you drive down the overall availability further unless these components can approach 100% uptime. In short, the more moving parts in a solution, the lower its overall availability is likely to be because more stuff can break. And, to get back to the point Greg Thiel made, the more stuff there is for humans to mess up.
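The arithmetic behind that argument is easy to sketch. The snippet below is only an illustration, assuming that every component must be up for the service to be up and that components fail independently of each other. The function name and the 99% figures for servers and software are my own assumptions for the example, not numbers quoted at MEC.

```python
from math import prod


def serial_availability(component_availabilities):
    """Overall availability when every component must be working.

    Assumes independent failures, so the individual availabilities
    simply multiply together (a series model).
    """
    return prod(component_availabilities)


# Storage and network at 99% each, as in the example above.
storage_and_network = serial_availability([0.99, 0.99])

# Hypothetical: add servers and software, also at 99% each.
four_components = serial_availability([0.99, 0.99, 0.99, 0.99])

print(f"storage + network:         {storage_and_network:.4f}")
print(f"plus servers and software: {four_components:.4f}")
```

Under these assumptions, each extra component that is anything less than perfect drags the total down a little more, which is the whole case against adding moving parts that you don't need.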
Another math-based discussion of the problem can be found in the EHLO post "DAG: Beyond the A" (which has Ross Smith's name on it but was apparently authored by Boris Lokhvitsky). The post was written about Exchange 2010 but the lessons can be taken forward to Exchange 2013. If you're not convinced after reading this epistle on availability, you probably never will be. Or perhaps, as happens to me, detailed math debate will lead to rapid eye closure and instant sleep.
The point therefore is to design simplicity into solutions whenever possible and avoid complexity at all costs. Simplify matters by using one kind of server instead of several, one kind of storage instead of a mix, one monitoring framework instead of many, one layout of Exchange components across servers (like databases always located on certain volumes), and so on. It’s hard to do this because we are all handicapped by the pressures of cost, time, and previous decisions, but it is an excellent principle to take forward and use as opportunities arise to improve your deployment.
Follow Tony @12Knocksinna