Cloud services have been around for years, but only now are they a realistic choice as the computing platform for many applications. Early cloud services were focused on consumers and it was only after the launch of SalesForce.com that businesses, including major software and hardware vendors, began to focus on the task of transforming IT to accommodate both on-premises and cloud models. Recently we have seen an unfortunate storm of highly visible failures across an array of cloud services from multiple vendors, and these failures have caused people to wonder whether it’s wise to consider moving work to the cloud now.
Consumer cloud services, especially those that are free, have much higher user tolerances when problems occur. If Picasa or Snapfish fail to upload a photo because of a network glitch, you’re likely to blame “the Internet” or some other reason and simply restart the upload. When a business cloud service suffers the same kind of failure (for example, a message cannot be sent or an attachment uploaded), user tolerance isn’t so obvious. Yet both consumer and business cloud services depend on the same loosely-coupled Internet that no one really manages, so why should we be surprised when glitches occur? With this in mind, let’s look at some of the recent cloud events to see whether something can be learned to be better prepared for the future.
A recent outage for Microsoft’s Business Productivity Online Suite (BPOS) in August 2011 occurred when a transformer belonging to the national electricity network failed in Dublin. BPOS hasn’t had a great record for stability, so much so that Google was able to count some 113 unplanned incidents during 2010. An unplanned incident covers a variety of sins, but 113 in a year, or roughly one every three days, isn’t a track record to which anyone would aspire. It’s fair to say that BPOS ran software (such as Exchange 2007) in 2010 that was never designed to support the scale and complexity of cloud infrastructures. Older applications don’t function so well when asked to run in the cloud simply because they are usually designed to run inside the well-known boundaries of standard corporate deployments. It’s therefore a good thing to select applications that have been purpose-built for the cloud or those that have been re-engineered to support cloud infrastructures. SalesForce.com is an example of an application in the former category; the versions of SharePoint, Lync, and Exchange that run in Office 365 are examples of the latter.
Amazon has a great record for its online stores and is also in the business of selling compute power to third parties. The same failure that affected BPOS also struck Amazon’s European Datacenter and caused three major services (Elastic Compute Cloud (EC2), Elastic Block Store (EBS), and Relational Database Service (RDS)) to go offline. In this instance, the backup generators in the datacenter failed to restore power and the services stayed down. Amazon’s cloud services provide the fundamental underpinning for the applications of many other companies. Amongst others, the failure brought Reddit, FourSquare, Quora, and Indaba Music crashing to a halt. The lesson here is perhaps that concentrating so much computing horsepower in a relatively small number of massive datacenters might be compared to putting all one’s eggs in a single basket.
In its short production lifetime since its June 28 launch, Office 365 has experienced two very public outages, in August and September 2011, totaling some 330 minutes. These outages undermined Microsoft’s reputation for high-quality operations of a new service that is supposed to fix all of the problems that customers previously experienced with BPOS. The first problem affected Exchange Online users in North America because a network component failed and wasn’t backed up with redundant hardware. The second failure afflicted users across the world because a DNS configuration change went horribly wrong (the same problem affected Hotmail, Azure, and other Microsoft cloud services). Given the billions of dollars that Microsoft has invested in datacenters around the world and the software engineering to make its products cloud-ready, failures in basic operational disciplines are surprising and unwelcome. The most optimistic view is that these failures are merely growing pains and that Office 365 will prove itself to be a highly robust service over time. The good news is that Office 365 has survived a whole five weeks in production without a further outage, so things could be on the mend. We shall see.
RIM BlackBerry is the latest cloud service to suffer a meltdown, experiencing four days of degraded service for customers that spread like a ripple across a pond after the failure of a “core Cisco switch” in the UK followed by the corruption of an Oracle database. Ever since its inception, RIM has exerted close control over its service through a set of Network Operations Centers (NOCs). The function of the NOCs is to handle message traffic from BlackBerry devices: messages are transported over mobile networks back to the NOCs, where they are processed and routed to their final destination. In this instance, the failure in a UK-based NOC was not handled by a failover to different hardware, and the subsequent problems in processing messages due to the corrupt database caused a huge backlog to accumulate, in turn slowing the RIM network and delaying message delivery to users. This outage is similar to another experienced by RIM on April 17, 2007, when a software upgrade didn’t deliver the expected results and a backup system failed, leading to a huge backlog of email.
RIM has built its reputation on bulletproof messaging, and losing service for up to four days just heaped woes on a company already struggling to refresh its device lineup and cope with the lack of success that its PlayBook has had in the market.
Some commentators have reflected that RIM’s problems might be the result of a failure to invest in sufficient capacity to handle their recent success in convincing consumers to choose BlackBerry over other devices, often because of BBM, the inbuilt BlackBerry Messaging service. After all, if your friends have BBM you want it too and everything goes swimmingly until a problem comes along to disrupt service.
Now boasting availability figures of 99.96% in the first six months of 2011, Gmail is doing very well. Even so, Gmail did have a significant issue in February 2011 when some 20,000 users “lost” their inbox for a period. The Google Documents application has also had problems, the latest coming on September 9 when a software change exposed a memory management bug. In this case the outage only lasted 30 minutes before Google restored access for users to their documents. Generally Google has demonstrated the ability to deliver reliable cloud services, even if you might quibble with the user interface of their applications.
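To put figures such as 99.96% into perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustrative calculation: the function names are invented here, the 181-day window is simply the number of days from January 1 to June 30, 2011, and measuring the 330 minutes quoted for Office 365 against that same window is a hypothetical comparison, not a published availability figure.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct, days):
    """Minutes of downtime implied by an availability percentage over a period."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (1 - availability_pct / 100)


def availability_from_downtime(downtime_minutes, days):
    """Availability percentage implied by a given amount of downtime over a period."""
    total_minutes = days * MINUTES_PER_DAY
    return 100 * (1 - downtime_minutes / total_minutes)


# First half of 2011: January 1 to June 30 is 181 days.
half_year_days = 181

# 99.96% availability over that window leaves roughly 104 minutes of downtime.
print(round(allowed_downtime_minutes(99.96, half_year_days), 1))

# Hypothetical comparison: 330 minutes of outage measured over the same
# 181-day window would correspond to roughly 99.87% availability.
print(round(availability_from_downtime(330, half_year_days), 2))
```

Seen this way, the gap between 99.96% and 99.87% looks small on paper but amounts to several hours of lost service.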
So what can we learn from all the fuss and bother that flows from cloud outages, so that we are better prepared for the future?
Handing over control of an application to a cloud provider makes exquisite sense for many companies and can return real benefits in terms of finance and operational efficiency. However, the old adage “look before you leap” rings true, and the Boy Scout motto “Be Prepared” is also valuable when plans are put in place to execute the transition and elevate into the cloud.