Planning for the unexpected in the post–9-11, post-Katrina age
IT disaster recovery planning is a brave new world these days. With superstorms slamming the coasts and terrorist attacks on US soil, IT experts have more than a full plate of potential scenarios to consider when preparing for disaster. Never mind the threat of Stephen King–like pandemics such as bird flu, which has made the rounds in recent years, not to mention your garden-variety fire or hard disk crash. And then there's the "media effect," which can turn a non-disaster such as Hurricane Rita into a full-blown IT cataclysm.
Add to this the fact that companies’ reliance on IT systems has never been greater. Everyone has a computer; some users have two or more. Many of us carry an additional computer on our belt in the form of a BlackBerry, other PDA, or smartphone. The complexity of large IT systems has increased to such a point that it’s hard for any one person to fully understand his or her company's hardware, software, and network configurations. Disaster planning and recovery of complex systems can involve managing teams of experts, including internal staff, third-party service bureaus, and system vendors.
Additionally, many IT systems are now distributed, including both the people and the infrastructure. Disasters on the other side of the world can reach your company and wreak havoc. For example, the recent outages due to undersea cables being cut in the Middle East affected many US companies. If you use overseas call centers, programmers, or even non-IT staff who require connectivity to your systems, you might have to be able to respond to global disasters.
All these factors require a new way of thinking if you want to keep a disaster from taking out your IT systems. The nature of IT disasters has changed, so you have to make sure that your disaster plans have evolved along with the threats. Let's discuss some of these areas of emerging change and strategies for dealing with them.
9-11, Katrina, and Rita—Oh, My!
A lot of disaster planning is as it always was—making sure your backups happen, keeping them offsite, and so on. But some valuable lessons have been learned through hard knocks in the past few years. Some high-profile disasters have caused us to rethink some of the standard elements in an IT disaster recovery plan.
Let’s consider the disaster recovery lessons of the past decade and a half. The Oklahoma City and 1993 World Trade Center bombings taught us that whole sites could be taken out. This brought about a new emphasis on hot sites, data mirrors, and replication. Whole new technologies and mini-industries were spawned by these events. The September 11, 2001, terrorist attacks taught us not to count on cross-country travel in the event of a major disaster. In response, some firms changed their disaster recovery plans to require that recovery personnel be within driving distance; others cross-trained employees so that trained personnel are in place in multiple locations. Hurricane Katrina in August 2005 showed us how a regional disaster can have effects far beyond the primary impact zone. After this, firms spread data centers out across the country or even internationally to mitigate the effects of a super-regional disaster. Both 9-11 and Katrina also showed the increasing effect of the media on disasters and disaster planning.
The media can exacerbate a real disaster or even turn a “non-disaster” into a disaster. A good example of this is Hurricane Rita, the lesser-known storm that hit Texas and Louisiana a few weeks after Katrina hit Louisiana, Mississippi, and Alabama. Rita was at one point a category 5 hurricane, the highest level and the same peak intensity as Katrina. Early on, it was predicted to make landfall near the Houston metropolitan area. Newscasts showed models that had half of Houston underwater. However, by the time Rita reached land, it had weakened considerably to a category 3 hurricane. This was still a bad storm but much less worrisome.
The storm directly caused seven deaths, but 100 more resulted from evacuation efforts and other indirect effects. The media, still whipped into a frenzy after Katrina, blew Rita so far out of proportion that it triggered a massive evacuation, including from areas that would not have been affected even if Rita had landed in Houston. It was the largest evacuation of a metro area ever attempted.
The roads became completely blocked for hundreds of miles in all directions. Cars ran out of gas as they idled in traffic jams and created more roadblocks. Supplies of essentials such as gas, water, and cash were depleted. The whole region became paralyzed, with the effects felt across Texas and into neighboring states. I was in Houston during this debacle, and people’s behavior was one of the scariest things I’ve ever seen.
All this for a non-disaster, at least as far as Houston companies were concerned. The Houston area had plenty of notice of a naturally occurring event that happens regularly in the vicinity, and the city experienced virtually no storm damage. The bottom line here is to remember to factor the media effect into your plans and be prepared to react without important people, supplies, and services for days or weeks, even after non-disasters.
Lessons to Live By
A lot of traditional disaster recovery plans depend on easy access to overnight shipping, cross-country travel, and locally available IT supplies. Many plans count on being able to FedEx new servers in or have people fly or drive to the recovery centers. But most delivery services were dead in the water during 9-11, Katrina, and Rita. 9-11 showed us how air travel can be suspended during times of crisis. Even putting disaster recovery sites within driving distance doesn’t always solve the problem. During the Rita scare, driving from Houston to Dallas, normally a 3-hour drive, took 15 to 20 hours or longer, assuming you could get gas. Many of my customers couldn’t get key people who lived across town into work. Your disaster recovery plan should take this into account, using those who live close by and not counting on those who might not be able to reach your primary or disaster recovery facilities.
Most businesses operate in a just-in-time (JIT) environment. They don't maintain large stocks of supplies or extra equipment; they expect anything they need to be available at any time, in any quantity, 24 × 7. Most plans I see today count on the local market holding large stocks of supplies and hardware and on overnight shipping for crucial spares. I can’t tell you how many plans include the words “Go to the local computer store” for key replacements. This is fine when a disaster affects only your company. But if it's city- or regionwide, don’t count on your local big-box retailer being open, much less having what you need. If your plan relies on a generator, make sure you have enough fuel onsite to run it for the length of outage you're planning for. Many of my clients found themselves with perfectly serviceable generators but no fuel to run them because fuel trucks couldn't break through the traffic jams during Rita. Luckily, most businesses didn’t need generators because the power stayed on. Remember that in a disaster or even a non-disaster, supply lines can be interrupted. Figure out ways to lessen your dependence on them for better disaster recovery results.
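To make the generator fuel point concrete, here is a minimal back-of-the-envelope sketch. The tank size, burn rate, and outage length are made-up figures for illustration only; substitute the numbers from your own generator's spec sheet and your own planning assumptions.

```python
def generator_runtime_hours(fuel_on_hand_gal, burn_rate_gal_per_hr):
    """Hours the generator can run on the fuel already onsite."""
    if burn_rate_gal_per_hr <= 0:
        raise ValueError("burn rate must be positive")
    return fuel_on_hand_gal / burn_rate_gal_per_hr


def fuel_shortfall_gal(fuel_on_hand_gal, burn_rate_gal_per_hr, outage_hours):
    """Extra gallons needed to ride out an outage with no resupply."""
    needed = burn_rate_gal_per_hr * outage_hours
    return max(0.0, needed - fuel_on_hand_gal)


if __name__ == "__main__":
    # Hypothetical figures: 100 gallons onsite, 4 gallons/hour at load,
    # and a planning assumption of 72 hours without fuel deliveries.
    print(generator_runtime_hours(100, 4))   # 25.0 hours of runtime
    print(fuel_shortfall_gal(100, 4, 72))    # 188.0 gallons short
```

If the shortfall is greater than zero, the plan is quietly counting on a fuel delivery that might not arrive.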
If an item is crucial enough that you can't do without it for a few days in a disaster, then you should keep a spare on hand. Some hardware vendors have programs that charge you a nominal cost for keeping a disaster recovery server boxed up at your site. The same goes for software vendors. Talk to them and see what they can do for you.
As I suggested above, another deficiency in many disaster recovery plans is their people element. Too many plans are long on the technical details of how to restore servers, services, and telecommunications but way short on detailed people plans. When a disaster strikes, employees will think of their family and health first and work last. Traffic congestion or lack of fuel could prevent employees from reaching offices. Access to entire areas might be closed by government officials. If the situation is dangerous or uncertain, don’t expect a full staff.
Don’t depend heavily on any one person. For example, make sure that more than one employee knows how to recover from your backup media. Consider copying your backups to remote sites hundreds of miles away (by using replication or snapshots). Or plan in advance with your local offsite storage vendor how to get tapes to your remote disaster recovery site. You can also set up remote working capabilities for employees (e.g., VPN, remote desktop) in advance to lessen the effects of transportation problems. Cross-train your technicians where possible so that they can pick up others' duties if needed. Make sure that procedures are fully documented so that someone else can perform them. These instructions should be detailed and written in language that someone with limited technical capabilities can follow. And although electronic documents are great, make sure you have a printed copy in a binder that key people know where to find. You can simulate reduced staffing conditions in a disaster recovery situation by performing a “tabletop test” (in which you throw various scenarios at your staff and see how they react) after randomly removing half the people from the room.
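The same idea can be checked on paper before anyone leaves the room. The sketch below is one possible way to do it, not a prescribed method: it assumes you keep a simple list of which staff are trained on which recovery procedure (the names and procedures here are invented), flags procedures only one person knows, and then randomly removes half the staff to see which procedures nobody present could perform.

```python
import random

# Hypothetical skills matrix: procedure -> staff trained to perform it.
# All names and procedures are invented for illustration.
TRAINED = {
    "restore from backup media": {"alice", "bob"},
    "fail over to recovery site": {"alice"},
    "bring up VPN access": {"bob", "carol", "dave"},
}


def single_points_of_failure(trained):
    """Return procedures that only one person knows how to perform."""
    return sorted(p for p, people in trained.items() if len(people) == 1)


def uncovered_after_removal(trained, staff, fraction=0.5, rng=None):
    """Remove a random fraction of staff; return (absent, uncovered)."""
    if rng is None:
        rng = random.Random()
    roster = sorted(staff)
    absent = set(rng.sample(roster, int(len(roster) * fraction)))
    # A procedure is uncovered when everyone trained on it is absent.
    uncovered = sorted(p for p, people in trained.items()
                       if not (people - absent))
    return absent, uncovered


if __name__ == "__main__":
    staff = set().union(*TRAINED.values())
    print("Known by one person only:", single_points_of_failure(TRAINED))
    absent, uncovered = uncovered_after_removal(TRAINED, staff)
    print("Absent this round:", sorted(absent))
    print("Procedures nobody present can perform:", uncovered)
```

Running the removal step several times gives a rough feel for how often a random half of the team leaves a procedure uncovered, which is a reasonable prompt for where to cross-train next.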
If people are the engine that makes your plan work, communication is the oil that makes the engine work smoothly. Keep in mind that in a large-scale disaster, the ability to communicate with employees could be reduced or eliminated. Regular training will help people know what they need to do without being told after a disaster occurs. Don’t count on cell phone communication to replace land-line calls; cell reception is the first thing to go during a disaster, especially one that involves high winds. Even without bad weather, high calling volume can clog and crash networks. Have an 800 number that employees can call to get key information, or use a Web site, ideally not stored at your primary site, for communications. You can also use text messages, which sometimes go through when voice calls don’t. Even if text messages aren't transmitted right away, they're usually queued to go out as soon as the network comes up.
When it comes to testing your disaster recovery plan, try to think unconventionally. Think like a disaster rather than an IT manager. Instead of trying to make your plan work during your tests, try to break it. Hold unscheduled disaster recovery tests to find out who is really ready. Take out a key element (e.g., supply chain, Internet) to find out how it affects the plan. Most disaster recovery plans are staid, dusty documents on a shelf. See if any of the rank and file have actually read yours and know what to do when the time comes. A real disaster isn’t going to be forgiving, so you shouldn't be either.
Spend the most time and money on the most likely disasters. These are things not generally classified as disasters, such as extended power outages, floods from plumbing leaks, theft or vandalism of equipment, and so forth. These things can take your systems out as fast and hard as any category 5 storm.
Here are some additional ways to disaster-proof your disaster recovery plans. Use automation wherever possible to speed the disaster recovery process. Manual call trees are better than nothing, but you can use software to automatically call everyone and deliver a prerecorded message. Most modern phone systems support this, and it lessens dependence on the human element. Put plans online as well as on shelves to make key information available to the people who need it no matter where they are. Software is available to do this, manage multiple plans, and allow for changes and updates. Just make sure the software and systems are secure because your disaster recovery plan could contain sensitive information. Allow for multiple communication channels (e.g., regular phone calls, Web, cell phone texting). And finally, test, test, and test some more! Doing all these things won’t guarantee you can recover quickly in a disaster, but it will greatly increase your odds of success.
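As a rough illustration of automated notification over multiple channels, here is a minimal sketch. It does not use any real phone-system or SMS API; the channel functions are stand-ins you would replace with whatever your own phone system or messaging provider offers, and the class and method names are hypothetical. The sketch tries each channel in order and queues the message for a later retry if none gets through.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A channel takes (contact, message) and returns True if delivery succeeded.
Channel = Callable[[str, str], bool]


@dataclass
class Notifier:
    channels: Dict[str, Channel]  # tried in insertion order
    pending: List[Tuple[str, str]] = field(default_factory=list)

    def notify(self, contact: str, message: str) -> str:
        """Try each channel in turn; queue the message if all fail."""
        for name, send in self.channels.items():
            try:
                if send(contact, message):
                    return name
            except Exception:
                continue  # treat an error the same as a failed delivery
        self.pending.append((contact, message))
        return "queued"

    def retry_pending(self) -> int:
        """Re-send queued messages; return how many got through."""
        queued, self.pending = self.pending, []
        return sum(self.notify(c, m) != "queued" for c, m in queued)


if __name__ == "__main__":
    def voice_call(contact, message):    # stand-in: voice network saturated
        return False

    def text_message(contact, message):  # stand-in: texts still go through
        print(f"SMS to {contact}: {message}")
        return True

    notifier = Notifier({"voice": voice_call, "sms": text_message})
    for person in ["ops-lead", "backup-admin"]:
        channel = notifier.notify(person, "Recovery plan activated.")
        print(person, "reached via", channel)
```

Messages that cannot be delivered stay in `pending`, so calling `retry_pending()` on a timer approximates the way queued text messages go out once the network comes back.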