Microsoft's guide to high reliability

In early June, Microsoft released a 105-page white paper, "Microsoft Windows NT High Availability Operations Guide: Implementing Systems for Reliability and Availability" (http://www.microsoft.com/ntserver/nts/deployment/planguide/highavail.asp). Microsoft based this document on the company's examination of nine customers who keep their IT enterprises running most of the time. The study's purpose was to determine which practices the companies use to make their systems highly reliable (i.e., to achieve n nines of reliability—99.9 percent, 99.99 percent, 99.999 percent, and so on) and to disseminate that information so that others can implement similar measures and build more reliable networks.
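The "nines" notation translates directly into an annual downtime budget, which is worth working out before you commit to one in an SLA. Here's a minimal sketch of that arithmetic (the function name is my own, not something from the white paper):

```python
# Hypothetical helper (not from the white paper): translate "n nines"
# of availability into the maximum downtime allowed per year.

def downtime_per_year(nines: int) -> float:
    """Return the maximum downtime, in minutes per year, for n nines."""
    availability = 1 - 10 ** (-nines)   # e.g., 3 nines -> 0.999
    minutes_per_year = 365 * 24 * 60    # 525,600 minutes
    return minutes_per_year * (1 - availability)

for n in (3, 4, 5):
    pct = 100 * (1 - 10 ** (-n))
    print(f"{n} nines ({pct:.3f}%): "
          f"{downtime_per_year(n):.1f} minutes of downtime per year")
```

Three nines leaves you almost nine hours of downtime a year; five nines leaves barely five minutes—a useful reality check when someone asks for "just one more nine."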

Microsoft's white paper discusses how customers use operational procedures to achieve high reliability. Microsoft used the following problem-oriented classifications to categorize each customer's operational procedures: Planning and Design, Operations, Monitoring and Analysis, Help Desk, Recovery, and Root Cause Analysis.

The Planning and Design sections suggest detailed plans for reliability (how many nines do you want?), as well as service level agreements (SLAs) and standardized hardware and software. Interestingly, the document doesn't specify that you should use Microsoft software.

In the Operations sections, Microsoft covers basic principles. For example, you need to automate as many tasks as possible, and you need to pay attention to physical security—which is nonexistent at many sites.

The document's Monitoring and Analysis sections advise you to use a monitoring tool and to regularly review the data that the tool produces. This advice is sound because before-and-after data is important. Unfortunately, most clients who have asked me to solve an NT problem didn't turn on Performance Monitor until after the problem appeared.

The Help Desk sections describe commonsense practices for running Help desk operations. For example, you need to train Help desk staff adequately, create well-defined escalation procedures, and ensure that the staff can access problem servers quickly.

In the Recovery sections, Microsoft offers more elementary but necessary advice. The company recommends that you maintain a supply of spare parts and replacement servers that are ready to install, use a cloning process to bring new systems online quickly, and establish and record recovery procedures. This last point is important because you need hard-copy access to your recovery procedures: I had a client who put the only copy of his disaster-recovery plan on his intranet in HTML, with no printed copies available. Although Microsoft doesn't suggest that you test your disaster-recovery plan, I recommend that you do. Untested recovery plans are little more than highly organized prayers.

The white paper's Root Cause Analysis sections suggest that your network's long-term health will be rosiest if you pay attention to root causes instead of merely treating symptoms. Although the document notes that all nine customers think root-cause analysis is important, only two of them perform this analysis. I find this information comforting—I'm pleased to know that the folks who run those annoyingly reliable networks are also human.

The Microsoft document doesn't offer many new ideas, but it's an interesting refresher and might be a decent learning tool for a novice network administrator. Oddly, although the paper's audience is NT administrators, four of the nine companies in Microsoft's case studies don't use NT for daily operations. The white paper explains that "because many basic best practices do not depend upon a specific operating system—for example, you can use similar backup procedures for both a UNIX-based server and a Windows NT-based server—Microsoft did not exclusively confine this study to environments that used only Windows NT-based systems." I'm disheartened that Microsoft didn't—or couldn't—find nine companies that know how to make NT highly reliable.