For the onetime Banyan Systems street fighter, Jim Allchin, the bugs don't fall far from the vine. Allchin, formerly the Microsoft senior vice president in charge of Windows 2000 (Win2K) development, and now Microsoft vice president of the Platforms Group, is a man on a mission: to make Win2K the most reliable OS available. (For information about Allchin's recent promotion and Microsoft's reorganization, see the sidebar "We All Change Places," page 44.) So, after more than 10 years at Microsoft working on Windows NT's early networking capabilities (and his earlier acceptance of good-enough development for OS products), Allchin's penchant for excellence and his determination might bear fruit.

In a speech at Comdex/Fall '99, Allchin said, "You're looking at the complaints department for Windows. I get letters about Windows 95, Windows 98, and NT. And I spend a lot of time going through those letters and feeling bad because, although I do get a lot of nice letters, most of the letters that hit my desk are the ones in which someone has had a bad experience. Two years ago, Microsoft set out on a path to figure out what was real about NT's reliability and compatibility. Because, at the same time that I was getting these letters, Microsoft knew that Dell.com was running NT, the Nasdaq Stock Market was running NT, and the Chicago Board of Trade was running NT. And these customers were incredibly happy with the reliability."

The quest to make Win2K reliable became a $160 million development effort that eventually involved 5000 hours of work. Allchin's team grabbed the logs of 5000 servers, collected terabytes of data, and began the process of analyzing the results to determine what factors were forcing NT to crash or require a reboot. Allchin continued, "In addition to the 5000 servers that we visited, we went to our OEMs [Compaq, Hewlett-Packard, and IBM], which handle support calls, primarily on the server, but also on the client; and we got their support data for the calls they were having." With the data in place, the team added many improvements to NT with Service Pack 4 (SP4) and SP5, and the team began work to fix the code for Win2K.

What did Allchin's team find in the data? First, administrators performed 65 percent of reboots because they thought restarting the machine in the middle of the night made the machine work better; the administrators planned these reboots. NT required the remaining 35 percent of reboots after the administrator added a service pack or option pack or changed the system configuration in some way, or when the system failed for no apparent reason. Allchin's team concentrated on the problems that caused this latter category of reboots.

In many instances, administrators used reboots to kill an application, such as an SAP solution, that spawned multiple processes. In 7 percent of all cases, configuration changes, such as a change to the TCP/IP stack, required the reboot. Administrators also rebooted in response to a reboot dialog box after installing a new application, although this reboot often wasn't necessary. In many cases, the data revealed that a new application had replaced crucial DLL files and that the replacement crashed the system. The blue screen of death resulted, causing 14 percent of all reboots. Device drivers and antivirus products were the worst offenders. A memory, disk, or processor failure caused a blue screen to appear 13 percent of the time.

Allchin's team had a multipronged plan to move forward. The development team set out to deliver better device-driver development tools because poor drivers were a core problem. This effort resulted in the creation of a driver verifier tool for testing. Microsoft added the Kill Process Tree utility so that administrators no longer need to reboot just to kill an application. And Win2K now makes the arbitrary replacement of DLLs much less likely. To combat security holes, Microsoft built a special team and hired outside analysts to attack the code. Then, the team built a stress test with more than 1500 machines and ran the equivalent of 3 months' runtime on each day's build. The team also tested machines against a 65 million-entry Active Directory (AD) and 2.3 billion DNS look-ups.
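
To see why killing a process tree removes the need for a reboot, consider an application that spawns several child processes, which in turn spawn their own. The following Python fragment is only a sketch of the general idea, not Microsoft's implementation: the function name `process_tree` and the sample process IDs are hypothetical, and the code works on a plain table of parent-child relationships rather than on a live system. It returns the processes in an order that lets a tool terminate children before their parents.

```python
from collections import defaultdict


def process_tree(parent_of, root_pid):
    """Return root_pid and all its descendants, descendants first.

    parent_of maps each process ID to its parent's process ID
    (a hypothetical snapshot; a real tool would query the OS).
    """
    # Invert the child -> parent table into parent -> [children].
    children = defaultdict(list)
    for pid, ppid in parent_of.items():
        children[ppid].append(pid)

    # Depth-first walk from the root; a parent is always visited
    # before any of its descendants.
    order = []
    seen = set()
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        if pid in seen:  # guard against cycles in a stale snapshot
            continue
        seen.add(pid)
        order.append(pid)
        stack.extend(children.get(pid, []))

    # Reverse so that every descendant comes before its ancestors.
    order.reverse()
    return order


# Hypothetical snapshot: process 100 spawned 101 and 102, and 101
# spawned 103. Process 200 is unrelated and must be left alone.
snapshot = {100: 1, 101: 100, 102: 100, 103: 101, 200: 1}
print(process_tree(snapshot, 100))
```

An administrator who kills only process 100 in this example leaves 101, 102, and 103 running, which is the situation that previously led to a reboot.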

Microsoft recognized that the brute-force bug-fixing approach wasn't working. So, Microsoft purchased another company that had a tool that analyzes source code for problems such as memory leaks and improper variables. Allchin said his team ran this tool on NT sources, then fixed thousands of problems that the tool discovered. A systems development lab worked with software, such as antivirus software and the Novell NetWare redirector, to improve stability. The hardware lab qualified 9000 components for Win2K.

Win2K also went through a massive beta program in which numerous customers provided the team with feedback. The goal was to ensure the Win2K compatibility of the top 450 client-side and 75 server-side applications and to qualify 5000 different devices and 4000 different computers.

As a result of this work, the Win2K development team reduced the number of situations that require a reboot from around 75 to 5 known cases. During installation, you'll notice that Windows 2000 Server (Win2K Server) requires 2 reboots, whereas NT Server 4.0 often required 10 reboots. Microsoft also found that publishing best practices for NT Server 4.0 helped decrease downtime by a factor of 5 for many customers.

Allchin said, "We feel very good about the quality [of Win2K]. I hope all of you have had an opportunity to play with it. If not, please, get some experience with it. The code is very close. I think you'll have a better experience than with whatever you're running today. You'll have a better experience, I'm sure." (For the full transcript of Allchin's presentation, see the Web site at http://www.microsoft.com/presspass/exec/jim/11-15jimallcomdex.htm.)

And so, for the father of Win2K, this year might be a pretty good one. Allchin might receive fewer complaints and realize the deep satisfaction of completing one of the most successful OS launches in computer history. This achievement would be something to feel good about.