Ensure device driver reliability

Last month, I wrote about a company that switched from Windows NT to Linux because of NT Server reliability problems. To provide some balance, this month I address a major cause of reliability problems—faulty third-party device drivers.

Early users of NT constantly had to check the NT Hardware Compatibility List (HCL) to ensure that NT supported their hardware devices. They had to assume that a device would not work with NT. Today, NT is a commodity OS, and people assume that all hardware works with NT. If it doesn't, the fault is probably not NT's. In response to my May column, Mark Russinovich, author of our NT Internals column, wrote me:

NT supports thousands of hardware devices. Not only does Microsoft not write the drivers for NT hardware, but the developers writing drivers for NT often have no experience with NT drivers or internals. In addition, hardware companies are on accelerated Internet time, trying to get new devices to market before their competitors do. As a result, hardware companies put both developer training and driver testing on the fast track. Many devices therefore ship with drivers that developers have not properly written or adequately tested. Microsoft endorses an HCL, which lists drivers that undergo certification by a testing lab, but this lab can't possibly test every driver combination. In addition, about 25 million NT systems are online, so even the most obscure bug in a vendor's driver will show up on a regular basis.

Contrast that situation with the Linux situation: Either Linux OS developers or other Linux gurus write drivers for this OS because they love learning its internals and contributing to Linux's acceptance. They have no deadlines for their drivers to be ready for shipment, and the community supports only a limited number of devices because of the limited pool of Linux hackers. Obviously, a disparity in quality will surface between the typical Linux device driver and the typical NT driver. Also, compared to NT, Linux has few device and software combinations, so latent Linux bugs have a smaller chance of surfacing.

NT's stability problems are therefore a byproduct of its widespread acceptance, not of fundamental flaws in NT. If Linux catches on to the extent NT has, Linux will certainly suffer the same trials. I grow increasingly frustrated with the media and the Linux community for ignoring common sense when they bash NT with the reliability cheap shot.

Enterprises are using NT Server for mission-critical applications from messaging to e-commerce. Major e-commerce sites such as Barnes & Noble, Walt Disney, eBay, United Airlines, Delta Air Lines, Dell, Compaq, Gateway, Intel, Microsoft, JCPenney, and CarPoint use NT Server to run their Web sites. Recognizing this trend, enterprise-system vendors—IBM, Unisys, Compaq, Data General, and HP—have committed to provide 99.9 percent uptime on NT Server 4.0. Each vendor is putting its reputation on the line if the customer will pay for the hardware, clustering, systems management software, and services necessary to guarantee such uptime.

How much reliability are customers willing to pay for? Stratus and Marathon Technologies have taken this challenge to an even higher level, providing redundancy for all network components. By guaranteeing up to 99.99 percent uptime, these vendors are reducing unscheduled downtime from almost 9 hours per year (the allowance at 99.9 percent) to less than 1 hour per year. Going to 99.999 percent uptime leaves only about 5 minutes of unscheduled downtime per year. If customers will spend the money, vendors will spend the resources necessary to provide the reliability. The right combination of hardware, software, and service can make an NT system as reliable as you're willing to pay for.
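As a quick check on what each additional nine buys, the downtime that an uptime percentage allows is simple arithmetic. The sketch below assumes a 365-day year (8,760 hours) and counts all downtime against the guarantee; the function name is illustrative and does not come from any vendor's tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours_per_year(uptime_percent):
    """Return the hours of downtime per year that an uptime percentage allows."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Print the yearly downtime allowance for two through five nines.
    for uptime in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(uptime)
        print(f"{uptime}% uptime allows {hours:.2f} hours "
              f"({hours * 60:.1f} minutes) of downtime per year")
```

Under these assumptions, 99.9 percent uptime allows about 8.76 hours of downtime per year, 99.99 percent allows about 53 minutes, and 99.999 percent allows a little over 5 minutes.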

So if drivers are the root of NT's reliability problems, how will Windows 2000 (Win2K) ensure driver reliability? Win2K introduces a driver validation tool, Driver Verifier. A developer will use the verifier to check that a driver, which runs as a highly privileged component of the OS, adheres to certain rules. The largest share of driver crashes results from drivers attempting to access pageable memory when the CPU is at an elevated interrupt priority. Such bugs are usually extremely difficult to find in testing, because a crash won't result if the pageable memory that the driver accesses happens to be mapped into the system's physical memory at the time. Developers don't often use memory stress testers during driver testing, and even when they do, such testers don't necessarily force all the pageable driver code out of physical memory.

Driver Verifier will force all pageable system memory out of physical memory every time the driver being verified raises the interrupt priority. Thus, 100 percent of the time, Win2K will immediately catch an access to pageable data that violates the interrupt priority level rule. Such Driver Verifier features will prevent bad drivers from leaving the vendor's door. Microsoft might establish a testing lab that would exercise drivers via the verifier as a prerequisite to logo certification.
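To see why forced paging turns an intermittent bug into a reliable one, consider the toy model below. It is not kernel code and does not use any real Driver Verifier interface; the class, the function names, and the 99 percent chance that a page happens to be resident are all assumptions made for illustration. The model encodes only the rule described above: touching pageable memory at elevated interrupt priority crashes if the page is not in physical memory.

```python
import random


class PagedOutAccess(Exception):
    """Stands in for the crash caused by touching non-resident memory at elevated priority."""


class ToySystem:
    """Hypothetical model of pageable memory and interrupt priority (not a real API)."""

    def __init__(self, pages, verifier_enabled=False, resident_chance=0.99, seed=0):
        self.rng = random.Random(seed)
        self.verifier_enabled = verifier_enabled
        self.resident_chance = resident_chance  # assumed odds that a page is in RAM
        self.resident = {page: True for page in pages}
        self.elevated = False

    def raise_priority(self):
        self.elevated = True
        for page in list(self.resident):
            if self.verifier_enabled:
                # The verifier behavior the column describes: evict everything pageable.
                self.resident[page] = False
            else:
                # Without the verifier, a page is usually, but not always, resident.
                self.resident[page] = self.rng.random() < self.resident_chance

    def lower_priority(self):
        self.elevated = False

    def access(self, page):
        if self.elevated and not self.resident[page]:
            raise PagedOutAccess(page)  # a page fault cannot be serviced here
        self.resident[page] = True  # at normal priority the page is simply faulted in


def buggy_driver_run(system):
    """A driver with the bug: it touches pageable data while priority is raised."""
    system.raise_priority()
    try:
        system.access("driver_buffer")
    finally:
        system.lower_priority()


def count_crashes(verifier_enabled, runs=1000, resident_chance=0.99):
    """Run the buggy driver repeatedly and count how many runs crash."""
    system = ToySystem(["driver_buffer"], verifier_enabled, resident_chance)
    crashes = 0
    for _ in range(runs):
        try:
            buggy_driver_run(system)
        except PagedOutAccess:
            crashes += 1
    return crashes


if __name__ == "__main__":
    print("crashes without verifier:", count_crashes(False))
    print("crashes with verifier:   ", count_crashes(True))
```

In this model, the buggy driver crashes on every run when the verifier is enabled, but only on the occasional run when residency is left to chance, which is why ordinary testing can miss the bug.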

If hardware vendors apply Driver Verifier universally, it will profoundly affect the area that is most often the root of NT's reputation for unreliability: the device driver. We'll be checking Win2K for reliability. If Win2K can shake NT's reliability stigma, Microsoft wins. Otherwise, in the uptime arena, Microsoft will be chasing zeros instead of chasing nines.