Understand the clues the blue screen provides
The color blue has become synonymous with disaster in the Windows NT world.
Although NT is more reliable and stable than its cousins, Windows 3.x and
Windows 95, it is still subject to the frailties of third-party software,
add-on peripherals and their device drivers, and Microsoft's own bugs. Almost
everyone who has used NT for any length of time has seen a blue screen (also
known as the blue screen of death). Screen 1, page 58, displays a typical
example. NT stops processing and paints one of these displays whenever it
encounters a situation in which it cannot continue, or in which continuing
might lead to data corruption.
Most users and many developers don't know what the information on the screen
means. If you're lucky, simply resetting the computer will get you
on your way. If you're unlucky, you'll repeatedly get a blue screen every time
you start NT or perform a particular operation (e.g., inserting a new floppy).
Even if you've successfully moved past a blue screen with a reboot,
understanding the clues it provides can help you avoid future blue screens or
give you a hint about what driver or piece of hardware is causing problems.
This month, I'll talk about how NT generates blue screens, what leads to
their appearance, how to interpret the cryptic data NT lists on them, and how to
go about troubleshooting them. I'll assume that NT device drivers are not your
forte and that debugging a blue screen with dump analysis tools or a
kernel-mode debugger is not an option for you. In the process, I'll
describe the inner workings of NT's kernel mode. (For a different angle on blue
screens, see Mark Edmead, "The Blue Screen of Death," June 1997.)
NT Architecture Basics
To understand what leads to a blue screen, you first need to understand NT's
basic architecture. NT executes in two modes, user mode and kernel mode, as
shown in Figure 1, page 59. Kernel mode is a highly privileged processor mode,
with direct access to all hardware and memory; user mode is a less privileged
mode, with no direct access to hardware and restricted access to memory.
User mode is the mode in which applications and operating system
environment subsystems execute. The operating system environments that NT
supplies include POSIX, OS/2, Win16, DOS, and Win32. Applications are clients of
exactly one environment subsystem and use only the APIs that subsystem exports.
Thus, Win32 programs are clients of the Win32 subsystem and use only the Win32
API.
The subsystems use basic NT services that the NT Executive and the
Microkernel provide. These services run in kernel mode. The Executive includes
core operating system components: the Process Manager, Virtual Memory Manager,
I/O Manager, Local Procedure Call (LPC) Facility, Object Manager, and Security
Reference Monitor. The Executive is generally portable across processor
architectures (e.g., Alpha, x86), and it relies on the Microkernel for
processor-specific functions such as context-switching (scheduling) and
synchronization primitives.
Beneath the Microkernel resides the Hardware Abstraction Layer (HAL),
through which the Executive subsystems and the Microkernel interface with the
processor. Microsoft ships different HALs for different processors and processor
boards.
Device drivers are modules that interface NT and applications to specific
hardware devices. A large number of device drivers for disk drives, video cards,
modems, network cards, and input devices ship with NT. However, hardware vendors
can include custom device drivers with their hardware, and NT dynamically adds
the drivers to its kernel-mode environment.
User Mode vs. Kernel Mode
What differentiates user mode from kernel mode is the privilege level. A
program executing in user mode runs in a sandbox (not unlike a Java virtual
machine's sandbox) that the NT Executive and the program's operating system
environment create for the program. The sandbox enforces restrictions as to what
the program can do. One type of restriction relates to what parts of the
computer's memory the program can reference and in what ways.
Figure 2 shows the virtual memory map that NT creates for applications. With
32-bit addresses, addressable memory totals 4GB, and NT divides that space
evenly between the memory assigned to a program and the memory that the
kernel-mode portion of NT uses.
The lower 2GB mapping changes, depending on which program is currently
running. For example, if Microsoft Word is running, NT places Word's address
mapping in the lower 2GB; if Netscape Navigator runs next, its mapping replaces
Word's mapping.
The upper 2GB mapping always remains that of the Executive, Microkernel,
device drivers, and HAL. Thus, the split between user mode and kernel mode also
shows up in NT's address space mapping. (In NT Server 4.0, Enterprise Edition,
you can adjust the address split between user mode and kernel mode so that
applications have 3GB of address space, with 1GB left for NT's Executive,
drivers, and HAL. You will see this split only when NT is running on systems
with several gigabytes of physical memory.)
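To make the arithmetic of the split concrete, here is a minimal sketch in Python. The function names kernel_base and region are hypothetical and exist only for illustration; the sketch assumes a 32-bit address space and models only the 2GB/2GB and 3GB/1GB arrangements described above.

```python
GB = 1 << 30                 # bytes in one gigabyte
ADDRESS_SPACE = 4 * GB       # size of a 32-bit virtual address space


def kernel_base(user_gb=2):
    """Return the first kernel-mode address when user mode gets user_gb gigabytes."""
    if user_gb not in (2, 3):
        # Assumption: only the two splits mentioned in the article are modeled.
        raise ValueError("this sketch models only the 2GB/2GB and 3GB/1GB splits")
    return user_gb * GB


def region(address, user_gb=2):
    """Classify a virtual address as part of the 'user' or 'kernel' region."""
    if not 0 <= address < ADDRESS_SPACE:
        raise ValueError("address is outside the 4GB space")
    return "user" if address < kernel_base(user_gb) else "kernel"
```

With the default split, region(0x7FFFFFFF) falls in the user half and region(0x80000000) in the kernel half. Passing user_gb=3 moves the boundary up to 0xC0000000.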
The primary memory restriction placed on user-mode programs is that they
cannot access any of the kernel-mode memory. User-mode programs also cannot
access invalid portions of their mapping (i.e., portions not filled with data or
code from the program). This arrangement contrasts with the kernel-mode portions
of NT, which have free rein over the entire address map. For example, NT does
not stop a device driver from writing data into Word's address map, but NT
prevents Word from writing over the device driver's image.
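The rules in this paragraph can be modeled in a few lines of Python. This is a simplified sketch, not how NT implements memory protection (the processor's paging hardware does the real work). The name can_access, the mode strings, and the example address ranges are all hypothetical.

```python
KERNEL_BASE = 0x80000000   # start of the upper 2GB in the default split


def can_access(mode, address, valid_ranges):
    """Return True if code running in `mode` may touch `address`.

    valid_ranges lists (start, end) pairs, end exclusive, for the parts of
    the lower 2GB that hold the current program's code or data.
    """
    if mode == "kernel":
        # Models the article's point that kernel-mode code has free rein
        # over the entire address map.
        return True
    if mode != "user":
        raise ValueError("mode must be 'user' or 'kernel'")
    if address >= KERNEL_BASE:
        return False           # user mode can never touch kernel-mode memory
    # User mode may touch only the valid portions of its own mapping.
    return any(start <= address < end for start, end in valid_ranges)
```

If Word's mapping were [(0x00400000, 0x00500000)], a user-mode access to 0x00401000 would be allowed. A user-mode access to 0x80001000 or to the unmapped address 0x00600000 would be refused. A kernel-mode access to any of these addresses would go through, which is why a misbehaving driver can damage an application's memory.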
The user-mode sandbox enforces another restriction that limits a program's
ability to directly access hardware devices such as disks, the video screen, and
the printer. Programs must typically go through their operating system
environment (e.g., Win32) to read data from or write data to a peripheral. The
operating system environment then usually calls on the services of the Executive
in kernel mode, effectively forwarding the request. The Executive finally
completes the request, sometimes with the aid of a device driver, but almost
always with the use of functions in the HAL that interface with the computer's
hardware. NT implements the transition between user mode and kernel mode as a
system call gateway, through which the passage of data is precisely
controlled.
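The gateway idea can be illustrated with a toy dispatcher in Python. Everything here is an assumption made for illustration: the Gateway class, the service number 1, and the read_device function are hypothetical and do not correspond to NT's actual system services or numbering. The sketch shows only the principle that a caller names a service by number and the gateway validates the request before any privileged code runs.

```python
class Gateway:
    """Toy model of a controlled entry point into kernel-mode services."""

    def __init__(self):
        # Maps a service number to (handler, expected argument count).
        self._services = {}

    def register(self, number, handler, arg_count):
        """Install a service; in this model only the 'kernel' side does this."""
        self._services[number] = (handler, arg_count)

    def call(self, number, *args):
        """The only path from 'user mode' to a service; rejects bad requests."""
        if number not in self._services:
            raise PermissionError(f"unknown service number {number}")
        handler, arg_count = self._services[number]
        if len(args) != arg_count:
            raise ValueError("wrong number of arguments for this service")
        return handler(*args)


def read_device(device, length):
    """Hypothetical Executive service standing in for a real device read."""
    return f"read {length} bytes from {device}"


gateway = Gateway()
gateway.register(1, read_device, arg_count=2)
```

A caller would write gateway.call(1, "floppy", 512) rather than invoking read_device directly. A request for an unregistered service number, or one with the wrong number of arguments, is rejected before the handler runs.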