Analyzing blue screens can save you repeated crashes or hours of reinstallation time

Windows 2000 indisputably brings a previously unknown level of reliability to Windows. Microsoft's rewrite of the core OS code to handle unusual situations, the company's enormous testing effort, and the new Driver Verifier tool mean that blue screens on Win2K systems are rare. However, many corporations still rely heavily on Windows NT 4.0. And although device drivers that ship with Win2K undergo comprehensive stress and correctness validation before receiving the stamp of approval from Microsoft's Windows Hardware Quality Labs (WHQL), undetected bugs can still surface. Further, if you install applications that contain nonhardware drivers, such as virus scanners, quota-management utilities, or encryption packages, your Win2K system might have drivers that haven't been through WHQL testing, even if you set the system's driver-signing policy to otherwise prevent untested drivers. Thus, although blue screens will be fewer, you might still see one from time to time, and having the information necessary to analyze them can mean the difference between spending a few minutes to uninstall one application and spending a few hours to perform a full OS reinstall.

Many systems administrators forgo exploring Win2K's and NT 4.0's crash dump options in the belief that using them is too difficult. Although Microsoft's debugger documentation has improved in the past year, it's still oriented toward device-driver developers. But even if just one crash dump in five contains information that proves useful, you'll find it worthwhile to learn at least a little about crash dump analysis.

This primer on crash dump analysis will ease the learning curve. I start with the basics of configuring a system to save a memory dump when the system crashes, describe where you can find the tools you need to examine a crash dump, then give you tips on gleaning information from a dump. Along the way, I introduce you to a continually evolving automated dump analysis tool, the Kernel Memory Space Analyzer (Kanalyze).

Enabling Crash Dumps
The first step in crash dump analysis is ensuring that when a system crashes, it produces a memory dump. You access the NT 4.0 crash dump options through the Control Panel System applet's Startup/Shutdown tab. Figure 1 shows the Startup/Shutdown page, in which you select the Write debugging information to check box and enter the name of the file you want to write the dump to. Other options on the page direct the system's behavior in response to a crash and include writing an event to the System log, sending an administrative alert, and automatically rebooting.

Because NT 4.0 crash dump files include a copy of the contents of a computer's physical memory, you need to ensure that your system has adequate disk space to save and store a dump. First, configure a paging file on the boot volume (the volume that contains the \winnt directory). The paging file needs to be large enough to store the system's memory plus 1MB. The volume that stores the dump file (which by default is also the boot volume) must have slightly more free space than the computer has physical memory.
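The sizing rules above can be expressed as a quick sanity check. The following Python sketch is illustrative only: the function name and its parameters are hypothetical, and because the article doesn't say how much "slightly more" free space is, the check only requires that free space exceed physical memory.

```python
MB = 1024 * 1024

def check_dump_space(physical_memory, paging_file_size, dump_volume_free):
    """Return a list of problems that would prevent a full NT 4.0 dump.

    All sizes are in bytes. An empty list means both rules are met.
    """
    problems = []
    # Rule 1: the boot-volume paging file must hold memory plus 1MB.
    required_paging = physical_memory + 1 * MB
    if paging_file_size < required_paging:
        problems.append("paging file is smaller than physical memory plus 1MB")
    # Rule 2: the dump volume needs more free space than physical memory.
    # Assumption: "slightly more" is treated as "strictly more".
    if dump_volume_free <= physical_memory:
        problems.append("dump volume has too little free space")
    return problems

# A 128MB system with a 129MB paging file and 200MB free passes both checks.
print(check_dump_space(128 * MB, 129 * MB, 200 * MB))
```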

These requirements derive from the way the kernel implements its crash dump facility. During the boot process, the OS checks the registry crash dump options in the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl subkey. If one or more options are enabled, the system generates a map of the disk blocks that the boot volume's paging file occupies and saves the map in memory. The system also determines which disk device driver manages the boot volume and calculates a checksum of the driver's in-memory image and the data structures that must be intact for the driver to perform disk I/O. When a crash occurs, the kernel verifies the integrity of the paging file map, the disk driver, and the disk-driver control structures. If these structures are intact, the kernel invokes special disk-driver I/O functions that exist specifically for dumping memory when the system crashes. These I/O functions are self-contained and don't rely on any kernel services, because crash dump-related code must make no assumptions about which parts of the kernel or device drivers the situation that led to the crash might have compromised. The kernel writes the contents of memory directly to the disk sectors that the paging file's sector map lists, so that the kernel can avoid relying on file-system drivers. The kernel verifies the integrity of every component involved in the dump process before proceeding because writing directly to sectors on the disk could shred a disk's data if those sectors lie outside the paging file. A paging file must be 1MB larger than physical memory because when the kernel writes the dump, the kernel also writes a header that contains a crash dump signature and the values of several key kernel variables. Although the header is much smaller than 1MB, the system sizes a paging file by megabytes.
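The checksum-then-verify idea described above can be illustrated with a small Python sketch. It is a model of the logic only, not the kernel's implementation: the function names are hypothetical, the byte strings stand in for the paging file map, the driver image, and the driver's control structures, and CRC-32 is an assumed stand-in because the article doesn't name the checksum algorithm.

```python
import zlib

def snapshot_checksum(regions):
    """Compute one running CRC-32 over a sequence of bytes objects."""
    crc = 0
    for region in regions:
        crc = zlib.crc32(region, crc)
    return crc

def safe_to_write_dump(boot_checksum, regions_at_crash):
    """Allow the raw sector writes only if nothing changed since boot."""
    return snapshot_checksum(regions_at_crash) == boot_checksum

# At boot: record a checksum of everything the dump path depends on.
regions = [b"paging file map", b"disk driver image", b"driver control structures"]
boot_checksum = snapshot_checksum(regions)

# At crash time: a corrupted driver image fails the check, so no dump is written.
corrupted = [regions[0], b"disk driver imagX", regions[2]]
print(safe_to_write_dump(boot_checksum, regions))    # True
print(safe_to_write_dump(boot_checksum, corrupted))  # False
```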

When a system boots, the Session Manager process (\winnt\system32\smss.exe) initializes the system's paging files by using the native NtCreatePagingFile function to create each file. NtCreatePagingFile determines whether the paging file it's initializing exists, and if so, whether the file has a dump header. When a dump header is present, NtCreatePagingFile returns a special code to Session Manager. As a result, when Session Manager executes the logon manager (\winnt\system32\winlogon.exe) to start the Winlogon process, Session Manager notifies Winlogon that a crash dump exists. Winlogon then executes the SaveDump application (\winnt\system32\savedump.exe), which examines the dump header to decide what crash response actions to perform. If the header indicates that a memory dump is present, SaveDump copies the contents of the paging file to the crash dump file you specified in the Startup/Shutdown dialog. While SaveDump writes the dump file, the system doesn't use the part of the paging file that contains the crash dump. During that time, the amount of virtual memory available for the system and applications is reduced by the size of the dump, and dialog-box pop-ups might indicate that the system is low on virtual memory. After SaveDump runs, it informs the memory manager that it has finished saving the dump, and the memory manager makes available for general use the portion of the paging file that contains the dump. After saving a dump file, SaveDump performs other specified crash options, such as sending an administrative alert or writing an event to the System log.

The copy of the system's memory contents at the time of a crash often contains information that isn't useful for analyzing a crash dump. Because a crash results from a problem during kernel-mode execution, user-mode application data isn't generally relevant to crash diagnosis. Kernel-mode memory includes all OS and driver data structures, as well as executable code for device drivers and the kernel, so Win2K introduces a crash dump option that has the system save only kernel-mode memory. This option can significantly reduce the size of a crash dump file, making the file quicker to generate and copy and more practical to store and exchange with support personnel. A typical system with 128MB of memory might have only a 40MB kernel-memory dump. Figure 2 shows the Win2K Startup and Recovery crash-option dialog box, which you access by clicking Startup and Recovery on the System applet's Advanced tab.

Win2K also includes a minidump option. Minidumps, which the Startup and Recovery dialog box's Write Debugging Information drop-down list calls Small Memory Dumps, are 64KB crash dumps that store a minimal set of potentially useful information, such as the blue screen crash code, a list of loaded drivers, information about the process and thread being executed at crash time, and a snapshot of the crash point's stack (i.e., a history of recently called functions). The minidump data, which is essentially the same information that NT 4.0 displays on blue screens, sometimes contains sufficient information to guess at the cause of a crash. Minidumps are small and don't overwrite previous minidumps. A minidump's name has the form minimmddyy-nn.dmp, where mm, dd, and yy represent the month, day, and year, respectively, and nn is a unique number that distinguishes minidumps generated on the same day. By default, Win2K saves minidump files in the %systemroot%\minidump directory. You analyze minidumps the same way you analyze full and kernel-only dumps. However, I recommend enabling kernel-memory dumps if you have the necessary disk space.
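If you collect minidumps from several systems, the minimmddyy-nn.dmp naming scheme makes them easy to sort by crash date. The following Python sketch shows one way to decode a name. The function name is hypothetical, and two details are assumptions rather than facts from the naming rule: that nn is always two digits and that a two-digit year belongs to the 2000s.

```python
import re
from datetime import date

# minimmddyy-nn.dmp: month, day, year, then a per-day sequence number.
_MINIDUMP_NAME = re.compile(r"^mini(\d{2})(\d{2})(\d{2})-(\d{2})\.dmp$", re.IGNORECASE)

def parse_minidump_name(filename):
    """Return (crash_date, sequence) or None if the name doesn't fit."""
    match = _MINIDUMP_NAME.match(filename)
    if match is None:
        return None
    month, day, year, sequence = (int(group) for group in match.groups())
    try:
        # Assumption: two-digit years map to 2000-2099.
        crash_date = date(2000 + year, month, day)
    except ValueError:
        return None  # digits present, but not a real calendar date
    return crash_date, sequence

print(parse_minidump_name("Mini101500-01.dmp"))
```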

Reasons Crash Dumps Fail
Systems might fail to save a crash dump for a number of reasons. A system won't save a dump if the paging file on your boot volume is too small or if the volume on which you want to save the dump file doesn't have enough free space. In the latter case, you'll find a SaveDump record in the System log indicating that a dump wasn't saved.

More obscure reasons why a system might not save a dump include the possibility that a misbehaving driver corrupted the structures or code involved in saving the dump. In such cases, either the code fails to execute altogether or checksums of the disk device driver components identify changes and the kernel avoids possible disk corruption by not writing the dump. In addition, incompletely written disk drivers, which aren't uncommon on NT 4.0 systems, don't implement the special dump I/O routines that the dump code requires. (For more information, see "Related Reading.") All drivers that Microsoft digitally signs include crash dump support, so this problem won't occur on Win2K systems that have only signed drivers.

To test a system's ability to generate a crash dump, download the BSOD program from http://www.sysinternals.com/bluesave.htm and run it after waiting until your system appears idle for at least a minute. After you confirm that you want to crash your system, BSOD installs a device driver that allocates some kernel memory, frees it, then references the freed memory at a high interrupt request level (IRQL). Referencing freed memory and referencing memory at a high IRQL are illegal operations, so BSOD virtually guarantees a crash.

Analysis Tools
After you've configured your system to generate crash dumps and verified that it can do so successfully, you need to obtain crash dump analysis tools and associated data files. Most important, you must have available the symbol files for at least the kernel's ntoskrnl.exe file. Symbol files identify the names of internal functions and variables in the module to which they correspond, which can provide helpful information during crash dump analysis. If possible, you should obtain and install all the symbol files. Symbol files are service pack-specific, so make sure that the symbols you install are for your service-pack level.

You can find symbol files for the English version of NT 4.0 in the \bussys\winnt\winnt-public\fixes\usa\nt40 directory of Microsoft's anonymous ftp server at ftp://ftp.microsoft.com. (Symbols for other languages are in appropriate subdirectories under \bussys\winnt\winnt-public\fixes.) Symbols for the initial release of Win2K are on the Win2K Customer Support Diagnostics CD-ROM. When you insert this CD-ROM into the drive, a Web page opens and links to the symbol-file extraction tool. You can download Win2K Service Pack 1 (SP1) symbols from http://www.microsoft.com/windows2000/downloads/recommended/sp1/debug/default.asp. The standard symbol installation directory is \winnt\symbols, but you can install symbols anywhere you want. To save your work later when you run analysis tools, define the environment variable _NT_SYMBOL_PATH to point to the top-level directory of your symbol installation (e.g., if you installed to \winnt\symbols, set the path to \winnt\symbols).

Next, you need to install the crash-analysis tools. Although you can find these debugging tools on the NT 4.0 Setup CD-ROM and the Win2K Customer Support Diagnostics CD-ROM, you should download the version posted at http://www.microsoft.com/ddk/debugging/installx86.htm because it reflects recent enhancements and bug fixes. I recommend you install the tools to a directory, such as C:\debuggers, that you can easily access from a command prompt.

Also download the OEM Support Tools from the Microsoft article "OEM Support Tools Phase 3 Service Release 2 Availability" (http://support.microsoft.com/support/kb/articles/q253/0/66.asp). These tools include useful add-ons to the basic debugging tools. The download is a Zip file, and I recommend that you unzip the tools to a different directory from the one you use for the other debugging tools. To read the documentation available for the OEM Support Tools, load the Install directory's starthere.htm file in a Web browser. Periodically check the OEM Support Tools and the debugging tools pages for updates.

Automatic Analysis with Kanalyze
After you've installed the symbols and tools, you're ready to perform crash dump analysis. Run BSOD to generate a crash dump file that we can look at together. If you've performed a crash dump analysis in the past with the debugging tools from the NT 4.0 Setup CD-ROM or the Win2K Customer Support Diagnostics CD-ROM, you've probably used DumpChk to validate a dump file and DumpExam to generate a dump report. In the newer collections of debugging tools, DumpChk and DumpExam are obsolete. I start by describing their replacement, Kanalyze, a tool for automated crash dump analysis.

Kanalyze is a memory-dump analysis engine into which you load plug-in DLLs. You should know something about two of the several types of plug-ins that are available. One type locates and identifies items of interest in a dump; the other type analyzes the items that plug-ins of the first type found (some plug-ins fall into both categories). For example, some plug-ins locate and identify the memory locations of loaded drivers, allocated memory blocks, and I/O request packets. Corresponding analysis plug-ins ensure that the identified memory blocks account for all allocated memory, that loaded driver images don't differ from their on-disk versions, and that the I/O request packets are consistent. During the analysis phase, plug-ins note anomalies and conclusions (i.e., likely causes of a crash). Even during the item-location phase, however, Kanalyze notes any situation in which one item's memory area overlaps another item's memory area without being fully contained within it. For example, a driver's code resides entirely within an allocated memory block, so Kanalyze considers suspicious a situation in which driver code straddles two blocks or resides partially in an unallocated region.
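The overlap rule in the item-location phase comes down to comparing address ranges. The following Python sketch shows the idea under stated assumptions: it isn't Kanalyze's code, the function name and the result labels are hypothetical, and each item is modeled as a half-open (start, end) address range.

```python
def classify_overlap(a, b):
    """Classify how two (start, end) half-open address ranges relate.

    "disjoint"  : the ranges share no addresses
    "contained" : one range lies entirely inside the other (expected)
    "straddles" : the ranges partially overlap (suspicious)
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_end <= b_start or b_end <= a_start:
        return "disjoint"
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    if a_in_b or b_in_a:
        return "contained"
    return "straddles"

block = (0x0000, 0x4000)  # an allocated memory block
print(classify_overlap((0x1000, 0x2000), block))  # driver code inside the block
print(classify_overlap((0x3000, 0x5000), block))  # driver code running past its end
```

Only the "straddles" result corresponds to the kind of situation that Kanalyze flags during item location.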

The Kanalyze tool comes with documentation (accessible through the Kanalyze Help file) that lets third-party developers implement plug-in DLLs, but Kanalyze also bundles several Microsoft plug-in DLLs. For example, memory.dll identifies memory blocks, module.dll identifies driver code and data areas, and kobjects.dll examines a dump for kernel objects.

Another powerful Kanalyze feature is its ability to generate a signature ID file that provides important information about a crash and to store the file's data in a database. After completing an analysis, Kanalyze searches the database for other signature ID information that's similar to the information from the newly completed analysis. Thus, Kanalyze can detect crashes that result from the same cause, which lets you identify trends or recognize that a fix you implemented on another system might also apply to the system on which the latest crash took place. Kanalyze's database features require Microsoft SQL Server 7.0 or higher.

Because you install the OEM Support Tools by unzipping them, no menu shortcuts are available for running Kanalyze. Open a command prompt window, change directories to the directory in which you installed the OEM Support Tools, and type

Kanalyze

The Kanalyze wizard appears and guides you through the automated analysis process. The wizard requires you to specify the location of the memory dump you want to analyze and the location of the symbols. Unless you have SQL Server installed and want to use Kanalyze's crash database support, select the second radio button on the wizard's What would you like to do? page, which Figure 3 shows.

After you direct Kanalyze to the crash dump file, the wizard displays the crash dump's stop code and stop parameters (which Kanalyze calls BugCheck codes and parameters). A driver or kernel component that decides to crash the system uses the stop code to classify the reason that led to the decision. A crash you generate with the BSOD tool has a stop code of 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL) on Win2K and 0xA (IRQL_NOT_LESS_OR_EQUAL) on NT 4.0. Microsoft constantly updates its Knowledge Base to describe common causes of various stop codes and provide pointers to patches, driver updates, and workarounds. To find information about a particular stop code, type Stop and the stop code number in a search of the Knowledge Base. An example of the type of article a search might return is "Bugcheck 0x000000D1 Caused by Dlc.sys" (http://support.microsoft.com/support/kb/articles/q266/2/21.asp), which explains how the Win2K Data Link Control (DLC) driver can cause a stop code of 0xD1 and directs you to a hotfix for the driver.
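If you handle many dumps, a small lookup helper saves retyping stop codes in the eight-digit form that Knowledge Base article titles use. The following Python sketch is illustrative: the table holds only the two stop codes mentioned above, and the function names are hypothetical.

```python
# Only the two stop codes discussed in this article; extend as needed.
STOP_CODE_NAMES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x000000D1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}

def describe_stop_code(code):
    """Format a stop code as eight hex digits plus its symbolic name."""
    name = STOP_CODE_NAMES.get(code, "unknown stop code")
    return "0x{:08X} ({})".format(code, name)

def knowledge_base_query(code):
    """Build the 'Stop <code>' search text suggested for the Knowledge Base."""
    return "Stop 0x{:08X}".format(code)

print(describe_stop_code(0xD1))
print(knowledge_base_query(0xD1))
```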

If you don't find a Knowledge Base article that matches your environment or crash scenario, continue to the next Kanalyze screen, which asks for the location of the symbols. Kanalyze loads the symbols for all the kernel modules it finds in the dump. This page informs you of missing symbols (third-party drivers don't usually include symbols) or symbols that don't match the loaded modules. Symbol mismatch warnings mean that the installed symbols might be outdated because of a service pack or hotfix installation.

The wizard's subsequent page tells you which plug-ins Kanalyze will load to perform the crash dump analysis. On the next screen, Kanalyze calls the DLLs in turn to locate items and analyze the resulting information. You can watch Kanalyze progress in phases. The wizard's [KA_START_LOCATE_ITEMS] phase reports the plug-ins that are looking for items in the crash dump, then the [KA_PERFORM_ANALYSIS] phase runs all the plug-ins that perform analysis. When the analysis phase is complete, Kanalyze waits for you to move to the final page of the wizard, which lets you view the analysis results.

When the View button in the Analysis conclusions area of the Results page isn't shaded, one or more plug-ins think that they have identified the cause of the crash. If you run Kanalyze on a dump that you used BSOD to generate, View is enabled, as Figure 4 shows. Clicking View displays a Namespace Browser window of identified problems, which Figure 5 shows. The window tells you that the STOPCODE plug-in thinks that the crashdd.sys driver produced the crash. The Namespace Browser window even shows you a stack trace (not visible in Figure 5) that tells you that the IopLoadUnloadDriver function in Ntoskrnl (the kernel) invoked a function in crashdd.sys and that crashdd.sys then invoked KiTrap0E in Ntoskrnl. Whenever you see a function that contains the word Trap or Exception in a crash dump trace, you can bet that code in the kernel accessed an invalid pointer, crashing the system. In a BSOD-generated crash, crashdd.sys' access of invalid memory causes the trap function to be executed, so Kanalyze is right on the mark.

I don't recommend viewing the Anomalous conditions area of the wizard's Results page. Plug-ins conservatively identify as potentially unusual many situations that are not. The Results page also provides a View button for the Information from database area that lets you compare crash information with other information stored in a database. If you don't have SQL Server installed, you can't enable this functionality.

Advanced, the final button on the Results page, provides a view of detected items that you might use to do some manual analysis. Items are organized by type, and subitems reside underneath related items in the hierarchy that plug-ins define for their objects. For instance, the Module folder, which the Module plug-in generates, has subfolders for the memory regions that the driver and kernel code, data, and image header (which stores information about the composition of an image) occupy. The Process subfolder of the ExecutiveObject folder might be useful. Figure 6 shows this subfolder, which lists all the processes running on the system at the time of the crash and provides detailed information about each process' memory usage.

Manual Analysis with Kd
If Kanalyze fails to pinpoint the reason for a crash or at least provide useful hints, you can poke around the crash dump manually on the chance that you might spot something that Kanalyze missed. Two OEM Support Tools are available for manual analysis: WinDbg (often called Win Debug) and Kd (which earlier releases of the new debugging tools called i386kd). These tools have identical command sets and data-dumping capability, but WinDbg is a Windows application, whereas Kd is a command-line program. I recommend using WinDbg, which lets you easily copy values and use subwindows to simultaneously view more information.

To start WinDbg for crash dump analysis, type

windbg -z dumpfile -y symbolpath

at a command prompt, where dumpfile is the crash dump file you want to examine and symbolpath is the directory in which you installed the symbols. (If you've defined the _NT_SYMBOL_PATH variable, you can omit the -y option.) WinDbg will run and present a view like the one that Figure 7 shows. You can now enter a number of debugging commands that will show you the state of various aspects of the system at the time of the crash. The debugging environment consists of three types of commands: built-in debugging commands, which have no prefix; dot commands, which have a dot (.) as a prefix; and bang commands, which have an exclamation point (!) prefix.
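If you launch WinDbg from a script, the rule about omitting -y is easy to encode. The following Python sketch assembles the command line, assuming that -z is followed by the dump file's path and -y by the symbol directory. The function name and the sample paths are hypothetical, and the sketch only builds the argument list; it doesn't start WinDbg.

```python
import os

def build_windbg_command(dump_file, symbol_path=None, environ=None):
    """Return the argument list for opening a crash dump in WinDbg."""
    env = os.environ if environ is None else environ
    command = ["windbg", "-z", dump_file]
    if symbol_path is not None:
        # An explicit symbol directory is passed with -y.
        command += ["-y", symbol_path]
    elif "_NT_SYMBOL_PATH" not in env:
        # Without -y, the debugger depends on the environment variable.
        raise ValueError("pass symbol_path or define _NT_SYMBOL_PATH")
    return command

# Hypothetical paths for illustration.
print(build_windbg_command(r"C:\winnt\memory.dmp", symbol_path=r"C:\winnt\symbols"))
```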

The most useful built-in debugging command is Dd, which dumps a range of memory. The dd esp command dumps what the stack-pointer register (aka the esp register) points at. However, unless you're familiar with x86 assembly language, esp dumps won't be useful. To access the online Help for the built-in debugging commands, use the ? command.

You can use the dot commands to load and unload debugger plug-in DLLs (also called debugger-extension DLLs) and control the behavior of a live debugging target. A live target is an operational system that you're actively debugging. Like the built-in commands, dot commands either don't facilitate crash dump analysis or they require advanced knowledge.

Debugger-extension DLLs implement the bang commands. WinDbg and Kd automatically load the kdextx86.dll basic kernel-debugging extension DLL, which provides commands that let you display information about various Win2K or NT kernel objects. Start with some initial data gathering by running the !process pid command. This command dumps information about the process that was being executed when the crash occurred. To obtain a complete list of processes, use the !process 0 0 command. The command !thread tid dumps data about the thread that was being executed, including its stack trace. Simply determining which process was running at the time of a crash might provide a useful clue to the crash's cause, and the stack trace might list a driver that was responsible for the crash. If you run !thread tid on a crash dump you generated with BSOD, you'll see a stack trace that identifies crashdd.sys.

If you see text such as TrapFrame @ 8013eee8 on the right side of the stack trace's line, run the .trap nnnn command, where nnnn is the hexadecimal number that appears after the at sign (@) in the text (8013eee8 in the sample text). Then, run the Kv command. WinDbg shows you the stack trace of a trap frame, which reflects the stack before a trap handler function took control. Although WinDbg isn't always able to display an accurate stack trace, when it does, the trap frame's stack trace reveals the actual trace that led to the crash. Do a Knowledge Base search for the names of any drivers you see in the stack trace on the chance that you've encountered a Microsoft-documented problem. Refer to the WinDbg Help for advanced tips about trying to determine a stack trace yourself.
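If you save debugger output to a text file, you can pick out trap frames mechanically. The following Python sketch scans text for the TrapFrame @ pattern and returns the follow-up commands to type. The function name is hypothetical, the sketch assumes the address is always eight hexadecimal digits, and everything in the sample line other than the TrapFrame text is made up for illustration.

```python
import re

# Matches text such as "TrapFrame @ 8013eee8" and captures the address.
_TRAP_FRAME = re.compile(r"TrapFrame\s*@\s*([0-9a-fA-F]{8})")

def trap_commands(stack_trace_text):
    """Return the .trap and kv commands for each trap frame in the text."""
    commands = []
    for address in _TRAP_FRAME.findall(stack_trace_text):
        commands.append(".trap " + address)
        commands.append("kv")
    return commands

# Hypothetical stack trace line; only the TrapFrame text comes from the article.
sample = "f0432b10 00000000 example!Function+0x10 (TrapFrame @ 8013eee8)"
print(trap_commands(sample))
```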

The !drivers command dumps a list of loaded drivers that contains some of the same information that NT 4.0 presents on its blue screens. This command displays driver creation dates, which can alert you to out-of-date drivers. Check with vendors for updates to old drivers. One way to determine a driver's vendor is to view the properties of the driver file in Windows Explorer (most drivers are stored in the \winnt\system32\drivers directory); the version information includes the developer's copyright notice and sometimes a description of the driver.
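Once you've copied driver names and creation dates out of the !drivers listing, flagging the old ones is a simple date comparison. The following Python sketch works on dates you've already transcribed; it doesn't parse the debugger's output. The function name, the cutoff date, and the sample entries are hypothetical.

```python
from datetime import date

def outdated_drivers(drivers, cutoff):
    """Return the sorted names of drivers created before the cutoff date.

    drivers is an iterable of (name, creation_date) pairs.
    """
    return sorted(name for name, created in drivers if created < cutoff)

# Hypothetical entries transcribed from a !drivers listing.
loaded = [
    ("newdisk.sys", date(2000, 6, 1)),
    ("oldscan.sys", date(1997, 3, 14)),
    ("oldnet.sys", date(1998, 11, 2)),
]
print(outdated_drivers(loaded, cutoff=date(1999, 1, 1)))
```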

Numerous other bang commands exist (the !help command provides a complete list), but I've presented those that you can use without advanced knowledge of Win2K or NT internals. The WinDbg Help file describes various options that the bang, dot, and built-in commands support.

Good Luck with Your New Knowledge
Despite Kanalyze's best effort, no magic wand exists that you can wave at every crash dump to precisely identify the cause. I hope I've provided some guidance that helps you extract from a crash dump information that you might not otherwise have obtained. As I wrote at the start of this article, spending a few minutes with Kanalyze or WinDbg might save you from repeated crashes or from spending hours reinstalling the OS. Thus, learning about these tools is worth your while even if they don't always help you.

Related Reading
"MEMORY.DMP File Not Created on Compaq DeskPro XL 566"
http://support.microsoft.com/support/kb/articles/q126/9/75.asp

"MEMORY.DMP File Not Created on Some NCR Computers"
http://support.microsoft.com/support/kb/articles/q136/3/76.asp

"No MEMORY.DMP File Created with RAM Above 1.7 GB"
http://support.microsoft.com/support/kb/articles/q173/2/77.asp

"Windows NT Does Not Save Memory Dump File After a Crash"
http://support.microsoft.com/support/kb/articles/q130/5/36.asp

"WinNT Fails to Create a Memory.dmp On Any Other LUN Than 0"
http://support.microsoft.com/support/kb/articles/q168/1/05.asp