Troubleshooting your crashed NT systems

The use of Microsoft's command-line x86 kernel debugger is commonly seen as a black art, both by experienced support professionals and by new Windows NT users. When workstations or servers suffer a failure and display a blue screen, they can generate a crash dump. Unfortunately, many users ignore or delete these crash dumps. However, with some basic preparation and knowledge, you can use the kernel debugger to extract valuable information about the system's state at the time of failure. You can then correlate this information with the installed hardware, software, and other system parameters to help formulate a strategy for troubleshooting the system.

Although a full treatment of the kernel debugger and debugging tactics might fill several books, setting up the kernel debugger to debug a crash dump is not difficult. This article explains this process step by step and presents several specific command examples that demonstrate how you can use the debugger to extract useful information from crash dumps. I've also presented references to existing literature on kernel debugging for further research.

Kernel Debugging Basics
Debuggers let you inspect and troubleshoot program code as it runs. You can examine variables, registers, and stacks, and pinpoint problems by stepping line by line through a program. Some debuggers support source-level debugging by matching the developer’s source code (written in C, Basic, or another high-level language) with the corresponding machine instructions. This level of detail shows how the compiler translated each line of source code and the exact effect of that code on the system. Other debuggers support only direct machine instruction or assembly language debugging. You typically use kernel debuggers to debug core OS components and drivers and use user-mode debuggers to debug applications and services. In a live debug session, a serial cable connects the target machine that you want to debug with a host machine that runs the debugger. Debug code running on both machines communicates commands and data via the serial ports. In a crash dump debug session, you work offline, after the failure has occurred, and analyze the crash dump file that represents the complete contents of memory at the time of the crash.

Several debuggers are available for NT from Microsoft and third-party software vendors such as Compuware NuMega. Two well-known debuggers from Microsoft for use on the x86 platform are i386kd.exe and windbg.exe. I386kd.exe (available in the \support\debug\i386 directory of the NT 4.0 CD-ROM) is the command-line kernel debugger for x86 code. Windbg.exe (available separately from Microsoft) is the GUI counterpart of i386kd.exe and can perform both kernel-mode and user-mode debugging. Each debugger executable interprets the register, stack, and instruction information for a particular processor architecture. For Alpha code, alphakd.exe is the equivalent of i386kd.exe. (This article uses the term kernel debugger to mean the x86 i386kd.exe from NT 4.0.)

Microsoft and third-party software vendors sometimes request that customers submit crash dumps as compressed files for diagnostic purposes. In some instances, vendors request permission to dial in to a customer's site and engage in a live debug session. The vendors typically perform these sessions using i386kd.exe because they can easily export or pipe this tool to the host machine and then access the failed system remotely via the remote.exe utility, which is available in the Microsoft Windows NT 4.0 Server Resource Kit. Even if you never debug your crash dumps yourself, setting up the symbolic information ahead of time will speed up this debugging process.

Blue Screens and Crash Dumps
The blue screen of death is something every experienced NT support professional has seen. The sidebar, "Windows NT Kernel Debugging Resources," lists resources that explain why blue screens happen and how to interpret them. As a refresher, the blue screen indicates that the OS encountered an abnormal situation that it couldn't handle using normal error mechanisms. The OS consequently decided that it couldn't guarantee continued safe processing. Rather than risk corrupted data, NT provides a special internal function known as KeBugCheckEx(). The OS and device drivers use this function to halt the system when they find themselves in the previously described situation. After taking control of the system and placing the display into VGA 80x50 mode, this function generates all the information seen on the blue screen, such as the stop code and its parameters, driver addresses, and stack data. The function also generates a crash dump, but only if you select the Write debugging information option on the Startup/Shutdown tab of the System applet in the NT 4.0 Control Panel, as Screen 1 shows. Assuming you properly size the paging file, the OS invokes the savedump.exe utility to write the contents of memory into the paging file and mark the location with a special code. Upon reboot, NT copies this part of the paging file to the filename specified, usually %systemroot%\memory.dmp.

After the Savedump utility writes the contents of memory to disk, the OS displays a message to this effect and you can restart the system to restore operation and access the crash dump. I suggest you move the memory.dmp file from the crashed system to removable storage or another location on the network.

Creating a Symbol Tree
To interpret and debug an NT crash dump, you need the symbolic information for the OS on the crashed machine. Symbolic information consists mainly of mappings between memory addresses and human-readable variable and function names that the compiler generates when it compiles and links the NT source code. Symbol files, which are available on the NT CD-ROM and on Microsoft's FTP site, store the symbolic information. Every NT core file (e.g., .dll, .sys, .exe) has a corresponding symbol file with a .dbg extension.

Each NT build has its own symbol files, and each replacement system file in a service pack or hotfix (e.g., tcpip.sys) has a corresponding symbol file (e.g., tcpip.dbg). You must use the symbol files that correspond to the core files on a given system to debug a crash dump from that system. The debugger uses checksums in the symbol files to verify the proper versions. For example, if you install NT Workstation 4.0 build 1381 and Service Pack 3 (SP3) on a system, you need the symbol files from build 1381 and SP3 to set up a debug or crash dump analysis session for that system. If you then install several post-SP3 hotfixes, you also need the symbol files from each hotfix. However, if you add SP4, you no longer need the symbol files for the post-SP3 hotfixes, but you will need the SP4 symbol files.
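The rule above can be sketched in a few lines of Python. This is only an illustration of the selection logic as I've described it: the function name, the hotfix names, and the idea of tagging each hotfix with the service pack it was issued against are all hypothetical, and the sketch assumes that a newer service pack supersedes both the older service pack and its hotfixes.

```python
def required_symbol_sets(service_pack, hotfixes):
    """List the symbol sets needed for one machine configuration.

    service_pack: installed service pack number, or 0 for none.
    hotfixes: (name, issued_after_sp) pairs for every hotfix ever applied.
    """
    sets = ["base build 1381"]
    if service_pack:
        sets.append(f"SP{service_pack}")
    # Hotfixes issued against an older service pack are assumed to be
    # superseded by the newer service pack, so they are left out.
    sets += [name for name, sp in hotfixes if sp == service_pack]
    return sets


# Hypothetical hotfix names, both issued after SP3.
hotfixes = [("hotfix-a", 3), ("hotfix-b", 3)]
print(required_symbol_sets(3, hotfixes))
print(required_symbol_sets(4, hotfixes))
```

With SP3 installed, the result includes the base build, SP3, and both hotfixes; after SP4 is added, only the base build and SP4 remain.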

The best way to effectively manage all these symbol files is to create a directory structure, a symbol tree, as Screen 2 shows. A symbol tree contains different subdirectories for the symbol files from each NT build, each service pack, and each set of hotfixes that you apply to your systems. To create the symbol tree (I use C:\symtree here, but the root of the symbol tree directory can be any valid directory name), you need the original NT symbol files, and all applicable service pack and applicable hotfix symbol files.

Original NT Symbol Files. You will need the original NT symbol files, specifically the \support\debug\i386\symbols directory from the NT CD-ROM. These symbol files correspond to the originally installed set of OS files. Use the expndsym.cmd command file in the \support\debug directory to install the symbol files onto a local or network drive as follows:

EXPNDSYM D: C:\symtree\basent4

This example assumes that your CD-ROM drive is D and that you want to install the symbol files into the C:\symtree\basent4 directory, which already exists. When this command file completes, it will create the directory C:\symtree\basent4\symbols with several subdirectories for each symbol type, such as .dll, .sys, and .exe. All the symbol files in each of these subdirectories will have .dbg file extensions. Do not rename the \symbols subdirectory because the debugger looks for that specific subdirectory name under each symbol group directory. For NT 4.0, the \basent4 directory will be about 96MB in size. The command file also creates a \system32 directory under the \basent4 directory. This directory holds the actual kernel debugger executables, including i386kd.exe and numerous support files.

Applicable Service Pack Symbol Files. For SP3 and SP4, the corresponding symbol files are available on the service pack CD-ROMs in the \support\debug subdirectory. On the SP3 CD-ROM, the files are uncompressed, and you simply copy them to the target directory (e.g., C:\symtree\sp3). On the SP4 CD-ROM, these files are in a self-extracting executable that prompts you to provide a target directory (e.g., C:\symtree\sp4) into which it uncompresses the files. If the C:\symtree\sp4 directory doesn't exist, the self-extracting executable creates it and the C:\symtree\sp4\symbols subdirectory. The symbol files for both SP3 and SP4 are also available on the Microsoft Web site at http://support.microsoft.com/support/ntserver/content/servicepacks/default.asp. I recommend that you obtain and prepare both SP3 and SP4 symbol file sets so that all the relevant files are available when you're debugging crash dumps. The x86 symbol file sets for SP3 and SP4 occupy about 70MB and 170MB, respectively, after extraction.

Applicable Hotfix Symbol Files. If you have one or more standard machine configurations (each having a known set of hotfixes installed), you can easily create a subdirectory under C:\symtree for each configuration. My company assigns version numbers (e.g., version 2.2, version 2.5) to workstation and server builds, so I’ve created example subdirectories (e.g., \hotfix22, \hotfix25). I create a \symbols subdirectory under each \hotfix directory to store the symbol files for each set of hotfixes. Under each \symbols subdirectory, you must first create a subdirectory for each symbol type, such as dll, exe, and sys. Then, you must copy the symbol files from all the corresponding hotfixes into the appropriate subdirectories based on the extension of the original files. Because some hotfixes overwrite files from previous hotfixes, copy the symbol files in the same order in which you install the hotfixes. You can obtain hotfixes and the associated symbol files from Microsoft's FTP site at ftp://ftp.microsoft.com/bussys/winnt/winnt-public/fixes/usa/nt40.
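The copy step for hotfix symbols lends itself to a small script. The following Python sketch is one way it might look, not a Microsoft tool: the function names are hypothetical, each hotfix is described by a made-up mapping from the original file name (e.g., tcpip.sys) to the location of its .dbg file, and the demo runs against placeholder files in a temporary directory rather than real symbol files.

```python
import shutil
import tempfile
from pathlib import Path


def symbol_destination(symbols_dir, original_file):
    # tcpip.sys maps to <symbols_dir>/sys/tcpip.dbg, following the
    # per-extension subdirectory layout described in the article.
    original = Path(original_file)
    ext = original.suffix.lstrip(".").lower()  # "sys", "dll", or "exe"
    return Path(symbols_dir) / ext / (original.stem + ".dbg")


def merge_hotfixes(symbols_dir, hotfixes):
    # hotfixes: list of dicts {original file name: path to its .dbg file},
    # given in installation order so later hotfixes overwrite earlier ones.
    for hotfix in hotfixes:
        for original_file, dbg_path in hotfix.items():
            dest = symbol_destination(symbols_dir, original_file)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(dbg_path, dest)


# Demo with placeholder files; two hypothetical hotfixes both replace tcpip.sys.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    first = tmp / "fix1_tcpip.dbg"
    first.write_text("from first hotfix")
    second = tmp / "fix2_tcpip.dbg"
    second.write_text("from second hotfix")
    symbols = tmp / "hotfix22" / "symbols"
    merge_hotfixes(symbols, [{"tcpip.sys": first}, {"tcpip.sys": second}])
    result = (symbols / "sys" / "tcpip.dbg").read_text()

print(result)
```

Because the hotfixes are processed in installation order, the symbol file that survives is the one from the most recently installed hotfix, which matches the system file actually on disk.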

Strictly speaking, the only two symbol files that you must have to start the kernel debugger are ntoskrnl.dbg (along with ntkrnlmp.dbg for multiprocessor systems) and the symbol file for the appropriate hardware abstraction layer (HAL). However, the range of information that you can gather with only these two symbol files is limited.

Setting Up the Environment
Now that you've set up the symbol tree, the only remaining step is to start the kernel debugger i386kd.exe, which is in C:\symtree\basent4\system32 on my system. The easiest method for starting the debugger is to use a command file that sets two necessary environment variables (_NT_Symbol_Path and _NT_Debug_Log_File_Append) and then executes the debugger. Listing 1 shows an example command file, C:\symtree\debug.cmd.

_NT_Symbol_Path. The _NT_Symbol_Path variable holds one path or a set of paths separated by semicolons. The kernel debugger will search these paths for each symbol file it tries to load. The paths should all end with the word symbols (e.g., C:\symtree\basent4\symbols). The debugger will stop searching the paths after it finds the first occurrence of the symbol file, even if it is the incorrect version (determined via checksum). This environment variable lets you create the symbol tree with all possible symbol sets but use only the ones you need for a particular session. The example command file includes the symbol subdirectories for SP3 and the base NT 4.0 symbols.

_NT_Debug_Log_File_Append. The _NT_Debug_Log_File_Append variable contains a logging path and filename where the debugger writes all console output. If the specified file already exists, the debugger will append output to the file. Keeping a library of log files is a great way to build a reference of troubleshooting steps.

The last line in the command file starts the kernel debugger with three parameters (-v, -z, and %1). The -v parameter activates verbose mode when displaying the progress of symbol file loading. The -z parameter selects crash dump mode (as opposed to a live kernel debug), and the %1 parameter holds the name of the previously obtained crash dump file.
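To make the pieces concrete, the following Python sketch assembles the text of a command file along the lines described above. It is not Listing 1 itself: the function names and the log file location (C:\symtree\debug.log) are my own assumptions, and the SP3 directory is listed ahead of the base directory on the reasoning that the debugger stops at the first match, so the newest symbol set should be searched first.

```python
def symbol_path(root, groups):
    # Each entry must end in "symbols"; entries are searched left to right,
    # so list the most recent symbol set first.
    return ";".join(f"{root}\\{group}\\symbols" for group in groups)


def command_file(root, groups, log_file):
    # Two environment variables, then the debugger in crash dump mode.
    lines = [
        f"set _NT_Symbol_Path={symbol_path(root, groups)}",
        f"set _NT_Debug_Log_File_Append={log_file}",
        f"{root}\\basent4\\system32\\i386kd.exe -v -z %1",
    ]
    return "\n".join(lines) + "\n"


# The log file path is an assumed location, not one taken from Listing 1.
text = command_file(r"C:\symtree", ["sp3", "basent4"], r"C:\symtree\debug.log")
print(text)
```

Generating the file this way makes it easy to produce one command file per standard machine configuration, each with its own list of symbol groups.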

Starting the Debugger
You're ready to execute debug.cmd and start the debugger. Be sure to specify the name of the crash dump file as the first argument. If you like, create a .pif file that executes the command file and changes the screen buffer height to about 500 lines (under the Layout tab). This modification is useful for scrolling back to see the output of a previous command. After the debugger starts, it will display the symbol search path from the environment variable, _NT_Symbol_Path, and read basic information from the crash dump. This information includes the address at which the kernel loaded, the kernel version number and type (checked or free), and the stop codes from the blue screen on the crashed system, as Screen 3 shows. Make sure that no checksum errors display as the debugger loads the kernel symbol files. If you see a checksum error during the loading of a symbol file, the most likely cause is that the symbol file is incorrect. The debugger will also list each kernel-mode code module that was in memory on the crashed system and its load address. Finally, the debugger will display the kernel subroutine that was at the top of the active thread’s stack (e.g., ntoskrnl!KiDispatchException+0x35e). When all these actions complete, the kernel debugger will display the kd> command prompt. The kernel debugger is now operating inside the crash dump, using the memory image of the crashed system as its operating environment.

The kd> prompt lets you enter commands and generates output. You can obtain a complete listing of kernel debugger commands by typing

!help

at the kd> prompt and hitting Enter. Many commands start with the exclamation point (known as bang commands). Some commands require arguments, such as memory addresses, that you must specify in hexadecimal format (e.g., 0x80100000). To abort the output of a long-running command, hit Ctrl-C. To exit the kernel debugger, use the q command. After exiting a debugging session, I recommend that you copy the log file (debug.log) to a separate location.

Getting Results
Now that you know how to execute the kernel debugger, let's put this knowledge to work with some basic commands in a sample troubleshooting scenario. Imagine that your company installed some new applications and hardware upgrades on one of your department's NT servers 2 months ago. Since then, the following cycle has repeated several times: The server runs properly for several days but then users connecting to resources start to complain about slower and slower performance. Each time, the server eventually stops responding altogether and a blue screen occurs. You’ve been coming in on weekends to reboot the server and head off the problem, but you haven’t had time to investigate further. You then configured the machine for crash dump generation so when the blue screen occurs again, the OS writes a memory.dmp file to disk. After you restart the server, you can use the kernel debugger to gather information on what happened, so you copy the memory.dmp file to the debugging workstation where you've already created a symbol tree.

After the kernel debugger loads your crash dump and is ready to accept commands, the first step is to verify that the symbol tree you created on your debug system is accurate. To verify the accuracy, type the command

!locks -p

at the kd> prompt. This command dumps kernel-mode resource locks and resource performance data, if it exists. In the process of performing this dump, the command also verifies the symbol file for every loaded module. You notice that all the symbol files are found and loaded properly, with no checksum errors.

Remembering that users were complaining of slower and slower performance, you decide to look at the virtual memory statistics at the time of the crash. One thing that can affect server performance is one process using large amounts of physical memory that previously belonged to other processes or the OS. To see if this is the case, you type in the command

!vm

at the kd> prompt. This command displays information on memory in use by system processes (Paged Pool and NonPaged Pool) and for each user-mode process, as Screen 4 shows. This information is helpful for identifying processes that may have been leaking memory for long periods. When you execute the command, it shows that a user-mode process called database.exe has been using 24MB of memory on this 64MB system.

Suspecting that maybe a 64MB system is insufficiently powered to serve as a database server, you wonder what other upgrades your company performed on this system. You type in

!drivers

at the kd> prompt to show base addresses, sizes, and link dates for all kernel-mode components, as Screen 5 shows. You can look in the output of this command for recently added drivers or applications. Upon inspection, this output lists a new driver that you haven’t seen before—tapedev.sys—along with its base load address, F8C00000. By now, the information from the initialization of the kernel debugger has scrolled off the top of the screen, so you redisplay the blue screen stop codes by typing

dd kibugcheckdata l5

at the kd> prompt. The dd command displays the contents of memory as double words (32-bit values), and kibugcheckdata is a symbol name that points to the location of the blue screen stop codes in memory. The l5 argument limits the display to five double words: the stop code and its four parameters. The stop code was 1E (Kmode_Exception_Not_Handled), and one of the parameters (F8C01482) looks suspiciously like a memory address inside the driver tapedev.sys, which loaded at F8C00000. This clue points to the recently installed hardware upgrades as a possible source of the problem.
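The reasoning behind that hunch is simple range arithmetic: an address belongs to a module if it falls between the module's base address and its base plus size. The Python sketch below illustrates the check. The base address of tapedev.sys and the suspect address come from the scenario; the module sizes, the ntoskrnl.exe entry, and the function name are hypothetical values that you would replace with the figures from your own !drivers output.

```python
def owning_module(address, modules):
    # Return (name, offset) for the module whose address range contains
    # the address, or None if no listed module contains it.
    for name, base, size in modules:
        if base <= address < base + size:
            return name, address - base
    return None


# (name, base address, size in bytes); sizes here are made up for illustration.
modules = [
    ("ntoskrnl.exe", 0x80100000, 0x000D0000),
    ("tapedev.sys", 0xF8C00000, 0x00004000),
]

hit = owning_module(0xF8C01482, modules)
print(hit[0], hex(hit[1]))
```

The offset into the module (0x1482 in this case) is also worth recording, because a vendor can often map it back to a specific routine in the driver.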

Next, you decide to look at what other software was running on the system. You type

!process

at the kd> prompt to display the process that owned the thread that was executing at the time of the crash, as Screen 6 shows. This process is often a good clue as to what triggered the blue screen. Running the !process command shows that the database.exe process was active. Taking it a step further, you use the !process 0 0 command to display an abbreviated list of all processes that were in memory, as Screen 7 shows. When you scroll back through the window, you see that almost 70 processes were running, including some of the newly installed applications, which is yet another sign that the server is underpowered. The !process 0 7 command displays expanded information on each process, including all its threads, what routines each thread was executing, and the total CPU time used. Armed with this information, you contact the vendors of the database application and tape driver software that your company installed on the server and politely request that they perform some thorough compatibility testing.

Certainly, most problems and debugging sessions are not this simple, and a good knowledge of NT internals is usually a must to drill down and get the exact evidence you need. However, even a basic session lets you recover information about the crash that otherwise would have been lost. Fortunately, Microsoft has written a utility to automate some of the process. Dumpexam.exe (also located in C:\symtree\basent4\system32) executes the above debugger commands and several others against a specified crash dump file. Like i386kd.exe, it requires the symbol tree. Execute it with the following command line:

DUMPEXAM -y <symbolpath> <crashdumpfile>

Replace symbolpath with the value of your _NT_Symbol_Path variable from the debug.cmd command file, and replace crashdumpfile with the full path to the crash dump file. Dumpexam.exe will write the output to a file, memory.txt.

Wrapping Up
NT kernel debugging is a huge topic, but you don’t have to know every last detail to make use of the technology and accelerate your troubleshooting efforts. As this article demonstrates, simply knowing how to set up the kernel debugger and extract information about the status of a failed system can be of great benefit to you and to your hardware and software vendors.