Find out what it has to tell you
You're working along just fine, and suddenly, your screen display changes from your nice
user interface to something that looks like Screen 1. You know what it is: a Windows NT kernel STOP
error, or the blue screen of death. So, what can you do? Often, the problem goes away when you
reboot the system. But what if it doesn't? What does that screen mean? Is it safe to continue using
the system? Let's look at what a kernel STOP error means, what can cause it, and most important,
what information you can get from the blue screen.
What Happens at the Kernel Level?
First, let's review the basics of the NT architecture. The NT operating system has two layers:
user mode and kernel mode. User mode is where the various subsystems--such as the Win32, POSIX, or
OS/2 subsystem--reside. Components in this mode provide the environments in which all user
applications run. For instance, Win32 programs run on the Win32 subsystem.
As you see in Figure 1, the kernel mode sits between the user mode and the physical layer (the
hardware) and prevents the user mode from directly accessing the hardware. The kernel mode also is
the home for the various NT executive services, such as the Object Manager, Security Reference
Monitor, and Process Manager. Just above the physical device hardware lies the hardware abstraction
layer (HAL) and above that is the NT microkernel. The HAL is the portion of the kernel that is
written in the specific platform assembly language. The microkernel is the heart of the OS that
takes care of all the NT internal OS operations.
An important component of executive services is the I/O Manager. Besides taking care of all
input and output for the operating system, the I/O Manager manages communications between drivers
and supports all file system drivers and hardware device drivers.
NT is a modular operating system; this fact means you can add DLLs or device drivers to add
capabilities to the system. You can, for instance, add fault tolerance to NT by adding device
drivers. When a peripheral manufacturer develops a driver for NT, the driver is most likely a kernel
mode driver: It resides in the kernel mode area and probably interfaces with Microsoft kernel
drivers. You can think of kernel drivers as the NT counterpart to Windows 3.1 or NT virtual device
drivers (VxDs). Kernel drivers are the low-level mechanisms for talking to the hardware. So when the
driver does something it's not supposed to, the error occurs at the lowest level and directly
affects the overall system and causes a kernel STOP error.
If an application operating in user mode does something to cause an error, NT halts the process
and generates an Illegal Operation error. Because every Win32 application has its own virtual
protected space, this error condition doesn't affect any other Win32 programs running. If the
application tries to directly access the hardware without going through the correct methods, NT
notices this and generates an exception error. A nice thing about NT is that it has good protection
systems for erratic applications.
When an application faults, you can close the offending program and resume work. Kernel error
conditions, however, typically are not recoverable; you have to reboot the system. You can think of
the kernel STOP as a built-in error-trapping mechanism. A kernel STOP error is NT's way of halting
further activity before the activity severely damages your system or corrupts data.
What Does This Weird Screen Mean?
OK, so what does this screen tell me? The kernel STOP may mean that a kernel driver--either a
system device driver or a third party driver--has illegally accessed the privileged kernel area. Or
the kernel STOP may mean that you have mixed SIMMs or added a bad network controller or SCSI
controller. In these cases, you can fix the problem by removing the offending hardware device. If
you have not added any new hardware, you need to get more information from the blue screen. Let's
look at each portion of Screen 1. Fortunately, you don't need to understand everything on the
screen.
At the top of the display is a hexadecimal value followed by four hex numbers in parentheses.
The first hex code is the kernel error code. With this error code, you can determine where the error
occurred, but not which driver caused the error. Table 1 lists the various error conditions. In our
example, the error condition is 0x0000000A, IRQL_NOT_LESS_OR_EQUAL. This code means that a process
attempted to access pageable memory at an interrupt request level (IRQL) that was too high.
Microsoft Windows NT Server Resource Kit and Microsoft Windows NT Workstation Resource Kit have
complete listings of STOP codes.
The values in the parentheses give more specific information about what the driver was doing
when the error happened. The first value (00000000) points to the address that the driver referenced
improperly. The second value is the IRQL that was required to access the memory. The third value
specifies whether the driver was doing a read or a write. The fourth value points to the instruction
address that attempted the access. By looking at the STOP code and the third and fourth parameters,
you can possibly determine what caused the error condition.
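As a rough illustration of reading that top line, here is a minimal Python sketch that splits a STOP line into the error code and the four values in parentheses. The exact text layout of the line, the sample parameter values, and the names parse_stop_line and describe are assumptions made for this example; only the 0x0000000A code and the field meanings come from the description above.

```python
import re
from typing import NamedTuple, Tuple


class StopInfo(NamedTuple):
    code: int                # kernel error (STOP) code
    params: Tuple[int, ...]  # the values in parentheses


# Only the code named in the article; a real table would list many more.
STOP_NAMES = {0x0000000A: "IRQL_NOT_LESS_OR_EQUAL"}

# Field meanings for STOP 0x0000000A, as the article describes them.
PARAM_LABELS_0A = (
    "address referenced",
    "IRQL",
    "read or write",
    "instruction address",
)

# Assumed layout: "STOP: <hex code> (<hex>, <hex>, <hex>, <hex>)".
STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line: str) -> StopInfo:
    """Extract the STOP code and its parameters from the top line."""
    match = STOP_RE.search(line)
    if match is None:
        raise ValueError("no STOP code found in line")
    code = int(match.group(1), 16)
    # Accept values separated by commas, spaces, or both.
    raw_values = match.group(2).replace(",", " ").split()
    params = tuple(int(value, 16) for value in raw_values)
    return StopInfo(code, params)


def describe(info: StopInfo) -> str:
    """Render the parsed values with the labels used in the article."""
    name = STOP_NAMES.get(info.code, "unknown STOP code")
    lines = [f"STOP 0x{info.code:08X} ({name})"]
    if info.code == 0x0000000A:
        for label, value in zip(PARAM_LABELS_0A, info.params):
            lines.append(f"  {label}: 0x{value:08X}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical line; the parameter values are invented for illustration.
    sample = "*** STOP: 0x0000000A (0x00000000,0x0000001C,0x00000000,0x80114738)"
    print(describe(parse_stop_line(sample)))
```

Copying the line down by hand and running it through something like this makes it easier to keep the four values straight when you look them up or report them.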
The information in the middle of the screen, called the DLL base (starting with 80100000),
lists the drivers the system loaded and initialized successfully. The bottom of the screen, called
the DLL load base, shows the drivers in the stack. The first driver in the list is the next one to
be pushed from the stack, or executed. In many cases, the first driver is the offending driver. When
the base address of the first driver is close to the fourth value at the top of the screen (the
instruction address that attempted access), you can hypothesize that the driver might have caused
the problem when it was initializing and being pushed off the stack. In Screen 1, the number in the
DLL load base (000002fe) is very near to the fourth value (00000000) at the top of the screen.
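The comparison of base addresses can be sketched in Python. This is one possible reading of the heuristic, not a definitive method: it assumes you have copied the base addresses and driver names from the screen into a list, and it picks the driver with the highest base address that does not exceed the instruction address. The blue screen does not show how large each driver is, so the result is only a candidate. The function name likely_module and all module names and addresses below are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Module = Tuple[int, str]  # (base address, driver name)


def likely_module(modules: List[Module], address: int) -> Optional[Module]:
    """Return the module with the highest base address <= address.

    Returns None when the address lies below every listed base.
    """
    ordered = sorted(modules)
    bases = [base for base, _ in ordered]
    index = bisect_right(bases, address)
    if index == 0:
        return None
    return ordered[index - 1]


if __name__ == "__main__":
    # Hypothetical module list; names and addresses are invented.
    loaded = [
        (0x80100000, "ntoskrnl.exe"),
        (0x80400000, "hal.dll"),
        (0xFB000000, "example.sys"),
    ]
    candidate = likely_module(loaded, 0x80114738)
    if candidate is not None:
        base, name = candidate
        print(f"candidate: {name} (base 0x{base:08X})")
```

Treat the candidate as a starting point for investigation, for the reasons the next paragraph gives.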
Not all blue screens are easy to read. In this example, the problem driver might still be a
driver listed in the middle of the screen, even though the screen shows that it initialized
correctly, or the driver might not be on the screen at all (in the case of a bad controller card).
Or something other than a driver might have caused the problem. When you can't easily find the
problem, you need to go to the next step: debugging.
How to Debug
Let's assume that you have determined that the cause of the kernel STOP is an installed device
driver and not a hardware problem. Now what? Well, it depends. If you are not the developer of the
driver, you probably want to save the NT image information, and let someone else figure out what
happened. This approach is called noninteractive debugging. If you are the developer and you have
the source code, you can use the kernel debugger that NT provides to step through the driver code.
This technique is interactive debugging.
Noninteractive debugging. NT gives you the option of saving the image of the
operating system (at the time of the kernel STOP error) to your hard disk. You can use this
information to determine the cause of the problem. To save the NT image to disk, go to the Control
Panel, System applet (Screen 2 shows the NT 3.51 setup, and Screen 3 the NT
4.0 setup). You need to be an Administrator to access the options. You can write the event to the
system log so you can view the error in the Event Viewer. This option is handy, because if you set
your system to reboot automatically after a kernel STOP error, the condition may go unnoticed. You
also have the option to have the system send an administrative alert. This alert is useful, for
example, when the server has the kernel STOP error and you are working someplace where you can't see
the server screen.
The next option lets you write the memory dump file to %SystemRoot%\MEMORY.DMP. Note that the
size of the image file is roughly the size of your physical RAM. Therefore, if you have 128MB of
RAM, your dump file will be 128MB! You can select the option to overwrite the existing file, if one
already exists. The last option is to set the system to automatically reboot. If you elect to save
the image file and to reboot, the process may take a while, depending on RAM size. I have seen this
process take more than 20 minutes, so be patient.
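Because the dump file is roughly as large as physical RAM, it is worth confirming that the target drive has room before you turn the option on. The following Python sketch shows the arithmetic. The RAM size is passed in by hand, and the function names dump_fits and check_drive are assumptions for this example.

```python
import shutil

MB = 1024 * 1024


def dump_fits(ram_mb: int, free_bytes: int) -> bool:
    """True if a dump of about ram_mb megabytes fits in free_bytes."""
    return free_bytes >= ram_mb * MB


def check_drive(path: str, ram_mb: int) -> bool:
    """Compare the free space on the drive holding path with the RAM size."""
    return dump_fits(ram_mb, shutil.disk_usage(path).free)


if __name__ == "__main__":
    # "." stands in for the drive that holds %SystemRoot%.
    print("room for a 128MB dump:", check_drive(".", 128))
```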
You configure the administrative alerts described earlier in the Control Panel, Server applet.
The Windows NT Server and NT Workstation CD-ROMs contain some tools to help you with this
memory file. dumpflop.exe writes the memory file to floppies (a 32MB memory file fits on about 10
disks). Unfortunately, Microsoft does not accept the memory file on any other medium. Once you have
created the dump file, you can make it available to a Microsoft Product Support Specialist either by
sending the floppies to Microsoft or by preparing a Remote Access Service (RAS) connection for
Microsoft Product Support to dial in and view the file contents remotely. Or you can submit the file
to Microsoft over the Internet by connecting to ftp.microsoft.com and copying the file to
/transfer/incoming/bussys/winnt.
You can use another utility, dumpchk.exe, to examine the integrity of the dump file and verify
that the system created the file correctly. With dumpchk, you can view basic information about the
dump file, such as which NT version was running and the STOP error codes.
Another useful utility is dumpexam.exe, which converts the memory file into a readable text
file. You need three files to run dumpexam: dumpexam.exe, imagehlp.dll, and for the Intel platform,
kdextx86.dll (the third file depends on the platform). The three files must be in the same
directory. You can find them on the NT Server or NT Workstation CD-ROM in the
directory \support\debug\<platform>, where platform is i386, alpha, mips, or ppc.
The noninteractive debugging method is ideal for users who don't want to debug the driver, but
just want to figure out which one is at fault. To run dumpexam, you need to load the symbol files,
which contain NT system debugging information. Make sure that the symbol files are for the version
of NT you're running, including any installed service packs. For the Intel version of NT, the symbol
files are in the \support\debug\i386\symbols directory on the NT resource kits' CD-ROMs.
Figure 2 shows the syntax for dumpexam. For example, if you want to analyze a dump file for a
computer with NT Workstation 4.0, the symbols are in the directory d:\symbols. The dump file,
server.dmp, is in the directory d:\dump. The command line reads
dumpexam -y d:\symbols d:\dump\server.dmp
The results of the exam will be in %SystemRoot%\MEMORY.TXT.
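The preparation steps for dumpexam can be sketched in Python: check that the three files sit in one directory, then assemble the command line shown above. This is a sketch under stated assumptions, not part of the NT tools. The function names are hypothetical, the file list is the Intel one given in the article, and the only switch used is the -y switch from the example command.

```python
import os
from typing import List

# The three files the article lists for the Intel platform.
REQUIRED_FILES = ("dumpexam.exe", "imagehlp.dll", "kdextx86.dll")


def missing_dumpexam_files(tool_dir: str) -> List[str]:
    """Return the required files that are not present in tool_dir."""
    # Compare in lowercase because file names may be stored in any case.
    present = {name.lower() for name in os.listdir(tool_dir)}
    return [name for name in REQUIRED_FILES if name not in present]


def dumpexam_command(symbols_dir: str, dump_file: str) -> List[str]:
    """Build the argument list for the command line shown in the article."""
    return ["dumpexam", "-y", symbols_dir, dump_file]


if __name__ == "__main__":
    print(" ".join(dumpexam_command(r"d:\symbols", r"d:\dump\server.dmp")))
```

The sketch only builds and prints the command. Running it is left to you, on a machine where the tools and the matching symbol files are installed.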
Interactive debugging. The other method of debugging is interactive debugging.
Device driver developers, rather than systems administrators, usually prefer interactive debugging
because the process requires extensive knowledge of NT internals.
Interactive debugging requires you to have another PC (a host machine) with NT installed and to
run the kernel debugger on the host machine. The host must be running the same version of NT as the
target machine and must be connected to the problem computer via a modem or null-modem cable
connection.
Is the System Safe?
Can you safely use the system after a kernel STOP error? The answer depends on whether you can
isolate what caused the problem. I've seen cases where the error condition happens once, never to
repeat again. In other cases, however, the error occurs after the user has installed or updated a
driver. In this case, you need to remove the driver and start over. When you get the kernel STOP
error, reboot the system, and hit the space bar when you see the Last Known Good text. This action
starts NT with the last known working configuration, without the offending driver. The option of
reverting to the last known working configuration reinforces the wisdom of installing one driver at
a time and making sure that the driver works before you install another driver. If the driver
doesn't work correctly, you can revert to the previous working configuration. If you install two or
more drivers at the same time and one of them causes a problem, you will have trouble determining
which driver caused the problem.
If the problem is not related to a driver, look at new system hardware, such as a new
controller card. To determine whether a controller card is the problem, remove the card and test the
system again. If the problem goes away, check whether the card (or any new hardware you add to your
system) is in the Microsoft Hardware Compatibility List (HCL). TechNet and Microsoft's Web site
(http://www.microsoft.com) have an up-to-date list.
The Bottom Line
I hope that, after reading this article, the blue screen won't intimidate you. Using the
techniques I've explained, you can find out, in general, why you got the screen and perhaps tell
specifically what caused the problem. If you aren't a device driver developer and don't want to deal
with interactive system debugging, noninteractive is the way to go. Your goal is to get your system
up and running as fast as possible. Isolating the problem is the first step. For additional
resources, see the sidebar, "Other Sources of Help."
Other Sources of Help
Microsoft TechNet CD-ROMs
contain the Microsoft Knowledge Base (the same library that Microsoft Product Support
Specialists use), resource kits, and educational materials.
Microsoft Network (MSN) and CompuServe
have several Microsoft forums where you can post questions and obtain answers.
Microsoft Download Library (MSDL)
is an electronic bulletin board system (BBS) from which you can download drivers and other
software. The phone number for MSDL is 206-936-6735.
Microsoft Web Site (http://www.microsoft.com)
contains product information, drivers, service packs, and more.
As the article notes, if the problem driver is a Microsoft driver, you can send the dump file to
Microsoft for analysis. If the driver is from a third-party manufacturer, you can send the memory
file to that manufacturer.
I was excited to see Mark T. Edmead’s “The Blue Screen of Death” in your June issue. As a consulting partner in a firm that works on nothing but NT internals, I figure a lot of our clients can learn from an article on NT’s infamous blue screen of death.
The only problem is that much of what they’ll learn from this article is wrong. The article is absolutely rife with technical errors. For example, the second sentence in the section headed “What Does This Weird Screen Mean?” reads, “The kernel STOP may mean that a kernel driver ... has illegally accessed the privileged kernel area.” This statement is very close to meaningless, and any meaning I can attach to it is wrong. Kernel drivers are privileged (i.e., they run in kernel mode) and have full access to the kernel area.
Another example? Let’s take Table 1: Kernel Mode Error Conditions. The first entry for IRQL_NOT_LESS_OR_EQUAL tells the reader, “A process attempted to access pageable memory at a process internal request level (IRQL) that was too high.” That statement is one reason for getting this error. But it’s not the reason.
The text continues, “A process can access only objects that have priorities (IRQL) equal to or lower than its own.” This statement is nonsense. Objects don’t have IRQLs. IRQLs are not typically associated with processes (with one exception). The IRQL indicates the CPU state at any point in time, relative to that CPU’s interruptability, preemptability, and dispatchability. That is, the IRQL identifies which devices can interrupt the CPU, whether at the end of a quantum the current thread will be rescheduled, and whether any scheduling operations at all are allowable.
Yikes! I found plenty of other technical problems in the article, too.
I’ve spent lots of time reading blue screens and debugging drivers, so I recognized the problems with this article. How many of your readers can say the same?
--Peter G. Viscarola
Thanks, Peter, for pointing out these problems. We’ll be very careful not to let such errors slip by in the future, and we apologize for any inconvenience these errors may have caused anyone.
--Karen Forster
I am looking forward to reading the Blue Screen of Death article because I am having a problem with the KERNEL and DLL. At a quick glance I see the article refers to this problem in connection with using Windows NT. I am not using Windows NT and am using Windows ME 2000 on something called an etower by a company called emachines. I will read the article and try to understand it before I start screaming for more help! I've been using a computer and the Internet for several years but still have trouble learning what I need to know to keep it working in good order. I do appreciate the Windows & .NET Magazine and really appreciate the response I receive from your tech support person. With this site on my "favorites" I hope to slowly gain some of the knowledge I need. Thanks. RA
I have done all the steps mentioned and obtained memory.txt, but the file is 0MB while the actual user.dmp is 4MB. I need this user.dmp file in a readable format. Please guide me.
Thanks in advance
kmk