Tricks to prepare for and recover from NT meltdowns

What would you do if one of your core production servers crashed the next time you rebooted it? More important, how much time would you need to fix the problem? For most Windows NT administrators, the thought of a mission-critical production server experiencing STOP errors (aka the blue screen of death) or any form of server outage makes them break out in a cold sweat.

A hosed NT system is never fun, but an unavailable critical server means lost productivity, lost time, lost money, and, of course, an angry boss. In this first installment of a two-part article, I discuss advanced tools and procedures that you can use to improve the availability of your network servers and to increase your chances of recovering from an NT boot failure. In addition, I delve into lesser-known techniques that you can employ right away to help you recover a downed NT system in the future. In this article, I don't address clustering solutions, and I assume that each system is a standalone, nonclustered NT system without system-level failover.

Common Calamities
Although various circumstances can cause an NT system to crash at startup, the result of these circumstances is usually the dreaded blue screen of death, which Screen 1, page 100, exemplifies. After NT halts the system, it displays this screen to protect the system against data corruption. In addition to being blue as its name implies, a blue screen displays important information about the system's state at the time of the STOP error. The screen lists the STOP code, the location in memory where the problem occurred, and the drivers loaded in memory when the STOP took place. However, pinning down the source of a STOP error isn't always easy. In my experience, a problem usually develops from one of the following scenarios:

  • You install software that corrupts the HKEY_LOCAL_MACHINE portion of the Registry—particularly, software that installs new services or drivers. This action usually results in a STOP error or blue screen, which indicates that the system Registry or a particular hive file failed.
  • You change a system's network configuration, which causes NT to rewrite network bindings and their related Registry entries, or an installation corrupts or overwrites critical OS files with invalid or incompatible versions while the system is in use.
  • You install a new service or driver on the system, which causes a system-level incompatibility problem that results in a STOP error when you reboot, or a key system file that NT had already loaded into memory becomes corrupted on disk, so the damage surfaces only at the next reboot.

Each of these situations has a different set of underlying causes and solutions, so let's look at each scenario individually.

Registry Corruption
The system Registry is the heart of an NT installation. Thus, depending on the nature and extent of the damage, a corrupted Registry often results in a STOP error or blue screen of death at startup. Damage to the Registry can be physical or logical. Physical damage means that something (usually disk-related corruption) has scrambled the Registry hive files (e.g., the SOFTWARE or SYSTEM files in the \%systemroot%\system32\config folder). Logical damage means that a third-party application, a user, or NT has written invalid data to the Registry, which can trigger an NT startup failure if the logically damaged Registry entry is critical.

Unfortunately, you can't always tell whether a damaged Registry is the cause of your system's STOP error. The STOP error might identify a telltale sign such as a hard Registry error or a reference to a particular damaged hive file. However, in some cases, the STOP error doesn't indicate Registry damage.
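If you transcribe the top line of a blue screen, a few lines of script can pull out the STOP code and flag values that point at the Registry. The sketch below is illustrative Python rather than an NT tool. The function names are hypothetical, and the lookup table lists only two codes commonly associated with Registry failures, so treat a miss as inconclusive rather than as proof that the Registry is healthy.

```python
import re

# Illustrative, deliberately incomplete table of Registry-related STOP codes.
REGISTRY_RELATED = {
    0x00000051: "REGISTRY_ERROR",
    0xC0000218: "STATUS_CANNOT_LOAD_REGISTRY_FILE",
}

# Matches "STOP: 0x0000000A (p1,p2,p3,p4)" and the bare "STOP: c0000218" form.
_STOP = re.compile(r"STOP:\s*(?:0x)?([0-9A-Fa-f]{8})(?:\s*\(([^)]*)\))?")


def parse_stop_line(line):
    """Return (code, params) from a transcribed STOP line, or None."""
    match = _STOP.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    raw = match.group(2) or ""
    params = [int(p.strip(), 16) for p in raw.split(",") if p.strip()]
    return code, params


def registry_hint(code):
    """Return the symbolic name if the code is in the table, else None."""
    return REGISTRY_RELATED.get(code)
```

For example, passing the line "STOP: c0000218 {Registry File Failure}" to parse_stop_line and then calling registry_hint on the resulting code returns the STATUS_CANNOT_LOAD_REGISTRY_FILE name, which is a strong hint to start with the hive files.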

If you suspect a Registry-related problem, the first line of defense is to restore a previous known-good Registry configuration. You can use several methods to accomplish this solution.

The Last Known Good Configuration option. You access this option by pressing the space bar when the system prompts you during the NT boot process, and selecting the option to restore a previous configuration. This method is the quickest and easiest solution, if it works. Unfortunately, this solution's failures outweigh its successes in real-world applications because its scope is only a previously known-good incarnation of one portion of the Registry (i.e., a ControlSet00X Registry subtree of the HKEY_LOCAL_MACHINE\SYSTEM key). You have a better chance of success using the Last Known Good Configuration option if the problem is localized to this portion of the Registry and an event that immediately precedes the invocation of the Last Known Good Configuration option caused the problem. However, this procedure won't cure most of your Registry-corruption ills.
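To see why this option's scope is so narrow, it helps to picture how NT tracks control sets. The Select subkey under HKEY_LOCAL_MACHINE\SYSTEM holds numeric values (Current, Default, Failed, and LastKnownGood) that each point at a ControlSet00X subtree. The following sketch simply translates those numbers into subtree names; it doesn't read a live Registry, the function names are hypothetical, and treating a value of 0 as "nothing recorded" is an assumption.

```python
def control_set_name(index):
    """Map a Select value such as 2 to the subtree name 'ControlSet002'."""
    if index <= 0:
        return None  # assumption: 0 means no control set is recorded
    return f"ControlSet{index:03d}"


def describe_select(select_values):
    """Translate {'Current': 1, ...} into {'Current': 'ControlSet001', ...}."""
    return {role: control_set_name(i) for role, i in select_values.items()}
```

Given Current=1 and LastKnownGood=2, the option swaps ControlSet002 in for ControlSet001 and touches nothing else, which is why damage outside that subtree survives the rollback.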

NT Setup's Repair process and an Emergency Repair Disk (ERD). You can use NT Setup's Repair process to inspect and replace individual Registry hive files if the Last Known Good Configuration option fails to resolve the problem. After you insert your ERD, Setup lists the options you can select to specify which portions of the NT installation you want Setup to inspect, as Screen 2 shows. If you select Inspect registry files, Setup displays a list of Registry hive files and lets you select which files you want Setup to replace. Setup takes the replacement files from the ERD or, if you didn't provide an ERD, from the \%systemroot%\repair folder. The ERD and the \%systemroot%\repair folder store replacement files in compressed format, and each hive file has an underscore (_) extension (e.g., SYSTEM._, SOFTWARE._).

Using the most recent replacement files is important so that you don't lose application and service configuration information. (For information about how to update your ERD, see Michael Reilly's "The Emergency Repair Disk," January 1997.) In addition, don't restore the SAM and SECURITY hives on an NT server domain controller, unless you used the rdisk /s (or /s-) option when you ran the ERD utility (i.e., rdisk.exe). Otherwise, Setup overwrites your SAM database with the database version Setup created during the original NT installation and creates a new set of problems. In addition, ensure that you created the replacement files under the same service pack level as the files you're replacing because Service Pack 3 (SP3) and later make security-related changes to the SAM and SECURITY hives. Otherwise, you might not be able to log on after the repair is complete. Restoring the SAM and SECURITY files usually won't resolve your Registry corruption problems anyway because the SYSTEM and SOFTWARE hives usually cause Registry boot problems. Thus, start restoring previous Registry files with the SYSTEM and SOFTWARE files, and replace the SYSTEM hive first because it contains references to important system components, including drivers and services.
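The rules in the preceding paragraph are easy to forget under pressure, so here is one way to write them down as a checklist. This is only a sketch of the reasoning above; the function and parameter names are hypothetical, and service pack levels are plain integers with 0 standing for the base release.

```python
def plan_hive_restore(is_domain_controller, rdisk_s_used, backup_sp, installed_sp):
    """Return (hives_in_restore_order, warnings) following the rules above."""
    order = ["SYSTEM", "SOFTWARE"]  # SYSTEM first: it references drivers and services
    warnings = []
    sam_ok = True
    if is_domain_controller and not rdisk_s_used:
        sam_ok = False
        warnings.append(
            "Domain controller without rdisk /s: the repair copy of SAM "
            "dates from the original installation."
        )
    if backup_sp != installed_sp:
        sam_ok = False
        warnings.append(
            "Service pack levels differ: restoring SAM and SECURITY "
            "might prevent logon."
        )
    if sam_ok:
        order += ["SAM", "SECURITY"]  # rarely the cause, so they go last
    return order, warnings
```

Even when the function allows SAM and SECURITY, try SYSTEM and SOFTWARE first and reboot between replacements so that you know which hive fixed the problem.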

An alternate/parallel NT installation. Using an alternate/parallel NT installation to recover the Registry is my favorite solution. Booting an alternate NT installation lets you access NTFS-based volumes on the system that would otherwise be inaccessible, and a parallel installation gives you access to the primary installation's Registry files so that you can repair or replace them. (You can also gain this type of access by using ERD Commander from Systems Internals at http://www.sysinternals.com or NTFSDOS from Winternals Software at http://www.winternals.com.) After you boot to an alternate installation, you can perform the same actions that you can perform using NT Repair, but with more flexibility and options. Although this method isn't the solution Microsoft recommends, I think it's the best Registry repair process for advanced NT users. (For more information about parallel NT installations, see the sidebar "Think Parallel.")

Before you begin, make a backup copy of the Registry files. I usually back up the existing files into a subdirectory of the folder that contains the Registry files (e.g., \%systemroot%\system32\config\backup). After you back up the files, you can experiment with replacing individual Registry hive files. However, you can't simply copy the replacement versions, because the ERD and \%systemroot%\repair folder store these files in compressed format. To use the files, employ the expand.exe command to manually expand them. For example, to expand a compressed copy of the SYSTEM hive from an ERD or the \%systemroot%\repair folder, type the following command at an NT or DOS command prompt:

expand system._ system

Copy the resulting file to the \%systemroot%\system32\config folder of the primary installation, and reboot the system.
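If you repeat this procedure on several servers, scripting the backup step and generating the expand commands can save typing. The following Python sketch assumes you run it against the primary installation's folders from a parallel installation (or any machine that can see the files). The function names and the list of hives are illustrative, and the script only prints expand commands in the form shown above rather than running them.

```python
import shutil
from pathlib import Path

HIVES = ("system", "software", "sam", "security")


def backup_hives(config_dir, backup_dir, hives=HIVES):
    """Copy each existing hive file into backup_dir; return the names copied."""
    config, backup = Path(config_dir), Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in hives:
        source = config / name
        if source.is_file():
            shutil.copy2(source, backup / name)  # keeps timestamps
            copied.append(name)
    return copied


def expand_commands(repair_dir, hives=("system", "software")):
    """Build 'expand name._ name' for each compressed hive found in repair_dir."""
    repair = Path(repair_dir)
    return [
        f"expand {name}._ {name}"
        for name in hives
        if (repair / f"{name}._").is_file()
    ]
```

Calling backup_hives with the config folder and a backup subfolder before touching anything gives you a way back if a replacement hive makes matters worse.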

If you don't want to deal with compressed files, you can use the Microsoft Windows NT Server 4.0 Resource Kit regback.exe utility to maintain extra copies of the Registry. This handy tool makes a backup that contains all the system Registry hive files in uncompressed format. In addition, this tool automatically backs up the SAM and SECURITY hives, so you don't have to worry about using special switches. However, regback.exe's uncompressed Registry copies consume a lot of space and might not fit on a 3.5" disk. The safest place to store regback.exe-created Registry backups is on a partition other than the NT boot partition—preferably a partition on a different physical hard disk. For maximum protection against hardware-related failures that render the Registry hive files inaccessible, store an extra copy of each server's Registry on a different system.
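Keeping extra copies in several places is useful only if the copies are intact. As a rough sketch, the Python below mirrors a folder of uncompressed hive backups (such as one regback.exe produced) to additional destinations and compares each copy byte for byte. The function name is hypothetical, the destinations can be any paths your system can reach, and the script doesn't invoke regback.exe itself.

```python
import filecmp
import shutil
from pathlib import Path


def mirror_backup(source_dir, destinations):
    """Copy source_dir's files to each destination; return {destination: [bad files]}."""
    source = Path(source_dir)
    names = sorted(p.name for p in source.iterdir() if p.is_file())
    report = {}
    for dest in destinations:
        dest = Path(dest)
        dest.mkdir(parents=True, exist_ok=True)
        bad = []
        for name in names:
            shutil.copy2(source / name, dest / name)
            # shallow=False forces a full content comparison
            if not filecmp.cmp(source / name, dest / name, shallow=False):
                bad.append(name)
        report[str(dest)] = bad
    return report
```

An empty list for every destination means each copy matched its source at the time of the check.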

Overwritten or Corrupted Files
One of NT 4.0's serious downfalls is its use of shared system files, which third-party application vendors can freely overwrite with out-of-date or otherwise incompatible support files. In addition, NT doesn't do much to protect itself against the replacement of other key system files, such as system services' files and drivers. In some cases, these conflicts are merely annoying because they cause unwanted errors or application failures. However, this type of problem can result in the inability to start NT. (Windows 2000—Win2K—removes some of this risky exposure by privatizing application DLLs and providing greater protection from overwriting critical system files.)

To repair damaged or incompatible files on an NTFS volume, you can use a parallel NT installation or NT Setup's Repair process. To repair FAT volumes, you can use a DOS or Windows 9x boot disk to access the volume.

Replacing files from a parallel installation is easier if you know which files are invalid or damaged. As a disaster-prevention measure, create an installation source on your hard disk or a CD-ROM that contains copies of the latest core NT system files for the service pack on your system. If you're running a parallel NT installation that you patched to the same service-pack level as the primary installation, you can use that installation as your source. However, if your parallel installation isn't the same service-pack level as your primary installation, create a separate directory that contains the latest versions of the primary installation's files.
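When you don't know which files are damaged, comparing the primary installation against your known-good source narrows the search. The sketch below is one possible approach, not an NT feature: it hashes every file in a reference folder and reports same-named files in the primary folder that are missing or different. The function names are hypothetical, and the comparison is meaningful only if the reference folder matches the primary installation's service pack level.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_suspect_files(primary_dir, reference_dir):
    """List (name, reason) for reference files missing or changed in primary_dir."""
    primary, reference = Path(primary_dir), Path(reference_dir)
    suspects = []
    for ref in sorted(reference.iterdir()):
        if not ref.is_file():
            continue
        candidate = primary / ref.name
        if not candidate.is_file():
            suspects.append((ref.name, "missing"))
        elif file_digest(candidate) != file_digest(ref):
            suspects.append((ref.name, "differs"))
    return suspects
```

A file flagged as "differs" isn't necessarily bad (a hotfix might have updated it), but the list gives you a short set of candidates to replace one at a time.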

To use NT Setup's Repair process to replace damaged or conflicting files, select the Verify Windows NT system files option when Setup presents you with the list of repair options. Microsoft intended this feature to let you quickly identify files that are different from the original NT installation files. However, on an NT installation to which you've applied a service pack, Setup lists most files as nonoriginal because the service pack has modified them. Thus, your best bet is to instruct Setup to replace all nonoriginal files by selecting the A option, and then to reapply the latest service pack after NT is back up and running.

Alternatively, you can replace NT system files with original versions using NT Setup's upgrade option to reinstall NT. Although some users circumvent the previous NT Setup Repair process and jump into an upgrade installation, I don't recommend this solution for several reasons. First, the upgrade process usually takes much longer than the repair process. Second, the upgrade process is more involved and poses greater risks to your system. Finally, even if an upgrade installation successfully resolves your original problem, it will probably cause a tcpip.sys blue screen error (i.e., STOP error 0x00000050). When you install NT 4.0 or NT 4.0 SP1 over NT 4.0 SP2 or later, the installation doesn't replace the SP2 or later version of tcpip.sys. Thus, the newer driver fails when it loads under the base version of NT or NT SP1. To avoid this mess, first use the NT Setup Repair process' Verify Windows NT system files option to replace the existing files with the original versions. If NT Setup's Repair process doesn't resolve the boot problem, you can run the NT Setup upgrade option without fear of the tcpip.sys blue screen, because NT Setup's Repair process has replaced the SP2 or later version of tcpip.sys with the original version.
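The ordering argument above boils down to one condition and a fixed sequence of steps. The following sketch captures it in Python purely as an aid to memory; the function names are hypothetical, and service pack levels are integers with 0 standing for the base NT 4.0 release.

```python
def tcpip_mismatch_risk(installed_sp, media_sp):
    """True when an upgrade would leave an SP2-or-later tcpip.sys on older system files."""
    return installed_sp >= 2 and media_sp <= 1


def recovery_steps(installed_sp, media_sp):
    """Return the recommended order of operations as a list of strings."""
    steps = ["Run NT Setup Repair with Verify Windows NT system files"]
    if tcpip_mismatch_risk(installed_sp, media_sp):
        steps.append(
            "Only if Repair fails: run the NT Setup upgrade "
            "(Repair has already restored the original tcpip.sys)"
        )
    else:
        steps.append("Only if Repair fails: run the NT Setup upgrade")
    steps.append("Reapply the latest service pack after NT boots")
    return steps
```

Either way, Repair comes first; the risk check only explains why skipping it is dangerous on a system at SP2 or later.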

An Ounce of Prevention
The difference between a quick fix and a major nightmare is often one preparatory step. Tools, such as parallel NT installations and additional backup copies of the Registry, improve your chances of resolving NT startup failures. Therefore, be sure that your servers are always prepared for the worst.

Next month, I'll discuss the third most common cause of NT startup blue screens: an autostarting service or driver that causes a STOP blue screen when it initializes. I'll teach you about some additional recovery tricks, including a method for remotely repairing the Registry of a failed installation from within a parallel NT installation. In addition, I'll show you third-party tools that can bail you out of trouble when a system won't boot.