Lessons you can learn about SP2 and being prepared before it's too late

Have you had a network disaster yet? My company has had a few, and we learn from each one. Our latest taught us some things you might want to know so that you can avoid a disaster of your own.

Our network follows a small master-resource domain model. The master domain, ORION, stores all the user accounts. ORION has only one server, BETELGEUSE, which is the Primary Domain Controller (PDC) for our 31-user network. BETELGEUSE also serves as our Windows Internet Name Service (WINS) server, Dynamic Host Configuration Protocol (DHCP) server, and Domain Name System (DNS) server for our Internet domain mmco.com. ORION has no Backup Domain Controller (BDC). Sure, I ought to know better, but adding one was still on my list of things to do. We also have a resource domain, TAURUS, which has two main servers: ALDEBARAN (a PDC file server that also runs the mail system) and ELNATH (a BDC that's also a print server and our SQL Server). So, to summarize, we have 2 domains, 3 servers, and 31 user accounts.

One Friday evening, I installed Service Pack 2 (SP2) for Windows NT 4.0 on all our NT servers. Everything went well on ELNATH and ALDEBARAN. But when I put SP2 on BETELGEUSE and rebooted, I got the following message:

Windows NT could not start because the following file is missing or corrupt:

<winnt root>\system32\ntoskrnl.exe

Please reinstall a copy of the above file.

Arrgh! Service packs strike again! Unfortunately, I've seen service packs zap servers before. But how to fix it? I'll spare you the litany of steps I took (an NT repair, a reinstallation, etc.) and instead offer you the lessons I learned. In some cases, I already knew the lessons, but I was painfully reminded of them; as Boswell reports that the good Doctor Johnson observed, "Experience is a hard college, but it is the only one that fools will attend." Hey, call me magna dumb laude; how many network administrators perform a major software upgrade on an important server without first backing it up? (Well, at least one, unfortunately.)

LESSON 1
All NT Servers Need a CD-ROM Drive
BETELGEUSE, the master domain PDC, is an old 33MHz 486 with 32MB of RAM. We keep it around only because it's been reliable (except for this case, of course). Additionally, the machine doesn't have to do that much--handling logons and DHCP, WINS, and DNS requests isn't very challenging for a network of our size. BETELGEUSE's CPU utilization level is almost never above 25 percent, so we didn't have a compelling reason to replace the machine. The server has never had a CD-ROM drive because the drives were expensive when we bought the 33MHz 486 years ago. Our NT installations have always used i386 and winnt/b or winnt32/b, so we never needed the NT CD-ROM.

However, if you want to run a complete NT repair you need a CD-ROM drive. (If you're not familiar with doing an NT repair, you just boot from the NT setup floppies like you did to initially install NT. When you get to the blue Welcome to NT Setup screen, press R for repair instead of hitting Enter. Setup then guides you through the repair options. For information on performing an NT repair with an emergency repair disk, see Michael D. Reilly, "The Emergency Repair Disk," January 1997.)

One of my initial thoughts about how to fix BETELGEUSE was to run an NT repair and let the repair restore the system files to their pre-SP2 state. I couldn't do that without a CD-ROM drive. (Because the system crash happened late at night, I couldn't go buy a CD-ROM drive even if I wanted to. Ultimately, I borrowed one from another system, but that took time and isn't a good idea anyway--why create more potential failures when you already have a network problem?)

LESSON 2
Never Make Your Boot Drive an NTFS Partition
If I couldn't do an NT repair, I figured I'd re-install NT. I had a copy of the i386 directory on my hard disk (my C drive), so I started a winnt/b installation. (In case you don't know, winnt/b is a DOS-based program that initializes an NT setup. It's one way to install NT on a system without a CD-ROM drive.) After 90 minutes of file copying (remember, this is a 33MHz 486), the system rebooted. At this point, I expected to see a boot.ini list that included NT 4.0 Installation/Upgrade..., but I didn't. Instead, I saw the original boot.ini list--NT never tried to reinstall itself, it just showed me the old message that ntoskrnl.exe was missing or corrupt. What happened?

Then I remembered. I'd formatted the C drive using NTFS ages ago. I knew that having a C drive formatted as NTFS can cause heartburn when you try to repair a system, but I hadn't had time to reformat the drive before the crash. However, what I didn't realize was how dangerous having an NTFS C drive could be. BETELGEUSE has two physical local drives (C and D). The first physical drive is formatted as NTFS, and NT sees it as drive C. The second physical drive is formatted as FAT, and NT sees it as drive D. However, when I booted DOS to access the FAT drive, DOS saw the second physical drive as drive C, not drive D, because DOS can't see NTFS volumes.

Winnt/b is a DOS-based program, so it overlooked the first physical drive (which I'd formatted with NTFS) and instead wrote boot.ini to drive D thinking it was drive C. Because winnt/b wrote boot.ini to the second physical drive (drive D from NT's perspective), NT ignored it. The result was pretty scary. BETELGEUSE had no CD-ROM drive, and its first logical drive was formatted as NTFS, so I couldn't reinstall NT except by using winnt32--and you can run winnt32 only from inside NT!
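The drive-letter mix-up is easier to see in a toy model. The Python sketch below is only an illustration, not a description of how DOS or NT actually enumerate volumes: it assumes letters are handed out in physical-disk order starting at C, and that an operating system skips any volume whose file system it can't read. The function name and disk labels are hypothetical.

```python
def assign_letters(disks, readable):
    """Map drive letters to disks, skipping unreadable file systems.

    disks    -- list of (name, filesystem) tuples in physical order
    readable -- set of file systems the operating system understands
    """
    letters = {}
    next_letter = ord("C")
    for name, filesystem in disks:
        if filesystem not in readable:
            continue  # volume is invisible to this OS, so it gets no letter
        letters[chr(next_letter)] = name
        next_letter += 1
    return letters


# BETELGEUSE's layout as described in the article.
disks = [("first physical drive", "NTFS"), ("second physical drive", "FAT")]

nt_view = assign_letters(disks, readable={"NTFS", "FAT"})
dos_view = assign_letters(disks, readable={"FAT"})

print("NT :", nt_view)
print("DOS:", dos_view)
```

Under these assumptions, the drive that a DOS-based program calls C is the one NT calls D, which is why a boot.ini written from DOS never reaches the volume NT boots from.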

In general, the rule for mixing drive formats is to put FAT volumes on lower drive letters and non-FAT file systems such as HPFS and NTFS on higher drive letters. I recommend setting up an NT server with a FAT-formatted C drive of about 300MB. This size is large enough to store i386, a complete NT Server installation, and a pagefile. Put your other application programs, user data, and home directories on NTFS volumes.


LESSON 3
Always Have a BDC
Once again, no rocket science in having a BDC. Consider what I'd have had to do if I couldn't get BETELGEUSE running. I could easily rebuild WINS--just install WINS Server, and it's pretty self-healing. DHCP is more trouble because I'd have to re-create the DHCP scope (a range of IP addresses that the DHCP server can give out). The hardest part of rebuilding DHCP is remembering which IP addresses the DHCP server can't give out. Rebuilding the DNS server would require more work because I'd have to re-create all the zones and re-enter the names of the machines I wanted on the DNS list. Troublesome, but not the end of the world.
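One way to avoid relying on memory for the exclusions is to keep the scope written down in a small script or file that travels with your backups. The Python sketch below is a hypothetical example: the address range and excluded addresses are invented placeholders from the 192.0.2.0/24 documentation range, not our real scope, and the helper name is made up for illustration.

```python
import ipaddress

# Hypothetical scope: placeholder values, not taken from the real network.
SCOPE_START = ipaddress.IPv4Address("192.0.2.10")
SCOPE_END = ipaddress.IPv4Address("192.0.2.60")

# Addresses the DHCP server must never hand out (placeholders as well).
EXCLUDED = {
    ipaddress.IPv4Address("192.0.2.10"),  # e.g. a server with a static address
    ipaddress.IPv4Address("192.0.2.11"),  # e.g. another statically addressed server
    ipaddress.IPv4Address("192.0.2.50"),  # e.g. a network printer
}


def leasable_addresses(start, end, excluded):
    """Return the addresses in [start, end] that are not excluded."""
    pool = []
    for value in range(int(start), int(end) + 1):
        address = ipaddress.IPv4Address(value)
        if address not in excluded:
            pool.append(address)
    return pool


pool = leasable_addresses(SCOPE_START, SCOPE_END, EXCLUDED)
print(f"{len(pool)} leasable addresses, first {pool[0]}, last {pool[-1]}")
```

With a record like this on hand, re-creating the scope on a replacement server is a matter of retyping a few values rather than reconstructing them from memory.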

The most difficult function to restore would be BETELGEUSE's function as the PDC. I'd have to rebuild the trust relationships and user accounts, and reset permissions on the home directories--the process would be a mess. A BDC removes these worries; if the PDC crashes, you can promote the BDC to PDC. ORION didn't have a BDC because we already had three servers for about 30 people, and setting up another server was a project for the future. However, because we had lots of NT workstations on people's desks, I should have purchased an extra copy of NT Server and put it on one of the administrator PCs as the desktop operating system instead of NT Workstation. The BDC function would use a little extra RAM and CPU power, but probably not a noticeable amount (I recommend this solution to other small network administrators).

LESSON 4
Always Have Backups of the Right Files
As an author, I can't afford to lose any of the books, magazine articles, PowerPoint presentations, or programs I've written, so I have an elaborate backup scheme with rotating offsite backups. I realized after the fact that this backup scheme was good only on the resource domain, TAURUS, which is where I store most of my valuable information. I thought the backups included ORION, but I never checked. Our system crash is the perfect example of why network disaster recovery experts tell users to have a dry run now and then.

LESSON 5
If You Haven't Learned Any Other Lesson, Prepare to Be Inventive
By about midnight, I was starting to feel some time pressure. Our DNS entries have an eight-hour expiration time, and our Internet Service Provider (ISP--which pulls DNS zones from our DNS server and uses them to resolve outside DNS requests) hadn't heard from BETELGEUSE (our DNS server) since about 7:00 pm when the server crashed. People were still able to access our Web server and mail server, but our ISP's DNS server's entries for our Web site would expire in three hours, which would bounce would-be Web surfers and people sending us email.
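The deadline arithmetic is simple enough to check in a few lines. This Python sketch uses the times given above; the calendar date is an arbitrary placeholder chosen only so the calculation can cross midnight.

```python
from datetime import datetime, timedelta

# Times from the article; the calendar date itself is a placeholder.
last_contact = datetime(1997, 1, 3, 19, 0)  # about 7:00 pm, when the server crashed
expiration = timedelta(hours=8)             # eight-hour expiration time
now = datetime(1997, 1, 4, 0, 0)            # about midnight

deadline = last_contact + expiration        # when the ISP's copy expires
remaining = deadline - now                  # time left to get DNS back

print("Entries expire at:", deadline.strftime("%H:%M"))
print("Hours remaining:", remaining.total_seconds() / 3600)
```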

I had borrowed a CD-ROM drive from another machine by now, and I'd installed NT Workstation on BETELGEUSE mainly because of the time factor--on this slow computer, NT Workstation would take about an hour to load vs. a likely two hours for NT Server. I installed NT Workstation to a new directory, and when it booted, I was able to access the NTFS drive on BETELGEUSE. I copied the contents of the server's \winnt\system32\config directory to another server because that directory contains the Registry.

I found another computer that we had set up as a test machine and PDC for a test domain. Then I did a PDC brain transplant. First, I renamed the computer to BETELGEUSE and its domain to ORION. Second, I rebooted the server under DOS (I had installed this computer with a FAT-formatted C partition) and copied the SAM, SAM.LOG, SECURITY, and SECURITY.LOG files from BETELGEUSE to the new machine. I then booted NT Server and, lo! All the user accounts, machine accounts, and trust relationships were intact. I rebuilt WINS, DHCP, and DNS by hand, which wasn't much trouble, and I was able to get the three of them running in about 40 minutes. I could have recovered WINS, DHCP, and DNS from my old BETELGEUSE machine, but rebuilding them by hand was faster.
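The file-copy step of the transplant can be sketched in Python. This is only an illustration under stated assumptions, not the procedure I used that night (I copied the files by hand from DOS): it assumes both config directories belong to installations that are not currently running, so the files aren't in use, and the function name and the throwaway demonstration directories are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

# The four files named in the article.
HIVE_FILES = ["SAM", "SAM.LOG", "SECURITY", "SECURITY.LOG"]


def copy_account_hives(source_config, destination):
    """Copy the account-database files from one config directory to another.

    Checks that all four files exist before copying anything, and raises
    FileNotFoundError if any are missing. Returns the names copied.
    """
    source_config = Path(source_config)
    destination = Path(destination)

    missing = [name for name in HIVE_FILES if not (source_config / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing from {source_config}: {missing}")

    destination.mkdir(parents=True, exist_ok=True)
    for name in HIVE_FILES:
        shutil.copy2(source_config / name, destination / name)
    return list(HIVE_FILES)


# Demonstration with throwaway directories and dummy file contents.
with tempfile.TemporaryDirectory() as scratch:
    old_config = Path(scratch) / "old" / "config"
    old_config.mkdir(parents=True)
    for name in HIVE_FILES:
        (old_config / name).write_bytes(b"placeholder")

    new_config = Path(scratch) / "new" / "config"
    copied = copy_account_hives(old_config, new_config)
    restored = sorted(path.name for path in new_config.iterdir())

print("Copied:", restored)
```

Checking for all four files before copying any of them avoids leaving the target machine with a mismatched set of account files.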

To get the DNS info, I got a complete listing of our current mmco.com zone by running nslookup, a utility that comes with NT 4.0 (and is also available from other sources), from the command line. I then typed

server 199.34.57.47
ls -d mmco.com

The first line told nslookup to request DNS information from the server at 199.34.57.47, which is my secondary DNS server (remember, the DNS server had my mmco.com information cached locally, and I had an hour or two left before it evaporated). The second line asked nslookup to tell me everything it knew about the mmco.com domain. I printed the results and re-keyed them into the new BETELGEUSE.

One final note: As I expected, WINS healed itself nicely, but I was worried about DHCP. We have our DHCP leases set to seven days, so several machines had seven-day leases that the new DHCP server knew nothing about. I wondered what would happen when these machines attempted to reboot or extend their leases. I'd experimented with this scenario under NT 3.5, and the results weren't very promising. But apparently the DHCP server under NT 4.0 is smarter. When a PC attempts to renegotiate a lease that the DHCP server doesn't know about, the DHCP server sends a NAK (an "I'm refusing your request") message to the PC and then immediately offers the PC the same IP address it previously had! DHCP healed itself and none of our users were affected. So perhaps Lesson 6 is that doing a PDC brain transplant isn't that hard, and that DHCP and WINS are pretty much self-healing. But most important is Lesson 7: Don't put off disaster recovery until you have time--or you'll end up having to play network MacGyver, and believe me, it's no fun.