Injuries, but no fatalities

The transition to 2000 will have pretty much played out by the time you read this article. Most of the problems that were going to occur happened right away. And based on the smooth transition from December 31, 1999, to January 1, 2000, I doubt you've had too many problems with the Windows NT workstations and servers in your network.

I experienced a minor Y2K bug that conjured up something I hadn't considered. One of my servers, a first-generation 233MHz Pentium II machine running the release to manufacturing (RTM) Windows 2000 (Win2K) code, rolled over from December 31, 1999, to January 8, 1601, and I didn't even notice. Before you castigate me for not being observant, or wonder how I could miss this in a network with only a dozen machines, let me explain the scenario that brought the problem to my attention.

I use this server only for data storage. The server has 80GB of hard disk capacity and is the primary storage location on my network for installation software (when I test software, I keep a copy of the installation and setup files). I also store music (20GB or more of Windows Media Audio—WMA—files), many multimegabyte image files, and a bunch of other items I use for testing. The server runs a network management client that alerts me to problems, so if the server goes down, I should receive an alert. But aside from these uses, I don't pay too much attention to the box. The server runs effectively headless (i.e., without a monitor). I use a keyboard/video/mouse (KVM) switch to share a monitor and keyboard between this server and four other servers, and I ordinarily keep the monitor turned off or switched to another server that I use for testing. The server had been running NT 4.0 for months and Windows 2000 Server (Win2K Server) for a couple of weeks without difficulties. No problems, no alerts, no random blue screen of death—the server worked fine.

I have three other servers of similar vintage and configuration, and none of them had any problems rolling over to 2000. So why would I worry about this particular box when it was still up and I could read files from it? I even had some music playing from the server when the New Year rolled in.

My complacency comes from my experience as a Novell NetWare systems administrator. In those days, I didn't think twice about servers because they stayed up for months, or even years. When rolling out NetWare 4.0, I still had NetWare 2.x boxes providing file and print services that had never been down except for memory and storage upgrades. My personal best for a lightly used NT server is 11 months of uptime, so I've also had NT boxes stay up and stable for fairly long periods of time. And these servers, like my server with the Y2K bug, ran only file and print services.

I stumbled across the server's date problem when I checked to see whether some server files matched the files I was deleting from a desktop system. You can imagine my surprise when I noticed that the files on the server showed a date of 01/09/01. I couldn't fathom why the server would skip a year. Logging on to the server console gave me an Invalid Day and Date message. To correct the problem, I needed only to set the correct date and change the date information on a few dozen files that had the wrong date. (If this little problem had occurred on a day that I'd written 5000 or 6000 files out to the server, I might not have felt so nonchalant about the problem. And had a backup run before I caught the error, my backup program might have had an interesting task of resolving dates.)
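The sweep I did by hand, looking for files stamped with impossible dates, is easy to automate. Here's a rough sketch in Python (not anything I ran on the server at the time; the 1990 cutoff is an arbitrary assumption, and you'd tune it to the oldest legitimate files on your network):

```python
import os
import time

def find_suspect_files(root, earliest=time.mktime((1990, 1, 1, 0, 0, 0, 0, 0, -1))):
    """Walk a directory tree and flag files whose modification time is
    implausible: earlier than `earliest` or later than right now.
    A clock that resets to 1601 stamps files with dates like these."""
    suspects = []
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file vanished or timestamp unreadable; skip it
            if mtime < earliest or mtime > now + 60:
                suspects.append((path, mtime))
    return suspects
```

Run nightly against a file server's shares, a report like this would have flagged the bad dates within a day instead of leaving them for a chance discovery.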

Like many network administrators, I prefer not to touch the consoles of the servers on my network. (I'm pretty sure the goal of most network administrators is a stable network that requires as little hands-on maintenance as possible.) When possible, I use remote administration tools to manage the servers and their applications. I've installed the necessary tools to perform remote administration, so other than to change backup tapes, I rarely touch my production servers.

Even after my little Y2K problem, I was still pretty comfortable with my management routine. After all, I wasn't likely to encounter a millennium bug again for another 1000 years. The next day permanently shattered my complacency—I got an email message notifying me that the Web site that I run for a regional car club was down. The message surprised me because that Web site uses the same server as my mail server, and I knew my mail server was up or I wouldn't have received that email message. None of my monitoring tools had generated any messages, so I checked the Web site on my local intranet, and sure enough, the site was inaccessible. Other Web sites on the same server were still up, so I launched the Microsoft Internet Information Server (IIS) remote Web-administration tool and checked on the status of the server. According to this tool, all was well.
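The monitoring gap here is that my tools watched the server, not the individual sites it hosted, so "all is well" at the host level hid a dead site. A per-site probe closes that gap; this is a minimal sketch in Python (the URLs are placeholders, and this isn't the monitoring tool I actually ran):

```python
from urllib.request import urlopen
from urllib.error import URLError, HTTPError

def probe_sites(urls, timeout=10):
    """Fetch each site's home page and record what came back.
    A host can answer pings and serve other sites while one
    virtual site is down, so check every site individually."""
    results = {}
    for url in urls:
        try:
            with urlopen(url, timeout=timeout) as resp:
                results[url] = resp.status  # 200 means the page loaded
        except HTTPError as err:
            results[url] = err.code        # server answered, but with an error
        except (URLError, OSError):
            results[url] = None            # no answer at all: the site is down
    return results
```

A missing home page shows up as a 404 rather than a healthy 200, which is exactly the failure my server-level tools waved through.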

I restarted the Web site's server to see whether a reboot would fix the problem. The server sent me the correct responses, but the Web site was still down. I checked my local shares on that server, and they were all up. Plus, I could copy files to and from the server. While scanning the directory that contained the files for this Web site, I noticed that some of the HTML files I expected to find were missing. One such file, default.html, is necessary to display the site's home page. Although I host the Web site, I don't maintain it, so I thought the site's Web master might have encountered a connection problem when he was making changes and been unable to upload new files after he had deleted the old ones. The Web master had encountered connection problems in the past and emailed me new files to post to the site.

I decided to pull the files for the site's root directory off the previous day's backup so that the site would be up while I looked for the cause of the problem. As I was waiting for the recovery to complete, I logged on to the server from its console to see whether it had other problems. (I don't run remote control software on my servers because the most remote server is only on the other side of my house.)

After turning the KVM switch to the offending server, I saw a screen full of corrupt-file messages. These messages came from an application that wasn't running on the same disk as the Web site. Because the messages all requested that I run Chkdsk, I rebooted the server and let Autochk run. The automated process ran on all the server's drive volumes and reported no errors on any volume. The system completed the boot process, and as if by magic, when the system came up, the Web site was back with all of its files. Do I know why this happened? No, but I removed the application that had generated all the error messages, and I'm hoping the problem won't recur. Now I flip the KVM switch to check every server once a day or so, hoping to head off problems before they affect users of my network resources.

Perhaps I was a little more ready to take a hands-off approach to my NT servers than the OS was. I'm not changing my hands-off policy entirely, but I'm amending it to a more proactive stance.