Avert unrecoverable failures

Last month was a bad month. I had to deal with two catastrophic hard disk failures and a couple of near misses in my primary desktop system, and I had to clean up the mess from each failure myself. The primary disadvantage of building your own computers is that you can't yell at someone else when something bad happens.

To begin, I'll describe the hard disk subsystem that I use, because it's a bit unusual for a desktop system. I'm a firm believer in SCSI devices, and I use only SCSI-based storage devices (other than the 3.5" floppy drive). Thus, the hard disks, CD-ROM drives, and tape units that I use are SCSI. When the first hard disk failed, my system had an 8X CD-ROM, one 1GB hard disk, one 4.5GB hard disk, one 9GB hard disk, two 2.1GB hard disks in individual external cases, and one external DAT backup installed. These devices connected to an Adaptec AAA-133 three-channel caching UltraSCSI controller designed for servers. I really like this controller even though it's more expensive than a standard workstation SCSI controller. The AAA-133 offers good hardware RAID support, although I wasn't using RAID at the time, and comes with advanced software-management tools, including the Adaptec CI/O Array Management software and the Adaptec firmware SCSI support utilities. The AAA-133 is also a cost-effective multichannel product when you compare its cost to that of multiple less-expensive controllers, such as the ubiquitous Adaptec 2940. Vendors such as IBM, HP, and Dell include another member of the AAA-130 controller family onboard many high-end Windows NT workstations.

At the start, I had the 1GB boot disk, the two external hard disks, and the DAT backup on the first channel; the 4.5GB and 9GB hard disks on the second channel; and the CD-ROM on the third channel. I use a large, server-style case with a big power supply and extra fans for this desktop system, which sports dual-Pentium II 266MHz processors and 160MB of RAM. I had also accumulated some extra hard disks. You never know when you'll need a spare hard disk or extra storage capacity, and the prices some sites (e.g., http://www.onsale.com) listed were too good to pass up. I purchased a 1GB hard disk for $90, a pair of Micropolis 4.5GB Fast/Wide hard disks for $139 each, a 9.1GB Seagate Elite hard disk for $200, and for some reason—you really have to watch impulse buying—a 2.1GB Seagate SCA interface drive with a SCSI adapter for about $89.

The first sign of trouble came when one of my external hard disks started to chirp for no apparent reason. In 1 week, the sound went from the occasional cricket to a full-blown chorus, so I heeded the warning of an impending disk failure. Fortunately, the disk contained only installation files for applications (e.g., installers, .cab files), so getting ahead of a full disk failure wasn't a big deal. I simply copied the contents of the disk to one of my servers, removed the disk, and created a network share mapped to the same drive letter to prevent confusion when applications looked for their installation files. I did a low-level format of the hard disk, which returned it to the occasional-chirp state, and installed a copy of NT Workstation on the disk, just in case I needed it. I left the disk in the chain, but I turned it off. I averted failure number one.

One morning 2 days later, I walked into my office and found the system down. The boot disk had failed. Although the SCSI controller recognized the disk, the system wouldn't boot. This hard disk was only the boot device, not the disk on which NT lived. Because this disk had the only FAT partition in the system, I spent half the day with the usual assortment of disk recovery tools trying to recover the disk. I eventually gave up, replaced the disk with the 1GB hard disk I had on the shelf, and restored the contents from a tape backup. I didn't lose anything but a day's worth of work time and a bit of patience. I dealt with failure number two.

A couple of days passed, and I again found my system dead when I entered my office in the morning. This time, a message had appeared on the screen: The boot loader couldn't read the system files on the system disk. I booted to my backup NT installation and ran the chkdsk utility on the faulted disk. The procedure fixed a half-dozen errors on the disk, and the system then rebooted into a normal startup. I dodged that bullet, right?

No such luck. A week later, I got back from a meeting and the system disk was toast. The SCSI BIOS hung when it tried to detect the disk. The disk was about as dead as a disk can get. To add insult to injury, this 4.5GB hard disk was the newest disk in the system (other than the recently replaced boot disk), with less than a year's worth of use. Worse still, I didn't have a recent backup of this disk. I had backups of some of the important files, such as my outlook.pst file and the mail data files from Eudora. But my most recent full backup of the disk was too old to use reliably with the upgraded applications and service packs that I had applied since then. A week's worth of work on current projects was totally gone.

I tried every trick I knew to get the disk back up so that I could pull the data off, but to no avail. Recovery from this failure still isn't complete. It took me nearly 4 days to get the system back up and running, restore the OS and applications, recover some of the data, and configure my system the way I like. I dealt with failure number three.

But the computer gremlins weren't finished with me. Minutes after I got the system buttoned up with a replacement 4.5GB hard disk, the computer began to make a screeching ball-bearing noise. After pulling the case apart again, I discovered that the noise came from my 9.1GB hard disk, which was also fairly new. I didn't want to deal with another disk failure, so I rebooted the system, copied the contents of that 9.1GB hard disk to a server, and replaced the disk with the Seagate Elite 9.1GB hard disk. Although the screeching 9.1GB disk is in my dead pile, at least it didn't cause any extra problems. I averted hard disk failure number four.

Everything is working now, and the system has been stable for 3 weeks. I have one minor problem with the Micropolis 4.5GB hard disks: The disks don't consistently respond fast enough for the Adaptec controller. About half the time, I get a "Disk drive not ready" response during the BIOS check of the drives when I reboot. But after the system completes the boot, the Micropolis 4.5GB hard disks are always available to the OS and the Adaptec management system software. I hope the slow response during reboot is just a bug in the disk-drive firmware. (Micropolis went belly-up, and as a result, technical support is a bit difficult to find.)

I've learned the obvious lesson from these hard disk failures: Keep my backups current. But I also discovered how tough it is to find devices and software to regularly back up more than 20GB of desktop storage (somewhat out of the ordinary for a desktop system). Next month, I'll describe the backup strategy that I decided to use and how I implemented it. I'll keep my fingers crossed and hope that I don't have more hard disk failures until I have a complete backup strategy in place.