If you've been around computers a while, you might remember 5.25" floppy disks. They didn't hold much, they were slow, and they weren't all that reliable, but they were much less expensive than hard disks—not to mention that many computers had no interface with which to connect a hard disk. Most of us learned to work around the idiosyncrasies of floppies, and when hard disk drives became affordable enough to use, we moved over to them to get their greater reliability. But how reliable are hard disk drives, and how much of what you "know" about their reliability is wrong? Bianca Schroeder and Garth A. Gibson, both of Carnegie Mellon University, know the answers.

In a paper published earlier this year, "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?", Schroeder and Gibson present the results of their research, which covers a total population of more than 100,000 SCSI, Fibre Channel, and Serial ATA (SATA) drives from four different vendors. They analyzed the data to answer three questions:

  • Do hard disk drives have to be replaced more frequently than other hardware components?
  • How accurate are hard drive vendors' reliability measurements?
  • How true are the key assumptions about statistical properties of hard drive failures?

To answer these questions, the authors drew on seven data sets: data from high-performance computing (HPC) clusters, warranty data, and empirical failure data from ISPs. The results might surprise you.

To answer the first question, Schroeder and Gibson examined data from one HPC installation and two ISPs. Hard disk drive failure turned out to be the root cause of node downtime or failure in a high percentage of cases; hard drives were one of the top three components to be replaced across those three data sets, accounting for 18.1 percent to 49.1 percent of component replacements. In the case of the HPC installation, the authors had detailed information on the number of CPUs, motherboards, and RAM DIMMs, which let them reach the surprising conclusion that, in that environment, the failure rate for hard drives and RAM DIMMs was roughly equal over a five-year period. That's very counterintuitive, given that hard drives have many electromechanical parts and DIMMs have none. CPUs failed about 2.5 times less often than hard drives, and motherboards failed 50 percent less often than hard drives.

The second question the authors investigated is important given that hard disk drive manufacturers don't usually disclose information about how they calculate mean time to failure (MTTF) measurements. The study data shows that the annual replacement rate (ARR) for hard drives is about 3 percent—much higher than vendor estimates. To put things in perspective, the highest annualized failure rate (AFR) in a vendor datasheet was 0.88 percent, so the vendors are off by a factor of more than three. However, the data didn't show any reliability difference between SATA, SCSI, and Fibre Channel hard drives.
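To see where a datasheet figure like 0.88 percent comes from, it helps to work the arithmetic. The sketch below assumes the common approximation that AFR is roughly the number of power-on hours in a year divided by the MTTF, and that a drive runs around the clock (8,760 hours a year); the function names are made up for illustration, and vendors may calculate their numbers differently.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; assumes the drive is powered on all year


def afr_from_mttf(mttf_hours: float) -> float:
    """Approximate annualized failure rate (as a fraction) implied by an MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def mttf_from_annual_rate(annual_rate: float) -> float:
    """Approximate MTTF in hours implied by an annual failure or replacement rate."""
    return HOURS_PER_YEAR / annual_rate


datasheet_afr = afr_from_mttf(1_000_000)  # the MTTF quoted in the paper's title
observed_arr = 0.03                       # the roughly 3 percent ARR cited above

print(f"Datasheet AFR: {datasheet_afr:.2%}")
print(f"Observed ARR is {observed_arr / datasheet_afr:.1f}x the datasheet AFR")
print(f"MTTF implied by a 3 percent ARR: {mttf_from_annual_rate(observed_arr):,.0f} hours")
```

Under these assumptions, a 1,000,000-hour MTTF works out to an AFR of about 0.88 percent, and a 3 percent replacement rate corresponds to an MTTF of roughly 292,000 hours. Keep in mind that a replacement isn't necessarily the same thing as a failure, so the two rates aren't strictly comparable.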

The answer to the third question from the study is of particular interest to Exchange Server administrators. Most of us use RAID as a data protection technology, and a key assumption in RAID design is that hard drive failures are independent. That is, if you lose one hard disk drive in a RAID array, the odds of losing a second one soon after are neither higher nor lower than the odds of losing a hard drive at any other time. It appears that this assumption isn't really true, and that the odds of losing a hard disk drive increase as the number of prior failures increases. This finding can potentially be explained in many different ways, ranging from environmental and power conditions to the likelihood that a manufacturing defect will affect multiple hard drives in the same batch. I don't have space to describe all the authors' observations about this question, but the bottom line is clear: Don't assume that your Exchange hard drives are immune to multiple failures in a short time.
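To make the independence assumption concrete, here is a minimal sketch of the baseline it implies. It assumes independent drives with a constant failure rate (an exponential lifetime model, which is my simplification and not something taken from the paper), and the array size and rebuild window are hypothetical values chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumes drives are powered on all year


def p_second_failure(n_drives: int, annual_rate: float, window_hours: float) -> float:
    """Probability that at least one surviving drive fails within the window
    after a first failure, IF failures are independent and the failure rate
    is constant over time (both are simplifying assumptions)."""
    hourly_rate = annual_rate / HOURS_PER_YEAR
    surviving = n_drives - 1
    # 1 - exp(-x), computed with expm1 for accuracy when x is tiny
    return -math.expm1(-surviving * hourly_rate * window_hours)


# Hypothetical example: an 8-drive array, a 3 percent annual replacement rate,
# and a 24-hour window in which to replace the drive and rebuild the array.
baseline = p_second_failure(n_drives=8, annual_rate=0.03, window_hours=24)
print(f"Baseline chance of a second failure during the window: {baseline:.4%}")
```

The number this produces is only the baseline you'd get if the independence assumption held. The study's finding that failures tend to cluster suggests the real odds during a rebuild are higher than this kind of calculation predicts.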

To get the full flavor of Schroeder and Gibson's results, you need to read the entire paper. For example, the authors present some data on ARRs calculated over time that seem to indicate that the traditional "bathtub curve" model for failure rates is wrong, and that hard drive replacement rates don't always stabilize at a low level after the initial burn-in period. I'd love to see a similar study from a large Exchange installation (say, Microsoft or one of the larger Exchange hosting companies) to compare with this data, but even without that, the raw results are quite interesting.