Windows IT Pro is the leading independent community for IT professionals deploying Microsoft Windows server and client applications and technologies.
  
  
May 03, 2007

The Shocking Truth about Hard Disk Drive Failure Rates


If you've been around computers a while, you might remember 5.25" floppy disks. They didn't hold much, they were slow, and they weren't all that reliable, but they were much less expensive than hard disks—not to mention that many computers had no interface with which to connect a hard disk. Most of us learned to work around the idiosyncrasies of floppies, and when hard disk drives became affordable enough to use, we moved over to them to get their greater reliability. But how reliable are hard disk drives, and how much of what you "know" about their reliability is wrong? Bianca Schroeder and Garth A. Gibson, both of Carnegie Mellon University, know the answers.

In a paper published earlier this year, "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" Schroeder and Gibson present the results of their research, which covers a total population of more than 100,000 SCSI, Fibre Channel, and Serial ATA (SATA) drives from four different vendors. They analyzed the data to answer three questions:

  • Do hard disk drives have to be replaced more frequently than other hardware components?
  • How accurate are hard drive vendors' reliability measurements?
  • How true are the key assumptions about statistical properties of hard drive failures?

These questions were answered based on seven data sets gathered from high-performance computing (HPC) clusters, warranty data, and empirical failure data from ISPs. The results might surprise you.

To answer the first question, Schroeder and Gibson examined data from one HPC installation and two ISPs. They found that hard disk drive failure was the root cause of node downtime or failure in a high percentage of cases; hard drives were one of the top three components to be replaced across those three data sets, accounting for 18.1 percent to 49.1 percent of hardware replacements. In the case of the HPC installation, the authors had detailed information on the number of CPUs, motherboards, and RAM DIMMs, which let them reach the surprising conclusion that, in that environment, the replacement rate for hard drives and RAM DIMMs was roughly equal over a five-year period. That's very counterintuitive, given that hard drives have many electromechanical parts and DIMMs have none. CPUs were replaced about 2.5 times less often than hard drives, and motherboards about 50 percent less often.

The second question the authors investigated is important given that hard disk drive manufacturers don't usually disclose information about how they calculate mean time to failure (MTTF) measurements. The study data shows that the annual replacement rate (ARR) for hard drives is about 3 percent—much higher than vendor estimates. To put things in perspective, the highest annualized failure rate (AFR) in a vendor datasheet was 0.88 percent, so the vendors are off by a factor of more than three. However, the data didn't show any reliability difference between SATA, SCSI, and Fibre Channel hard drives.
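The arithmetic behind that comparison is easy to check. The sketch below assumes the usual datasheet convention that a nominal AFR is simply the hours in a year divided by the MTTF; the function name is made up for illustration, and the 1,000,000-hour figure comes from the paper's title.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation


def afr_from_mttf(mttf_hours: float) -> float:
    """Nominal annualized failure rate (as a fraction) implied by an MTTF.

    Assumption: the simple datasheet approximation AFR = hours per year / MTTF.
    """
    return HOURS_PER_YEAR / mttf_hours


# An MTTF of 1,000,000 hours implies a nominal AFR just under 1 percent.
datasheet_afr = afr_from_mttf(1_000_000)

# The article's figure for the observed annual replacement rate.
observed_arr = 0.03

# How far the field data sits above the datasheet value.
ratio = observed_arr / datasheet_afr

print(f"Datasheet AFR: {datasheet_afr:.2%}")
print(f"Observed ARR is {ratio:.1f}x the datasheet AFR")
```

Under that assumption, a 1,000,000-hour MTTF corresponds to the 0.88 percent datasheet AFR quoted above, and a 3 percent replacement rate is a little more than three times higher.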

The answer to the third question from the study is of particular interest to Exchange Server administrators. Most of us use RAID as a data protection technology, and a key assumption in RAID design is that hard drive failures are independent. That is, if you lose one hard disk drive in a RAID array, the odds of losing a second one soon after are neither higher nor lower than the odds of losing a hard drive at any other time. It appears that this assumption isn't really true, and that the odds of losing a hard disk drive increase as the number of prior failures increases. This finding can potentially be explained in many different ways, ranging from environmental and power conditions to the likelihood that a manufacturing defect will affect multiple hard drives in the same batch. I don't have space to describe all the authors' observations about this question, but the bottom line is clear: Don't assume that your Exchange hard drives are immune to multiple failures in a short time.
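To see what the independence assumption implies in practice, here is a minimal sketch of the textbook calculation: the chance that at least one more drive in an array fails while a failed drive is being replaced and rebuilt. The function name, the array size, and the 24-hour rebuild window are all hypothetical, and the model assumes a constant failure rate, which is exactly the kind of assumption the study calls into question.

```python
import math

HOURS_PER_YEAR = 24 * 365


def p_additional_failure(n_remaining: int, annual_rate: float,
                         window_hours: float) -> float:
    """Probability that at least one of n_remaining drives fails in the window.

    Assumptions (illustrative only): failures are independent and each drive
    has a constant hourly failure rate of annual_rate / HOURS_PER_YEAR.
    """
    rate_per_hour = annual_rate / HOURS_PER_YEAR
    return 1.0 - math.exp(-n_remaining * rate_per_hour * window_hours)


# Hypothetical 8-drive array: one drive has failed, 7 remain,
# 3 percent annual replacement rate, 24-hour rebuild window.
p = p_additional_failure(n_remaining=7, annual_rate=0.03, window_hours=24)
print(f"Chance of another failure during the rebuild: {p:.4%}")
```

If failures cluster the way the study suggests, the real risk during a rebuild is likely higher than this independent-failure estimate, so treat the result as an optimistic baseline rather than a prediction.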

To get the full flavor of Schroeder and Gibson's results, you need to read the entire paper. For example, the authors present some data on ARRs calculated over time that seem to indicate that the traditional "bathtub curve" model for failure rates is wrong, and that hard drive replacement rates don't always stabilize at a low level after the initial burn-in period. I'd love to see a similar study from a large Exchange installation (say, Microsoft or one of the larger Exchange hosting companies) to compare with this data, but even without that, the raw results are quite interesting.

 Windows IT Pro is a Division of Penton Media Inc.
 © 2010 Penton Media, Inc. Terms of Use | Privacy Statement