Windows IT Pro is the authoritative and independent resource for windows nt, windows 2000, windows 2003, windows xp. Features a collection of resources and magazines for windows IT professionals.
  
  


May 03, 2007

The Shocking Truth about Hard Disk Drive Failure Rates



If you've been around computers a while, you might remember 5.25" floppy disks. They didn't hold much, they were slow, and they weren't all that reliable, but they were much less expensive than hard disks—not to mention that many computers had no interface with which to connect a hard disk. Most of us learned to work around the idiosyncrasies of floppies, and when hard disk drives became affordable enough to use, we moved over to them to get their greater reliability. But how reliable are hard disk drives, and how much of what you "know" about their reliability is wrong? Bianca Schroeder and Garth A. Gibson, both of Carnegie Mellon University, know the answers.

In a paper published earlier this year, "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" Schroeder and Gibson present the results of their research, which covers a total population of more than 100,000 SCSI, Fibre Channel, and Serial ATA (SATA) drives from four different vendors. They analyzed the data to answer three questions:

  • Do hard disk drives have to be replaced more frequently than other hardware components?
  • How accurate are hard drive vendors' reliability measurements?
  • How true are the key assumptions about statistical properties of hard drive failures?

These questions were answered based on seven data sets gathered from high-performance computing (HPC) clusters, warranty data, and empirical failure data from ISPs. The results might surprise you.

To answer the first question, Schroeder and Gibson examined data from one HPC installation and two ISPs. They found that hard disk drive failure was the root cause of node downtime or failure in a high percentage of cases; hard drives were one of the top three components to fail across those three data sets, accounting for 18.1 percent to 49.1 percent of the hardware replacements. In the case of the HPC installation, the authors had detailed information on the number of CPUs, motherboards, and RAM DIMMs, which let them reach the surprising conclusion that, in that environment, the failure rate for hard drives and RAM DIMMs was roughly equal over a five-year period. That's very counterintuitive, given that hard drives have many electromechanical parts and DIMMs have none. CPUs failed about 2.5 times less often than hard drives, and motherboards failed about 50 percent less often than hard drives.

The second question the authors investigated is important given that hard disk drive manufacturers don't usually disclose information about how they calculate mean time to failure (MTTF) measurements. The study data shows that the annual replacement rate (ARR) for hard drives is about 3 percent—much higher than vendor estimates. To put things in perspective, the highest annualized failure rate (AFR) in a vendor datasheet was 0.88 percent, so the vendors are off by a factor of more than three. However, the data didn't show any reliability difference between SATA, SCSI, and Fibre Channel hard drives.
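The 0.88 percent datasheet figure is what you get from the 1,000,000-hour MTTF in the paper's title if you assume drives are powered on all year. Here is a minimal sketch of that arithmetic. The function names and the 8,760-hour year are my own assumptions for illustration; vendors don't publish exactly how they derive their numbers.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours; assumes drives run continuously


def afr_from_mttf(mttf_hours: float) -> float:
    """Approximate annualized failure rate (as a fraction) from an MTTF in hours."""
    return HOURS_PER_YEAR / mttf_hours


def mttf_from_rate(annual_rate: float) -> float:
    """Inverse: the MTTF in hours implied by an annual failure or replacement rate."""
    return HOURS_PER_YEAR / annual_rate


datasheet_afr = afr_from_mttf(1_000_000)  # about 0.00876, i.e. 0.88 percent
observed_arr = 0.03                       # the roughly 3 percent ARR cited above

print(f"Datasheet AFR: {datasheet_afr:.2%}")
print(f"Observed ARR is {observed_arr / datasheet_afr:.1f}x the datasheet AFR")
print(f"MTTF implied by a 3 percent ARR: {mttf_from_rate(observed_arr):,.0f} hours")
```

Read the other way around, a 3 percent annual replacement rate implies an effective MTTF well under a third of the datasheet value, which is another way of stating the "factor of more than three" gap.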

The answer to the third question from the study is of particular interest to Exchange Server administrators. Most of us use RAID as a data protection technology, and a key assumption in RAID design is that hard drive failures are independent. That is, if you lose one hard disk drive in a RAID array, the odds of losing a second one soon after are neither higher nor lower than the odds of losing a hard drive at any other time. It appears that this assumption isn't really true, and that the odds of losing a hard disk drive increase as the number of prior failures increases. This finding can potentially be explained in many different ways, ranging from environmental and power conditions to the likelihood that a manufacturing defect will affect multiple hard drives in the same batch. I don't have space to describe all the authors' observations about this question, but the bottom line is clear: Don't assume that your Exchange hard drives are immune to multiple failures in a short time.
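To make the independence assumption concrete, here is a rough sketch of the calculation it leads to: the chance that at least one surviving drive fails while an array is rebuilding. It assumes independent, exponentially distributed failures, which is the textbook model the study calls into question. The array size, rebuild window, and the `hazard_multiplier` parameter are hypothetical values I chose for illustration; none of them comes from the paper.

```python
import math

HOURS_PER_YEAR = 24 * 365


def p_second_failure(surviving_drives: int, annual_rate: float,
                     window_hours: float, hazard_multiplier: float = 1.0) -> float:
    """Probability that at least one surviving drive fails within the window.

    Assumes independent, exponentially distributed failures. hazard_multiplier
    is a hypothetical knob (not from the paper) that scales the failure rate
    upward to mimic failures that cluster after a first failure.
    """
    hourly_rate = annual_rate / HOURS_PER_YEAR * hazard_multiplier
    return 1.0 - math.exp(-surviving_drives * hourly_rate * window_hours)


# Illustrative inputs only: 7 surviving drives, 3 percent ARR, 24-hour rebuild.
baseline = p_second_failure(7, 0.03, 24)
clustered = p_second_failure(7, 0.03, 24, hazard_multiplier=4.0)

print(f"Independent model: {baseline:.4%}")
print(f"Rate scaled 4x:    {clustered:.4%}")
```

The absolute numbers matter less than the comparison: if failures really do cluster, a RAID design sized around the independent model understates the risk of a second loss during a rebuild.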

To get the full flavor of Schroeder and Gibson's results, you need to read the entire paper. For example, the authors present some data on ARRs calculated over time that seem to indicate that the traditional "bathtub curve" model for failure rates is wrong, and that hard drive replacement rates don't always stabilize at a low level after the initial burn-in period. I'd love to see a similar study from a large Exchange installation (say, Microsoft or one of the larger Exchange hosting companies) to compare with this data, but even without that, the raw results are quite interesting.





 Windows IT Pro is a Division of Penton Media Inc.
 Copyright © 2009 Penton Media, Inc., All rights reserved. Terms and Use | Privacy Statement | Reprints and Licensing