The statistics of mean time between failures (MTBF) and annualized failure rate (AFR) have gotten lots of attention lately in the storage world, especially with the release of three much-discussed studies devoted to the topic in the last year. And for good reason: Vendor-stated MTBFs have risen into the 1 million-to-1.5 million-hour range, equaling roughly 114 to 171 years of continuous operation, a lifespan that no one is seeing in the real world.
The three studies released over the past year are the following:
- Google's "Failure Trends in a Large Disk Drive Population"
- Carnegie Mellon University's "Disk Failures in the Real World"
- University of Illinois' "Are Disks the Dominant Contributor for Storage Failures?"
"MTBF is a term that's in growing disrepute inside the industry because people don't understand what the numbers mean," says Robin Harris, an analyst at Data Mobility Group who also runs the StorageMojo blog. "Your average consumer and a lot of server administrators don't really get why vendors say a disk has a 1 million-hour MTBF, and yet it doesn't last that long."
Indeed, "how do these numbers help a person who wants to evaluate drives?" says Steve Smith, a former EMC employee and an independent management consultant. "I don't think they can."
Even storage system maker NetApp acknowledges in a response to an open letter on the StorageMojo blog that failure rates are several times higher than reported. "Most experienced storage array customers have learned to equate the accuracy of quoted drive-failure specs to the miles-per-gallon estimates reported by car manufacturers," the company says. "It's a classic case of 'Your mileage may vary' -- and often will -- if you deploy these disks in anything but the mildest of evaluation/demo lab environments."
Study results
The upshot of the recent studies can be summarized this way: Users and vendors live in very different worlds when it comes to disk reliability and failure rates.
Consider that MTBF is a figure that's reached through stress-testing and statistical extrapolation, Harris says. "When the vendor specs a 300,000-hour MTBF -- which is common for consumer-level SATA drives -- they're saying that for a large population of drives, half will fail in the first 300,000 hours of operation," he says on his blog. "MTBF, therefore, says nothing about how long any particular drive will last." In other words, MTBF does a very poor job communicating what the actual failure profile looks like, he says.
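As a rough illustration of the statistical extrapolation described above, a common textbook estimate divides the total drive-hours accumulated in a test by the number of failures observed. The sketch below assumes that simple estimate; the function name, population size, test duration and failure count are all made up for illustration and do not come from any vendor or study.

```python
def estimate_mtbf_hours(num_drives: int, test_hours: float, failures: int) -> float:
    """Textbook point estimate: total drive-hours divided by observed failures."""
    if failures <= 0:
        raise ValueError("need at least one observed failure for a point estimate")
    return (num_drives * test_hours) / failures


# Hypothetical test: 1,000 drives run for 1,200 hours each, with 4 failures.
mtbf = estimate_mtbf_hours(num_drives=1000, test_hours=1200, failures=4)
print(f"Estimated MTBF: {mtbf:,.0f} hours")  # prints "Estimated MTBF: 300,000 hours"
```

No drive in this hypothetical test ran for more than 1,200 hours, which is one way to see why the resulting figure says little about how long any individual drive will last.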
It's like providing the average woman's height in the US but without showing the numbers used to derive that average, Smith says. "MTBF became the standard because it was perceived as a simpler answer to the question of reliability than showing the data of how they arrived at it," Smith says. "It's an honest-to-God simplification."
Stan Zaffos, an analyst at Gartner, agrees. While he believes MTBF is an accurate representation of what the vendors are experiencing with the technology they're shipping, it's also difficult to translate into something meaningful to end users. "It's a very complex and tortuous route to undertake, requiring a lot of solid engineering experience and an understanding of probability and statistics," he says.
According to Harris, the industry has tried to be less misleading by using AFR instead of MTBF. "People want to know, in a given year, what percentage of drives they can expect to fail," says Bianca Schroeder, a co-author of the Carnegie Mellon study.
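To see how the two figures relate, the sketch below converts a quoted MTBF into years and into an AFR. It assumes drives run around the clock and fail at a constant rate (the exponential model), which is a simplifying assumption made here and not a claim from the studies; the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365, ignoring leap years


def mtbf_to_years(mtbf_hours: float) -> float:
    """Express an MTBF in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def mtbf_to_afr(mtbf_hours: float) -> float:
    """Fraction of drives expected to fail in one year of 24x7 use,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (300_000, 1_000_000, 1_500_000):
    years = mtbf_to_years(mtbf)
    afr = mtbf_to_afr(mtbf)
    print(f"{mtbf:>9,} h MTBF = {years:5.1f} years, AFR = {afr:.2%}")
```

Under these assumptions, a 1 million-hour MTBF works out to an AFR of just under 0.9%, a far easier number to compare against the replacement rates a data center actually observes.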