When good disks go bad
- 19 September, 2007 11:49
Despite their increasing complexity in terms of both size and functionality, storage systems have achieved an impressive level of reliability. This is particularly noteworthy given that they are engineered around electro-mechanical devices (i.e. disks) that are among the components most prone to failure within the datacenter. The safeguards and redundancies designed into modern storage systems routinely handle most device failures as a matter of course, with little or no impact on overall operation.
As is often the case with technology, this reliability has reached the point where it is taken for granted, especially by those who aren't spending their days (and nights) on the storage management front lines. It's easy to forget that when things do go wrong, they can go very wrong. Occasionally, it's helpful to be reminded.
A friend of mine, who is the CIO for a mid-sized organization, recently shared his 72-hour nightmare experience with me. Their storage system, which housed key applications and email, inexplicably went down one evening. After contacting vendor support, they learned that the apparent reason for the outage was a known firmware bug that caused the controller to think that there were multiple drive failures.
Now, one might ask why the organization, under a valid support contract, had no prior notification about the firmware update to address such a serious problem. It seems that such a notification should have occurred, but hadn't.
If this had been the only problem, the outage, while serious, could have been resolved in fairly short order. Unfortunately, the problem was exacerbated by a series of tech support mishaps in the firmware update and system recovery process that led to multiple rebuilds and an extended period of time with the system needlessly operating at risk in a severely degraded mode.
This organization fell victim to what can be described as the Achilles' heel of storage infrastructure - the intersection of technology bugs and human error. This is a highly unpredictable type of risk, and unfortunately the opportunities for prevention and avoidance are few. Some things that can be done to reduce the likelihood of such a situation include:
- Verifying that you are receiving notifications of critical patches and updates
- Keeping configuration management information current
- Establishing a process to quickly flag and update at-risk systems
- Triple-checking, escalating, and obtaining expert approval when dealing with vendor technical support in critical recovery situations
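As a rough illustration of the third point, a process for flagging at-risk systems can start as a simple comparison between current configuration records and the vendor's advisories. The sketch below assumes that firmware versions are dotted numbers and that each advisory names the first firmware release containing the fix. The class names, array names, models, and version numbers are all hypothetical and do not come from any particular vendor or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageSystem:
    name: str      # hypothetical inventory name
    model: str     # hypothetical model identifier
    firmware: str  # dotted numeric version, e.g. "4.1.9"


@dataclass(frozen=True)
class Advisory:
    model: str     # model the advisory applies to
    fixed_in: str  # first firmware version that contains the fix
    summary: str


def parse_version(text: str) -> tuple[int, ...]:
    """Turn "4.1.9" into (4, 1, 9) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def flag_at_risk(systems, advisories):
    """Return (system, advisory) pairs where the firmware predates the fix."""
    flagged = []
    for system in systems:
        for advisory in advisories:
            same_model = system.model == advisory.model
            outdated = parse_version(system.firmware) < parse_version(advisory.fixed_in)
            if same_model and outdated:
                flagged.append((system, advisory))
    return flagged


if __name__ == "__main__":
    # Illustrative data only; a real inventory would come from
    # configuration management records.
    inventory = [
        StorageSystem("array-a", "X100", "4.1.9"),
        StorageSystem("array-b", "X100", "4.2.0"),
    ]
    notices = [
        Advisory("X100", "4.2.0", "controller misreports multiple drive failures"),
    ]
    for system, advisory in flag_at_risk(inventory, notices):
        print(f"{system.name}: upgrade to {advisory.fixed_in} ({advisory.summary})")
```

A check like this is only as good as the inventory and advisory data behind it, which is why the first two points on the list matter as much as the third.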
Jim Damoulakis is chief technology officer of GlassHouse Technologies, a leading provider of independent storage services. He can be reached at jimd@glasshouse.com.