Multiple short outages can add up to major problems
- 13 May, 2008 12:14
Corporate executives have long created IT plans to cope with major disasters, but now they're increasingly taking steps to prevent the brief shutdowns that can cost companies hundreds of thousands of dollars or more in their own right.
Users and analysts at IDC's Enterprise Data Center Forum here last week listed several options for quickly recovering from or preventing relatively minor incidents -- like user miscues or electricity brownouts -- that can shut down systems for an hour to a half-day or so.
Doug Roberts, manager of system services at Hannaford Bros., became aware of the threat posed by seemingly minor incidents about 10 years ago, when his company had a single data center with a diesel generator for backup.
At the time, the US-based supermarket chain was focused on preparing for major disasters. "We'd do the big four-and-a-half-day disaster recovery event, planning for a hurricane or whatever," Roberts said. "We'd go to the IBM facility, practice the drill."
Then an incident completely out of Hannaford's control temporarily shut down the data center and the backup generator. At a truck yard across the street, Roberts said, an 18-wheeler "did a U-turn and [accidentally] dumped the contents of its fuel tank." The city shut down all power to the area and wouldn't allow Hannaford to use its generator because of the risk of fire.
After that incident, Hannaford installed near-real-time backup systems for its mainframes and key Unix and Windows servers at another data center about seven miles away, as well as at a smaller facility in upstate New York. "It's kind of a poor man's cluster," Roberts said.
In an August 2007 IDC survey of 350 data center professionals, about 37 per cent of the respondents said that their data centers had experienced an outage of some sort. The survey did not ask about the length of outages or when they occurred.
Matthew Eastwood, an IDC analyst, said human error is the most common cause of data center outages. Causes range from mistakenly hitting the emergency power-off button to tripping over a power cord.
The second most common cause of outages is an incident outside of the data center's control, such as the fuel spill at Hannaford.
Eastwood said that data centers can also face problems when cooling and power equipment, which is often overseen by the facilities group, is not in sync with IT requirements.
"Both groups should report into the same organization," or at least they should better coordinate their plans, Eastwood said.
Toyota Financial Services found another route to cutting down on short-term data center outages.
Not too long ago, the company had what it considered major incidents -- outages of at least an hour -- three or four times a week, according to Dave Howard, national manager of service management at Toyota's financing arm. The problems included downed networks, enterprisewide application problems, and server or facility outages, he said.
Since the company adopted the Information Technology Infrastructure Library, or ITIL -- best practices for managing systems and networks -- outages have been cut back to one every three or four months, according to Howard.
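To put Howard's before-and-after figures on the same footing, both can be converted to outages per year. The short Python sketch below does that arithmetic. It assumes a 52-week, 12-month year, and the function names are illustrative only; nothing here comes from Toyota Financial Services beyond the two ranges quoted above.

```python
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12


def per_year_from_weekly(low_per_week, high_per_week):
    """Annualize a 'low to high outages per week' range."""
    return low_per_week * WEEKS_PER_YEAR, high_per_week * WEEKS_PER_YEAR


def per_year_from_interval(min_months_between, max_months_between):
    """Annualize 'one outage every N months' as (fewest, most) per year."""
    fewest = MONTHS_PER_YEAR / max_months_between
    most = MONTHS_PER_YEAR / min_months_between
    return fewest, most


def reduction_range(before, after):
    """Return the (smallest, largest) fractional reduction in outages."""
    smallest = 1 - after[1] / before[0]  # fewest before, most after
    largest = 1 - after[0] / before[1]   # most before, fewest after
    return smallest, largest


# Before ITIL: three or four major incidents a week.
before = per_year_from_weekly(3, 4)
# After ITIL: one outage every three or four months.
after = per_year_from_interval(3, 4)
low, high = reduction_range(before, after)

print(f"Before: {before[0]} to {before[1]} outages a year")
print(f"After: {after[0]:.0f} to {after[1]:.0f} outages a year")
print(f"Reduction: {low:.1%} to {high:.1%}")
```

On those assumptions, the quoted ranges work out to roughly 156 to 208 major incidents a year before the change and three or four a year afterward, a drop of about 97 per cent or more.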
"Because we have better incident and change management, when something goes down today, we know what happened," Howard said. "And 70 per cent to 80 per cent of the time, it's our service providers and not us."
Left unchecked, outages could have increasingly dire consequences for businesses, analysts noted.
Consider the fact that, as part of the movement toward data center consolidation and server virtualization, companies are centralizing increasing amounts of equipment in single data centers. "If 60 per cent of your assets are centralized in one data center [and] the data center is down, the business is down, too," said IDC analyst Michelle Bailey.
She said data center managers should use that kind of reasoning to convince wary executives of the potential ROI of new systems that could prevent disasters.