It was the Monday morning after the July 4 weekend. The power went out in the tallest building in Philadelphia. Not to worry -- the disaster recovery (DR) specialists had that one covered: the building had a connection to a separate part of the grid. But then the repair crew accidentally severed the backup connection.
"Every disaster has a different face, so no one can accurately predict what will happen," says Nick Voutsakis, chief technology officer at Glenmede Trust, a wealth management firm whose headquarters occupies four floors of the building. "Your planning has to be flexible enough to cope."
Incidents like this one give businesses a chance to see their DR technology in action. While some companies pass with flying colours, the plans of others are exposed as incomplete, unrealistic and technologically flawed. So what are the tried-and-true best practices, which technologies should be deployed, and how should IT cooperate with the organisation as a whole to take all the necessary precautions?
"Those companies with untested or poorly tested plans will eventually discover that they aren't as protected as they thought they were," says Mike Karp, an analyst at Enterprise Management Associates.
Planning for the unplanned
Some DR plans are too simplistic, don't mesh with the real world and have little value in an emergency. Others are complex tomes that nobody reads. According to Voutsakis, the trick is finding a balance.
But even companies with well-compiled plans can look foolish if nobody can find the plan when they need it. It's no good if it's lost in a binder or in a PC that's down because of the disaster. So keep copies of the plan in multiple locations.
"We include copies of our plan in the emergency packs we provide to employees containing food, medical supplies, flashlights and so on," Voutsakis says.
Glenmede is primarily a Windows 2000/XP shop that uses Cisco Systems switches and Dell servers and desktops. Its DR plan has several layers, depending on the situation. If people can't get to work because of excessive snow, the servers keep running at headquarters and the staff works securely from home. If the building's power goes out, the critical systems can be brought up within four hours at a "hot site" across town owned by business continuity services and outsourcing provider SunGard Availability Services, a unit of SunGard Data Systems. If an event keeps employees out of the building for a week, desktops for key personnel are standing by at SunGard.
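A tiered plan like Glenmede's is easy to express as a lookup from scenario to response. The sketch below paraphrases the three tiers described above; the labels and the fallback are illustrative, not Glenmede's actual runbook.

```python
# Tiered DR responses, paraphrased from the article. The trigger
# strings and the fallback action are illustrative assumptions.
DR_TIERS = [
    ("staff cannot reach the office",
     "servers stay at HQ; staff work securely from home"),
    ("building power is lost",
     "bring critical systems up at the SunGard hot site within 4 hours"),
    ("building unusable for a week",
     "key personnel relocate to standby desktops at SunGard"),
]

def response(scenario: str) -> str:
    """Return the planned action for a known scenario."""
    for trigger, action in DR_TIERS:
        if trigger == scenario:
            return action
    # Unanticipated events go back to people, not a lookup table.
    return "escalate to the business continuity committee"

print(response("building power is lost"))
```

Keeping the plan as data rather than prose also makes it easy to print, distribute and diff -- useful when copies live in emergency packs as well as on servers.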
During the July weekend outage, Glenmede's management declared an emergency at 7:30am. Since all data is replicated to the hot site, the company had all systems running by 11:30am. But it takes a well-oiled machine to pull that off smoothly. And that means teamwork. "Form a business continuity program with a dedicated team of two to five people, with a senior management sponsor," advises Roberta Witty, an analyst at Gartner.
Glenmede's primary DR committee consists of the CTO, the heads of office services and risk management, and an IT audit member. The committee appointed an extended business continuity group consisting of representatives of 20 business units. These people are trained in business continuity, write the plans and collaborate with their business units. The minutes of both committees' sessions are sent to Glenmede's board of directors.
Each business unit has to evaluate its processes and needs. At The Members Group, a company that provides card-processing and mortgage services to credit unions, the necessary recovery period varied widely by department and time of the month. Payroll, for instance, might be happy with a 13-day recovery window at the start of the payroll period and a 30-minute recovery on payday.
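Requirements like these amount to a recovery-time-objective (RTO) schedule that varies by department and by date. The sketch below models the payroll example from the article; the other department names and figures are illustrative assumptions, not The Members Group's actual numbers.

```python
from datetime import date

# Hypothetical static RTOs, in minutes, for departments whose
# tolerance does not vary over the month (figures are illustrative).
STATIC_RTOS = {"card_processing": 30, "mortgage": 240}

def payroll_rto_minutes(on: date, payday: int = 15) -> int:
    """Payroll, per the article: 30 minutes on payday,
    13 days at the start of the payroll period."""
    if on.day == payday:
        return 30
    return 13 * 24 * 60

def rto_minutes(dept: str, on: date) -> int:
    """Return the recovery window, in minutes, for a department on a date."""
    if dept == "payroll":
        return payroll_rto_minutes(on)
    return STATIC_RTOS[dept]

print(rto_minutes("payroll", date(2004, 5, 15)))  # payday: tight window
print(rto_minutes("payroll", date(2004, 5, 2)))   # early in cycle: relaxed
```

Writing the schedule down this way forces each business unit to commit to concrete numbers that IT can then design replication and failover around.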
"You have to work with the business units to fully understand the drivers of each application," says Jeff Russell, CIO at The Members Group. It's impossible for a lone IT staffer to appreciate the particular needs of each department. The Members Group uses StoneFly Replicator, an IP storage-area network-based asynchronous disaster recovery product from StoneFly Networks to maintain a mirror image of critical data at a remote location.
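The general shape of asynchronous replication, as products like this use it, is that writes are acknowledged locally at once and shipped to the remote copy in the background. The sketch below illustrates that technique in miniature with an in-memory queue; it is not StoneFly's implementation, and the key/value model is an assumption for clarity.

```python
import queue
import threading

class AsyncMirror:
    """Minimal sketch of asynchronous mirroring: local writes return
    immediately; a background thread applies them, in order, to the
    replica (standing in for the remote site on the IP SAN)."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._log = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, key, value):
        self.primary[key] = value    # acknowledged to the caller at once
        self._log.put((key, value))  # replicated later, in write order

    def _drain(self):
        while True:
            key, value = self._log.get()
            self.replica[key] = value  # the "network transfer"
            self._log.task_done()

    def flush(self):
        self._log.join()  # block until the replica has caught up

m = AsyncMirror()
m.write("/accounts/42", "balance=100")
m.flush()
print(m.replica["/accounts/42"])
```

The trade-off is visible in the structure: because acknowledgement happens before replication, the replica can lag the primary, so a disaster can lose whatever is still sitting in the log -- the price paid for not stalling every write on a WAN round trip.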
While opinions vary as to what constitutes state-of-the-art technology, experts such as Karp of Enterprise Management Associates and Chip Nickolett, a disaster recovery specialist at Comprehensive Consulting Solutions, agree that clustering, SAN mirroring and replication are on the leading edge. However, they warn that these can be expensive technologies.
Among operating systems, OpenVMS and Unix seem to be favoured more than others. Alpha/OpenVMS, for example, has built-in clustering technology that many companies use to mirror data between sites. Many financial institutions, including Commerzbank, the International Securities Exchange and Deutsche Börse, rely on VMS-based mirroring to protect their heavy-duty transaction-processing systems.
Deutsche Börse, a German exchange for stocks and derivatives, has deployed an OpenVMS cluster over two sites situated five kilometres apart. It also uses Fibre Channel switches from Brocade Communications Systems and Cisco switches and routers in its network to ensure high availability. "DR is not about cold or warm backups, it's about having your data active and online no matter what," says Michael Gruth, head of systems and network support at Deutsche Börse. "That requires cluster technology which is online at both sites."
For its part, Windows has as many detractors as advocates. "While we've never failed to recover a Unix system, it's a different story with Windows," says Nickolett. "Common problems include failed restores, software conflicts and issues with patches or service packs."
Forbes.com in New York also favours platforms besides Windows. Each business day, it publishes more than 1500 articles online, making heavy use of an advertising workflow system running on an Intel/Linux platform and a content management system hosted on high-end Fujitsu servers that run Sun Solaris. Both are protected using the Continuous Protection System, an appliance from Revivio. A Gigabit Ethernet line connects to a data centre at an unspecified location using host-based mirroring technology. "We're able to switch to the appliance in the event that the primary system has a problem," says Michael Smith, general manager of operations at Forbes.com.
But not everyone agrees that Windows should be avoided. In fact, the Cancer Therapy & Research Centre (CTRC) in San Antonio stakes its patients' lives on a combination of Microsoft, EMC and Cisco tools for host-based mirroring. At the medical centre, 21 servers -- primarily Windows 2000/2003, plus a few Linux boxes -- store data on an EMC Clariion FC4700 array. Two Cisco SN 5428 iSCSI routers and a Cisco MDS 9506 switch mirror data and large imaging files over a Gigabit Ethernet network to another Clariion array at the research centre 35 kilometres away. According to Mike Luter, CTO at CTRC, it takes 10 minutes to recover a downed server and restore service.
"Business continuity is far more important to us than disaster recovery," Luter says. "We want our applications always available to our patients. If we lost the building, it would take a lot more than a few computer systems to be able to treat our patients elsewhere."
The finest technology and the most skilful planning are about as far as many companies go in DR, and that's nowhere near far enough. It takes extensive, repeated testing to prepare for the real thing. "Failing to follow through with exercises to locate and correct plan deficiencies is a common error," says John Glenn, a business continuity consultant.
That doesn't mean an IT administrator "dummy-running" the plan over the weekend on his own, Glenn says. You should bring all systems down on a Sunday to see if the remote site operates as planned. And bring in a few dozen employees and run a live test to see how the business units are affected. Can finance continue accounting, sales keep selling and production continue to turn out products? In addition, surprise everyone with a few random exercises during the workweek, suggests Smith of Forbes.com.
"We test our entire plan seven times a year," says Glenmede's Voutsakis. "We evaluate our performance for different levels of disaster and various kinds of events, including sending staff home to see how well they can perform there." He says that the problems that can cripple you during an actual disaster show up only during real-world exercises.
That was the case at The Members Group. It thought it had plenty of bandwidth to replicate off-site. But its T1 lines proved inadequate. For example, its SQL database couldn't be adequately replicated because of bandwidth constraints, so it hasn't been transferred to the IP SAN. Similarly, more than half of the company's servers remain unmirrored. "We're moving our primary facility in May and will add more bandwidth at that time," Russell says.
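The Members Group's problem is easy to check on the back of an envelope: if the daily change rate exceeds what the link can carry in a day, the replica falls ever further behind. The sketch below does that arithmetic; the 20 GB/day change rate is an illustrative assumption, while the T1 capacity (about 1.544 Mbit/s) is the standard figure.

```python
# Feasibility check for asynchronous replication over a fixed link.
T1_BITS_PER_SEC = 1_544_000  # standard T1 line rate

def backlog_hours_per_day(change_gb_per_day: float,
                          link_bits_per_sec: int) -> float:
    """Hours of replication backlog accumulated each day.
    Zero means the link keeps up; anything positive means the
    replica drifts further behind every day."""
    bits_per_day = change_gb_per_day * 8 * 1e9
    seconds_needed = bits_per_day / link_bits_per_sec
    backlog_seconds = seconds_needed - 24 * 3600
    return max(backlog_seconds, 0.0) / 3600

# A hypothetical 20 GB of daily change over a single T1 never catches up:
print(round(backlog_hours_per_day(20, T1_BITS_PER_SEC), 1))
```

Run before deployment, a check like this would have flagged that the SQL database's change rate could not fit through the available T1s, long before the live test exposed it.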