Hurricane Katrina disaster recovery lessons still popping up
- 07 May, 2008 10:33
For at least three days before Hurricane Katrina struck, Marshall Lancaster and his IT team at Lagasse were closely tracking the storm, hoping it would spare the company's New Orleans-based headquarters and data center but preparing for the worst. By the time Katrina made landfall early on a Monday morning in August 2005, Lancaster and his team were in Chicago at the company's backup data center, having already declared a disaster.
At the time, Lancaster was an IT executive with Lagasse, a subsidiary of United Stationers, where he now serves as vice president of IT, Enterprise Infrastructure Services. While Katrina ravaged New Orleans, Lagasse experienced no system downtime. In fact, the day after Katrina hit, the company recorded its second-largest sales day, and its third-largest the day after that.
Lancaster related his Katrina experiences in a keynote address at the recent Network World IT Roadmap event in the US. He spoke of the need to consider the people element in disaster planning and how when a disaster strikes, it bears little resemblance to any pre-planned disaster recovery drill. (Read sidebar: "Disaster recovery tips and lessons learned".)
"When an event occurs, it isn't just about whether or not your systems come back online, but where's everybody going to be?" Lancaster said.
Anatomy of a disaster
Lagasse was battle-tested by the time Katrina rolled in, having experienced four hurricanes in the previous few years: Isidore and Lili in 2002, Ivan in 2004 and Dennis earlier in 2005. Indeed, the company had the drill down pat.
On Thursday, August 25, Lancaster and his team began to take serious note of Katrina by implementing a "Level 1 inclement weather policy," Lancaster said. That basically just tells employees the company is tracking the storm.
The next day, the company went to Level 2, which is when it tells its associates to make sure their homes are in order, with sufficient supplies of food, water and the like. "We were still pretty hopeful [Katrina] was going to veer," he said.
By Saturday morning, August 27, the five computer models Lagasse was tracking all showed the storm pointed at New Orleans. The only question was whether it would be a direct hit. But Katrina was by now so powerful that even a glancing blow was likely to mean substantial damage.
The company declared a Level 3 emergency that morning, which meant planning for the headquarters and data center to be closed on Monday morning. Critical personnel had to be transported to somewhere safe, with access to communications.
Those critical personnel included Lancaster and his IT team, who headed to Chicago to make sure the company's backup systems were ready. "We still had a lot of unfounded optimism that this storm would pass us by and we would be spared," he said.
By that night, with all meteorological models showing Katrina making a direct hit on New Orleans, that optimism was gone. "At 8:55 p.m., we decided to declare a disaster." That means turning on the disaster recovery platforms and using them going forward. By midnight, all Tier 1 applications were online and tested. By 7:33 p.m. the next day, all Tier 2 applications were available. "That means all customer-facing business capacity was online and working."
At 6:10 a.m. on Monday, August 29, Katrina made landfall in New Orleans. From Chicago, Lancaster and his team monitored their New Orleans data center, to see whether the backup generators and other redundant features in place would keep it operational. "Less than an hour and a half after the storm arrived, our New Orleans data center went dark," Lancaster said. (At the same time, his presentation screen likewise went dark, raising chuckles from the audience. It wasn't for effect, he said, but because he hit a certain button that he'd been warned about.)
What enabled Lagasse to survive Katrina was a practical plan forged through trial and error during the previous hurricanes. "I can learn if I'm hit over the head by things and that's what happened in this case," Lancaster said. When Hurricanes Isidore and Lili hit in 2002, the company's disaster plan included assumptions that didn't pan out. Things were better by the time Ivan hit in 2004, when the company was forced to declare a disaster and run its operations from the Chicago backup site for five days.
"There's no better test than actually doing it and running your business that way," Lancaster said of the Ivan experience. "This wasn't testing. This was real live fire."
One of the lessons learned was the importance of coming up with the tiering strategy that dictates the order in which applications are brought back online following a disaster. Lancaster sought to come up with tiers that are easy to understand and communicate to the business side.
Tier 1 applications are those specifically required to generate revenue. For Lagasse, that means the ability to take, pick and ship orders. The goal is that such applications suffer no more than 15 minutes of data loss and be recovered within six hours. "That was deemed acceptable by the business, especially considering we're maintaining a low cost profile," he said, noting the organization's IT budget was just 0.8 per cent of revenue.
These applications should also be recoverable through semi-automated means and without assuming that specific, highly knowledgeable people are available. The use of scripts and detailed documentation meant personnel with good general IT knowledge would be able to recover the resources; it didn't necessarily require the same people who work with them every day, he noted. Applications such as the company's ERP system were continuously replicated to the Chicago site via a 3Mbps frame relay link.
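A 3Mbps link sounds modest for continuous replication, but a back-of-envelope check shows why it can satisfy a 15-minute recovery point objective: what matters is whether the link outpaces the average rate of data change at the primary site. The sketch below uses the article's 3Mbps figure; the daily change-rate value is a hypothetical assumption for illustration, not a number from the article.

```python
# Back-of-envelope check: can a 3Mbps frame relay link keep a remote
# replica within a 15-minute RPO? It can, as long as the average rate
# of data change stays below link capacity, with headroom for bursts.

LINK_MBPS = 3
link_bytes_per_sec = LINK_MBPS * 1_000_000 / 8        # 375,000 bytes/s
per_day_gb = link_bytes_per_sec * 86_400 / 1e9        # link capacity per day

# Assumed daily write volume at the primary site (hypothetical figure).
daily_change_gb = 20
avg_change_bps = daily_change_gb * 1e9 * 8 / 86_400   # average change rate

print(f"link capacity:       {per_day_gb:.1f} GB/day")
print(f"average change rate: {avg_change_bps / 1e6:.2f} Mbps")

# Replication keeps up when the change rate fits within the link.
assert avg_change_bps < LINK_MBPS * 1e6
```

Under these assumptions the link can move roughly 32 GB a day against an average change rate under 2Mbps, which is why modest WAN bandwidth was enough for Tier 1 replication.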
Tier 2 applications are those that have to do with the customer experience. Essentially, that means anything that customers would notice if it were down, such as online order entry and various reporting applications. For these applications, the company is willing to lose as much as 24 hours of data and live with a recovery objective of three days. Less automation is involved in recovering these resources and it can be difficult without specific IT staffers.
At Tier 3 are computing resources used only internally, whose loss no one outside the company would notice. "They'd only hurt us," Lancaster noted. The IT group makes no specific commitment as to when it will recover Tier 3 applications, he said.
"Spending a lot of money and adding a lot of complexity to become very good at recovering Tier 3 applications really wasn't very value added," Lancaster said. "We'd rather hit the Tier 1 and Tier 2 [applications] 100 per cent and worry about the Tier 3 when the time comes."
Post-Katrina, Lancaster said, some adjustments were in order in how applications were classified. Financial systems, for example, fit the Tier 3 definition. But Katrina hit in late August, and September is the last month of the quarter. By September 8, Lancaster was hearing from the CFO about Securities and Exchange Commission regulations.
Likewise, e-mail was originally classified as a Tier 3 application. But in the wake of Katrina, "We found e-mail to be about the most valuable communication tool we had at our disposal," Lancaster said. "It very quickly escalated to Tier 1."
Another key to Lagasse's successful disaster recovery plan was keeping its application architecture simple. Whenever possible, his group strives to be involved in defining the solution to a business need, rather than having solutions forced on it. "When an application gets forced on you, it often has architectural principles that are not aligned with what you do, and so you're not very good at [supporting the application]. It just makes things a lot harder," he said.
A few specific technologies were also crucial to the Lagasse recovery effort: VoIP, VPNs and thin clients. VoIP enabled Lagasse to create call centers virtually anywhere, including in its shipping facilities and warehouses, by simply dropping phones in. Call agents could go to these facilities and appear to be in the same call queue as teams in the company's traditional call centers, he said.
Likewise, with VPNs and a Citrix-based thin client capability, displaced staffers who had access to an Internet connection could become productive again. "Every user who had a laptop became a productivity worker the instant they could find a wire," Lancaster said.
The people part
One of the more difficult aspects of coming up with a disaster recovery plan is accounting for individual employees after disaster strikes. "The people element is largely missing in every conversation I've ever had about this subject," Lancaster said.
When companies perform disaster recovery tests, it normally involves booking flight reservations and hotel rooms months in advance. As the event draws closer, staffers argue about where to get drinks the night following the event. At the event itself, everyone gathers around a big table and lets each other know when their bit is complete, so the next step can begin.
"That's not how it really happens," Lancaster said, noting he learned from the experience of those earlier hurricanes. "In 2002, when we asked associates to take part in disaster recovery, the first thing they said was: 'I've got a husband and two kids, or a wife and a kid and two dogs, and I've got things to take care of.' The company just fell off the priority list."
By 2004, Lagasse had strategies in place to ensure that it wouldn't ask employees to go anywhere until their families were taken care of, either by moving families to a safe location or letting them accompany employees. This was a powerful step that eliminated a lot of scrambling when a disaster occurred, enabling faster decision-making, he said. After Katrina, Lagasse employees scattered from New Orleans to areas where Lagasse had a presence -- including Chicago, Atlanta and Philadelphia -- and to areas it didn't, such as Tennessee, Texas and other parts of Louisiana and Florida. In some of those areas, Lagasse had sites where employees could gather, while in others they worked out of homes, hotel rooms or Internet cafes.
In the end, it was those employees who made the disaster plan work. "All plans fail in the face of the enemy. We ended up with associates having to make decisions on the fly, and having to make risky, very difficult decisions on the fly," Lancaster said. "And the caliber of those people greatly determined how effective those decisions were. So hiring and development is very important."