"Data mirroring is like teenage sex. Everybody says they're doing it, but they're not." So says Charlie Miller, head of data center engineering at KeyCorp, which learned that lesson the hard way over the past two and a half years.
The Cleveland-based bank's data-mirroring odyssey began in the fall of 2000 with a seemingly straightforward business decision to upgrade the disaster recovery capabilities for its entire mainframe production environment, including its online banking system. At the time, the bank's best-case data recovery scenario using an in-place tape backup system was close to 72 hours, with one to three days' worth of data loss. KeyCorp's goal was to be back in business in 12 hours or less, with near-zero loss. That made remote data mirroring the only technical option.
"Technologically, it's a pretty simple concept," says CIO Robert Rickert. "At the same time that an ATM or other transaction comes through, it's written to our disk (at KeyCorp's data center in Albany, N.Y.), and then so many milliseconds later, it's written to a disk down (at an IBM disaster recovery facility) in Gaithersburg, Md. If something blows up in Albany, you flip the switch, connect up a processor, and you're ready to go."
But as Rickert and other KeyCorp officials discovered many times throughout the project, there can be huge gaps between technical concept and reality.
For starters, "there were allegations -- by all of the (bidding) vendors -- that there were a lot of people doing data mirroring, but we had trouble finding people doing even a rudimentary variation of what we were trying to do," says Miller. Plenty of financial services companies mirror data between data centers in New York and disaster recovery sites in other parts of Manhattan or in New Jersey, for instance. "But we were trying to go 500 miles, and at the time, we had 9TB of data and our write I/O was 100MB/sec.," Miller says.
IBM Corp.'s Extended Remote Copy (XRC) mirroring technology, which KeyCorp ultimately chose to implement over offerings from Hopkinton, Mass.-based EMC Corp. and Tokyo-based Hitachi Ltd., was also much less mature and less stable than KeyCorp officials say they were led to believe. On the other hand, the bank says it underestimated the undertaking's complexity and didn't take full advantage of some resources that could have helped during the rollout. As a result of those issues and KeyCorp's lack of experience with the new storage technology, it took more than two years -- a year longer than expected -- to complete the project.
During that time, KeyCorp suffered at least three mainframe crashes and reboots, several production-system outages, agonizing software-tuning problems and a self-imposed project "chill period." KeyCorp then solicited a second round of bids from vendors, including IBM, which had won the original contract in part because it had agreed to also provide the implementation services that KeyCorp needed to set up and run the data backup system.
Making matters worse was the fact that the project had no return on investment. "This was really an insurance-policy type of project and not a project that would result in financial savings," Rickert explains.
The good news is that today, after reaffirming its choice of IBM in a second round of bidding, KeyCorp is in full production with a new backup system consisting of IBM Shark DASD (direct-access storage device) disk drives with XRC mirroring technology at both the bank's data center and IBM's disaster recovery facility. Five dedicated OC3 lines link the two sites, with KeyCorp mirroring between 6TB and 8TB of data -- the equivalent of millions of transactions per day.
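As a rough back-of-the-envelope check on those figures, the sketch below compares the raw capacity of five OC3 circuits with the 100MB/sec. write rate Miller cites. The 155.52 Mbit/sec. value is the standard OC3 line rate; usable payload is somewhat lower, and the function name is purely illustrative.

```python
OC3_LINE_RATE_MBIT = 155.52  # standard OC3 line rate in Mbit/s (payload is lower)

def aggregate_mbytes_per_sec(links: int, rate_mbit: float = OC3_LINE_RATE_MBIT) -> float:
    """Raw aggregate capacity of `links` circuits in MB/s (decimal units, 8 bits per byte)."""
    return links * rate_mbit / 8

peak_write_mb = 100.0  # write I/O figure quoted in the article
capacity = aggregate_mbytes_per_sec(5)
print(f"5 x OC3: {capacity:.1f} MB/s raw, {capacity / peak_write_mb:.0%} of {peak_write_mb:.0f} MB/s")
```

On these assumptions, the raw link capacity sits just under the quoted write rate, which suggests why trimming the mirrored data set and buffering bursts mattered so much.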
The bank ran a disaster drill last summer to test the system and was able to recover its mainframe environment in under 12 hours with "milliseconds of data loss," says Rickert. KeyCorp has set a data-mirroring delay threshold of five minutes for the Gaithersburg site. After that, the mirroring system must suspend cleanly and write data to a different backup system rather than slow down and stall the mainframe production environment.
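The suspend rule can be pictured as a simple threshold check. The sketch below is only an illustration of that policy, not the XRC or GDPS interface; `MirrorState`, `lag_seconds` and `apply_threshold` are hypothetical names.

```python
from dataclasses import dataclass

SUSPEND_AFTER_SECONDS = 5 * 60  # five-minute delay threshold from the article

@dataclass(frozen=True)
class MirrorState:
    lag_seconds: float       # how far the remote copy trails production (hypothetical metric)
    suspended: bool = False  # True once mirroring has been suspended

def apply_threshold(state: MirrorState, threshold: float = SUSPEND_AFTER_SECONDS) -> MirrorState:
    """Suspend the mirror when lag exceeds the threshold instead of stalling production."""
    if not state.suspended and state.lag_seconds > threshold:
        return MirrorState(lag_seconds=state.lag_seconds, suspended=True)
    return state

print(apply_threshold(MirrorState(lag_seconds=120)).suspended)
print(apply_threshold(MirrorState(lag_seconds=301)).suspended)
```

The design point is that exceeding the limit changes the mirror's state rather than throttling the production side.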
But getting to this point wasn't at all easy, according to everyone involved with the project. KeyCorp encountered several thorny problems along the way, but the IT team was able to solve them.
KeyCorp's tape backup system was set up to restore critical applications, such as core accounting systems, on a priority basis in the event of a disaster. Back-office systems were secondary. But with the advent of online banking and other real-time, Internet-based transactions, "applications we thought were less critical had feeds into the more critical applications," says Rob Bellanti, vice president of data center engineering. Yet mirroring all of the data would be expensive and would slow down the entire system.
KeyCorp solved the problem by configuring the XRC mirroring software based on data set characteristics. Data files characterized as temporary work files and report files, for instance, are excluded from mirroring.
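Selecting what to mirror by data set characteristics might look something like the following sketch. The naming patterns and data set names are invented for illustration; the article does not say how KeyCorp actually identified temporary work files and report files.

```python
from fnmatch import fnmatchcase

# Hypothetical naming patterns for data sets that are NOT mirrored.
EXCLUDE_PATTERNS = ["*.TEMP.*", "*.WORK.*", "*.REPORT.*"]

def should_mirror(dataset_name: str, exclude=EXCLUDE_PATTERNS) -> bool:
    """Mirror a data set unless its name matches one of the exclusion patterns."""
    return not any(fnmatchcase(dataset_name, pattern) for pattern in exclude)

# Invented example names, not KeyCorp's real data sets.
datasets = ["PROD.ACCT.LEDGER", "PROD.WORK.SORT01", "PROD.REPORT.DAILY", "PROD.ATM.TXNLOG"]
mirrored = [name for name in datasets if should_mirror(name)]
print(mirrored)
```

Excluding by default-allow rules like this keeps newly created production data sets mirrored unless someone deliberately classifies them as disposable.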
Synchronizing the IBM DASD drives in Albany and Gaithersburg was a major problem. The data files to be written in Gaithersburg would back up in the queue, accumulating in a 16GB cache system at the IBM facility. This caused KeyCorp's production system in Albany to slow down and crash, forcing an initial program load -- the mainframe equivalent of a reboot.
"That's a pretty serious event when you have to take down your mainframe, essentially hitting Control-Alt-Delete," Rickert says. "We'd have to break the mirror for a while when tuning (of the XRC software) would be done. We'd revert to tape when the mirror was broken. We never stopped doing tape backups, because we wanted to have at least six months where this thing worked perfectly before we'd give up doing tape backups."
The Right Choice
In April 2002, KeyCorp conducted the first test of its mirrored environment, which confirmed that it had selected the right technology. "It was the first time our line-of-business and applications groups were able to access mirrored data to make sure it looked like the data at home (in Albany) looked, and it did," Trent says. "The test actually cemented the fact that the concept of mirroring was going to work and achieve our desired business goals."
Nevertheless, KeyCorp then instituted the chill period and a second bid solicitation from vendors because the IBM system repeatedly and unexpectedly suspended data mirroring as a result of what later was discovered to be improperly set performance parameters. Suspending the mirror caused the bank's mainframe production system to slow down, which in turn made critical services, such as Internet banking and automated teller machine services, unavailable to customers. "The solution was right, but the tuning was off," Trent says.
Miller says two factors had to be addressed. The first was properly configuring and sizing the system's hardware. The second was determining the right parameters for how much slowdown the system would tolerate in the production environment before suspending the mirror.
IBM project manager Esperanza Murdock, who was brought into KeyCorp after the two-month chill period, attributes many of the bank's problems to the inherent complexity of data mirroring coupled with a lack of communication between IBM and KeyCorp officials at the outset of the project.
"There was a preconceived notion that this was a turnkey solution as opposed to an integration project," Murdock says. "On a large-scale project like this one, there are so many pieces to the puzzle. A big piece is the network, but there also are the storage machines, the connectivity between machines, the automated software and the resources available. At the beginning of this project, there wasn't a level set of requirements and expectations and what the solution was all about."
To solve the problems, KeyCorp and IBM made several changes. First, they doubled the capacity of the cache system to 32GB. They also increased the number of data movers -- the systems in Gaithersburg that poll the bank's mainframe environment in Albany for data updates -- from four to five. And they added another OC3 channel to increase data throughput. The two companies also implemented IBM's Geographically Dispersed Parallel Sysplex (GDPS) software tool to automatically manage the data movers and the XRC environment. This eliminated the need for operators to manually enter and change commands, which in turn eliminated the possibility of human error.
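One way to see why a larger cache and more throughput help is to estimate how long a backlog takes to fill the cache when writes arrive faster than they drain. The sketch below is a simplified model under stated assumptions: the 80MB/sec. drain rate is hypothetical, and only the 16GB and 32GB cache sizes and the 100MB/sec. write rate come from the article.

```python
def seconds_until_full(cache_gb: float, write_mb_s: float, drain_mb_s: float) -> float:
    """Time for the backlog to fill the cache; infinite if the drain keeps up with writes."""
    surplus = write_mb_s - drain_mb_s  # MB/s accumulating in the cache
    if surplus <= 0:
        return float("inf")
    return cache_gb * 1000 / surplus  # decimal GB to MB

WRITE_MB_S = 100.0  # write rate quoted in the article
DRAIN_MB_S = 80.0   # hypothetical sustained drain rate, for illustration only

for cache_gb in (16, 32):
    t = seconds_until_full(cache_gb, WRITE_MB_S, DRAIN_MB_S)
    print(f"{cache_gb} GB cache fills in about {t / 60:.1f} minutes")
```

In this model, doubling the cache doubles the time available to ride out a burst, while raising the drain rate shrinks the surplus itself.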
"Once we brought in the GDPS software and took advantage of their automation constructs, we suspended a lot less," says Trent. "Before, we conservatively set the thresholds too low, and it would trip the system, and the mirror would suspend, and it would take a few hours to figure out why we'd suspend, and we'd have to talk about the best time to resume the mirror without problems."
In retrospect, "we hadn't really done our homework and gone to IBM's Red Book," a compendium of user experiences that IBM maintains for each of its products, Trent says. "We found ourselves reinventing thresholds and parameters that were already denoted by other implementations of the (XRC) software. We should have paid more attention to those things upfront. We all like to look at our own environments as if they're unique."
Because of the inherent complexity of data mirroring, Phil Poresky, an analyst at GlassHouse Technologies Inc., a research and consulting firm in Framingham, Mass., advises users to spend more time evaluating the support staff that will be implementing the technology, rather than the technology itself.
"Mirroring technology is mature, but salespeople do not like to tell you about all of the challenges and contingencies," Poresky says.