Steve Etzell saw for himself how quickly a minor unauthorized change can foul up a Web site. Etzell, director of Web technology at Select Comfort Corp. in Minneapolis, was on vacation when he got a call telling him the bed maker and retailer's Web site performance had gone "into the tank." The reason: A developer had let a business group user "twist his arm" into dynamically generating user-specific price quotes on a Web page that showed an entire category of Select Comfort's products. The site had previously sent users to a cached page that showed the same prices to everyone.
That change "seemed fairly innocuous," Etzell recalls. But the page "is accessed potentially 100,000 times per day . . . and you'll bring the server to its knees" by forcing it to dynamically create the page for each visitor, he adds.
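To see why the change hurt, consider a minimal Python sketch of the two approaches. It is illustrative only and is not Select Comfort's code: the function names, the page markup and the "beds" category are hypothetical. The 100,000 figure is the daily access count Etzell cites. A cached page is built once and reused, while a page with user-specific quotes has to be rebuilt for every visitor.

```python
from functools import lru_cache

# Counters so the sketch can show how often each version is rendered.
render_calls = {"cached": 0, "dynamic": 0}


@lru_cache(maxsize=None)
def cached_category_page(category: str) -> str:
    """Same prices for everyone: rendered once, then served from cache."""
    render_calls["cached"] += 1
    return f"<h1>{category}</h1><p>standard prices</p>"


def dynamic_category_page(category: str, user_id: int) -> str:
    """User-specific quotes: must be rebuilt for every single visitor."""
    render_calls["dynamic"] += 1
    return f"<h1>{category}</h1><p>quote for user {user_id}</p>"


VISITS = 100_000  # the daily access figure Etzell cites

for user_id in range(VISITS):
    cached_category_page("beds")
    dynamic_category_page("beds", user_id)

print(render_calls)  # {'cached': 1, 'dynamic': 100000}
```

In a real site each dynamic render would also involve database and template work, which is where the server load comes from.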
"About an hour later, they realized what they had done and turned the caching back on" for the category page, says Etzell. The long-term answer was to remove prices from the categories page and instead put them on Web pages that describe specific products. Since those are accessed far less often than the categories page, the site can deliver customized pricing without taking a huge performance hit.
It's that kind of unplanned, untested change that Web site managers hate and users love. Managing change on the Web is a "balancing act" between the need to keep your very public Web site up and running and the need to update it often enough to keep it attractive to visitors, says Etzell.
The more important your site, the more reliable it needs to be. The more transactions you do, the more it needs the kind of rock-solid stability once associated only with mainframes in a data center. Keeping embarrassing and costly outages to a minimum requires IT managers to create standard change-management policies, automate them as much as possible and outsource them if they must. Repeatable, consistent procedures, performed either by skilled support staff or automated tools, are the best way to cope with the pressures of a public-facing Web site.
The Web environment is unique because users demand changes within hours, not weeks. Changes to content aren't done by database administrators who first check the validity of the data and its effect on site performance, but by marketing managers. There's no single mainframe vendor to release updates or patches on a regular schedule, but rather a half-dozen or more suppliers that find and fix flaws in their products on their own schedules.
Then there's security, which can require major changes to sites as hackers discover new ways to bring them down. "There's a lot more changes going on in these Web-facing systems, with most of those relating to security," says Jason Lochhead, co-founder and chief technology officer at Data Return Corp., a Dallas-based managed hosting company. "You didn't have to worry so much on legacy systems because they're isolated from public traffic." Microsoft Corp. acknowledged in late January, for example, that its defenses had been inadequate after it was hit by denial-of-service attacks two days in a row. In response, Microsoft planned changes to its network architecture, including a backup set of Domain Name System (DNS) servers.
Even routine, planned changes can crash a site if they're done incorrectly. Just days before the hackers hit, Microsoft suffered a 22-hour outage that left many of its Web sites unavailable. The company blamed the problem on a faulty configuration change to the routers on its DNS network.
When Don Ursem compares the reliability of his Web site with that of the telephone system, he isn't kidding. Ursem is vice president of network operations at VocalPoint Inc., a San Francisco-based application service provider that lets consumers access Web sites via phone by converting HTML into voice responses. VocalPoint sells the service to telephone companies and in vertical markets such as health care. For the end user, "it's a telephone application," not a computer application, and "you expect your telephone to work all of the time," says Ursem.
But that's easier said than done. First, there's the volume: VocalPoint leases two T3 data lines, each of which can handle 644 simultaneous incoming calls and needs 135 servers to process them. Then there's growth: As VocalPoint adds T3 lines, Ursem expects that he'll be managing about 650 servers across three sites by June.
VocalPoint rolls out a new release of its voice Web-browsing software every three months and is converting about 30 Windows NT servers to Linux to support a new text-to-speech engine.
Then there are routine upgrades and patches to the databases, operating systems, network switches and EMC Corp. Symmetrix storage-area networks. Each must be tested for its effect on the system, rolled out in a coordinated way and tracked so that if any updates backfire, the offending change can be pulled out of production. And such caution is warranted. According to a survey conducted last year by Framingham, Mass.-based IDC, 46 percent of IT managers said software updates gone wrong played a role in their site outages.
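The test, roll out, track and pull back cycle described above can be sketched in a few lines of Python. This is a hypothetical illustration, not VocalPoint's or Intira's tooling: the `ChangeLog` class, the component names and the version strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Change:
    component: str    # e.g. "database", "os", "switch"
    old_version: str
    new_version: str


@dataclass
class ChangeLog:
    versions: Dict[str, str]                       # what is in production now
    history: List[Change] = field(default_factory=list)

    def apply(self, component: str, new_version: str, tested: bool) -> None:
        """Record and apply an update, refusing anything untested."""
        if not tested:
            raise ValueError(f"{component} {new_version} has not been tested")
        self.history.append(Change(component, self.versions[component], new_version))
        self.versions[component] = new_version

    def roll_back(self, component: str) -> Optional[Change]:
        """Pull the most recent change to a component out of production."""
        for i in range(len(self.history) - 1, -1, -1):
            if self.history[i].component == component:
                change = self.history.pop(i)
                self.versions[component] = change.old_version
                return change
        return None


log = ChangeLog({"database": "1.0", "os": "1.0"})
log.apply("database", "1.1", tested=True)
log.apply("os", "1.1", tested=True)
undone = log.roll_back("database")   # the database update backfired
print(log.versions)                  # {'database': '1.0', 'os': '1.1'}
```

Because every change is recorded with the version it replaced, the offending update can be removed without disturbing the others.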
Ursem, a former mainframe data center manager, ended up outsourcing to Intira Corp., a managed service provider in Pleasanton, Calif. The selection came after a grueling examination of seven San Francisco Bay area outsourcers to see how they matched up with his goals of outsourcing and automating change management.
Ursem wanted a service-level agreement that covered not only the servers and network, but also the incoming T3 lines and their links to the servers. He insisted on choosing his server hardware and software, which ruled out many outsourcers that require customers to use standard offerings.
He also insisted that the outsourcer's staff follow written procedures and that he have access to an online monitoring tool to ensure that those procedures were being followed. (For security reasons, Intira won't let Ursem into the data center running his applications.) Ursem demanded and got contractual commitments "that there would be no changes made to my environment without my prior approval," including updates to network switches, storage environments or software drivers.
Intira monitors the operation of its systems with Hewlett-Packard Co.'s OpenView, which would have been bogged down if Ursem had also used it to do continuous, real-time monitoring for any changes in every server.
Instead, Ursem uses StatePoint Plus, a change-management tool developed by Monroeville, Pa.-based Westinghouse Electric Co. for its own use and now sold to other companies. "I have the ability, from San Francisco, to link into the Intira data center and compare any set of servers against a reference server" to find and investigate any unexpected changes, Ursem says.
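The idea of comparing servers against a reference server can be illustrated with a short Python sketch. It does not represent how StatePoint Plus works internally, which the article does not describe. The server names, settings and version strings below are invented for the example, and each server's state is assumed to be available as a simple dictionary.

```python
from typing import Dict, Optional, Tuple

ServerState = Dict[str, str]  # setting or package name -> value or version
Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def diff_against_reference(reference: ServerState, server: ServerState) -> Drift:
    """Return {key: (reference_value, server_value)} for every difference."""
    drift: Drift = {}
    for key in sorted(reference.keys() | server.keys()):
        ref_val = reference.get(key)   # None means "missing on the reference"
        srv_val = server.get(key)      # None means "missing on this server"
        if ref_val != srv_val:
            drift[key] = (ref_val, srv_val)
    return drift


def audit(reference: ServerState, servers: Dict[str, ServerState]) -> Dict[str, Drift]:
    """Report only the servers that differ from the reference."""
    report: Dict[str, Drift] = {}
    for name, state in servers.items():
        drift = diff_against_reference(reference, state)
        if drift:
            report[name] = drift
    return report


# Hypothetical data for illustration.
reference = {"os": "linux-a", "tts_engine": "1.0"}
servers = {
    "server01": {"os": "linux-a", "tts_engine": "1.0"},
    "server02": {"os": "linux-a", "tts_engine": "1.1", "debug": "on"},
}
print(audit(reference, servers))
```

Only `server02` appears in the report, with its unexpected engine version and extra `debug` setting flagged for investigation.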
"I don't want things done manually by gangs of people," says Ursem. "Then you would suffer from human inconsistencies. I'm looking to reduce that. Anything I can automate, I will. Anything I can outsource, I will."
Old Rules, New Game
Select Comfort has built a multitiered process for making changes to its site, which can get as many as 8,000 unique visitors per day.
It created a content-management application that five people in marketing can use for live updates of information such as product descriptions and availability. But "we really try to keep the control tight," says Etzell.
Select Comfort follows a mix of written and unwritten rules, such as "don't change things at peak use time if you don't have to." The five marketing users can publish changes live on the site immediately or send them to a staging server, where changes can be reviewed before going live. The company also does weekly batch updates of changes, as well as a "major monthly push" in which more complicated functional changes (as opposed to content-based changes) are put into place, says Etzell.
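One way to picture such a multitiered process is as a routing rule that decides which release path a change takes. The Python sketch below is an assumption-laden illustration, not Select Comfort's actual policy: the peak-hour window, the `urgent` flag and the exact routing logic are hypothetical choices made for the example.

```python
from datetime import datetime
from enum import Enum


class ChangeType(Enum):
    CONTENT = "content"        # e.g. product descriptions, availability
    FUNCTIONAL = "functional"  # more complicated changes to site behavior


# Assumption for illustration: peak use runs from 9:00 to 20:59 local time.
PEAK_HOURS = range(9, 21)


def route_change(change_type: ChangeType, urgent: bool, now: datetime) -> str:
    """Pick a release path for a proposed change."""
    if change_type is ChangeType.FUNCTIONAL:
        return "monthly push"      # functional changes wait for the big release
    if not urgent:
        return "weekly batch"      # routine content rides the weekly update
    if now.hour in PEAK_HOURS:
        return "staging server"    # avoid touching the live site at peak time
    return "live"                  # urgent content, off-peak: publish now


print(route_change(ChangeType.CONTENT, True, datetime(2001, 3, 1, 14, 0)))
# staging server
```

Encoding even unwritten rules this way makes the exceptions visible: any change that bypasses the function is, by definition, arm-twisting.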
Like Ursem, Etzell has taken pains to document the change-management procedures for his environment, which includes Windows NT 4.0 servers and SQL Server 7.0 databases, as well as Austin, Texas-based Vignette Corp.'s StoryServer 5. He says he also tries to make sure everyone on staff knows who is responsible for which parts of the infrastructure so they can be notified of changes that might affect them.
The strongest change-management processes, says Etzell, were adapted from those already used by the technical services group responsible for Select Comfort's backbone enterprise resource planning, financial and other systems. These processes cover changes to infrastructure hardware and software, with written test plans before an update is put into service. But even then, "some arm-twisting goes on, and we'll change something on the fly," Etzell says.
Keeping those exceptions to a minimum is part of the art of change management. It's when you try to "short-circuit" your own procedures, Etzell says, that you get into trouble, which can mean a nasty wake-up call for the entire business.
Scheier is a freelance writer in Boylston, Mass.