In May 1999, online auction site eBay suffered one of the earliest and most publicised e-commerce outages of the dot-com era when its servers came crashing down, bringing online bidding to a screeching halt for an insufferable five hours.
The historic crash could have been prevented by clustered servers deployed as a fail-over to keep the site running in the event of such an outage. "eBay's servers were not clustered at that time, and their crash was the type of failure that clustered servers can prevent," says Gordon Haff, an analyst at Illuminata.
Be it with multiple servers clustered together in a "scale-out" fashion for maximum performance or with a single additional server clustered to a primary server to provide fail-over and high system availability, clustering is appearing in business networks more often. Even stand-alone servers with multiple CPUs -- known as scale-up systems -- can be clustered internally to deliver both high performance and high availability.
The challenge remains how best to implement server clustering to meet an organization's needs and budget. Clustering at least two servers for fail-over protection is now a commonplace practice within business networks. But clustering multiple servers has been slower to catch on. Multiple server clustering allows a company to scale out its server capacity by purchasing smaller servers, which are CPU-for-CPU less expensive than larger SMP (symmetric multiprocessing) servers and which can be added on an as-needed basis.
This form of multiple server clustering for extra processing horsepower is known as high-performance clustering and is commonly found in high-volume, task-specific Internet and telco server clustering architectures as well as in pure scientific supercomputer deployments.
The ultimate promise of clustering lies in the combination of both fail-over and high performance. But users and application developers are concerned that not all applications are ready to be clustered reliably beyond two to four servers, an issue that has delayed delivery on clustering's promise.
Tom Devine, manager of system services at Life Line Systems in Framingham, Mass., oversees such a hybrid clustered network. Life Line designs and builds home communicators that replace a telephone with a special speakerphone that can be activated from a wrist bracelet or pendant in an emergency.
Supporting 340,000 Life Line subscribers in North America, Devine's team uses a two-node, or two-server, cluster to deliver both backup protection and added performance. Now in the process of upgrading an older Unix cluster to a Compaq Tru64 Unix cluster with OpenVMS, Devine says the cluster provides fail-over protection and high-performance capabilities for Life Line's systems.
"The cluster obviously gives you more power," says Devine, adding that because Life Line is often used in emergency situations, there is no room for a network pause. "We have to be 24-by-forever -- we are life critical. We take up to 24,000 calls a day, 2,400 in a single hour. We can't have downtime."
Life Line's Unix cluster delivers both high performance and system fail-over because much of it is built on expensive, proprietary technology that laid the foundation of modern clustering. But clustering less expensive Intel-based hardware and open-source software like Linux can also act as a lower-cost alternative for companies seeking to squeeze the most out of their existing systems.
MDS Proteomics of Toronto uses Intel-based IBM xSeries server clusters running Linux to compute protein sequence analysis, according to the company's CIO, Chris Hogue. Being able to inexpensively scale out server power as needed via clustered servers is a plus for MDS.
"The price-performance points you hit on a cluster of Linux machines is incredible because they are commodity items," says Hogue. "And you can kick a power cable out of one of the nodes and the [management system] knows where the job was sent and will redirect it to a part of the cluster that is still operating."
Balancing clustering's fail-over and performance benefits greatly depends on what type of application a company is running. For example, straightforward applications such as Web hosting and purely scientific applications cluster well, whereas complex applications such as transactional databases and airline reservation systems are a little trickier to cluster, Illuminata's Haff says.
As a matter of sheer economics, clustered single-CPU servers are less expensive than large scale-up SMP servers, says Haff. But this scale-out approach generally handles simple, easy-to-disperse applications such as Web hosting. The need for stability when running data-intensive back-end applications such as database software still has most customers choosing scale-up servers for the database, he says.
The improved clustering capabilities of Microsoft's operating environment are also making clustering more accessible to many non-Unix customers, and, much like using Intel chips instead of RISC processors, they can bring down the cost of clustering.
Roj Snellman, director of IT at Melbourne, Fla.-based Intersil, a manufacturer of silicon technology for WLANs (wireless LANs), uses two two-node Microsoft SQL Server clusters to support four two-node clusters running Microsoft's BizTalk Server EAI (enterprise application integration) and b-to-b integration tool.
"Clustering is a much more cost-effective way to achieve always-on systems," Snellman says, adding that he uses clustering fail-over capabilities to introduce previously tested changes into the production environment. "Most of our fail-over is done as we make changes. The clustering capability lets us bring down one machine to make changes without affecting the application," he explains.
While Microsoft has handily dominated the Transaction Processing Performance Council (TPC) benchmarks since the release of Windows 2000, SQL Server 2000's scalability clustering is not without critics. But John Enck, an analyst at Gartner, defends Microsoft's offerings by pointing out that a majority of applications available today weren't written to be clustered. Application developers need to step up and make it easier for customers to cluster applications atop Windows, he says.
"Microsoft has made a lot of investments in clustering, but they can only do so much," Enck adds. "Microsoft has made a good foundation for clustering, with the tools and APIs. The rest is just a matter of timing."
Rolling Stone magazine, for instance, has been using SQL Server for four and a half years to run its Web site, according to Andy Rice, director of technology at the New York-based music magazine. Rolling Stone uses a configuration with 10 SQL Server machines, distributed on both coasts for redundancy.
"We haven't had the whole site go down," Rice says. "Most of the boxes never even go down unless I am doing an upgrade."
RollingStone.com's concerns are familiar to many companies -- with more business done on the Web, all servers simply must not go down at the same time. Online performance delays and site outages are bad for business, not to mention bad PR.
"For years we talked about high availability. Now we're at the point where companies need continuous availability," adds Philip Russom, an independent industry analyst in Waltham, Mass.
Who wants what?
Vendors preach clustering more and more nodes, but customers typically still use the basic two-node configuration.
"This argument that 'My system supports more nodes than your system' just isn't grounded in what customers want," Russom says. "Most people feel the likelihood of one node failing is small, so most people think that [clustering] two nodes is enough."
Intersil's Snellman says that scale-out clustering beyond the two nodes he already has may not be worth the trouble.
"There probably would be some value in a cluster with more nodes, but that's an architectural decision that is more specific to a particular application," Snellman explains. "Clustering gets pretty complicated, and complexity almost always has some element of danger."
Gartner's Enck believes that more companies would take advantage of scale-out clustering if more applications were designed to run across multiple small servers. But the fact is that "most of the application vendors write applications that are not capable of being clustered in a high-performance cluster," a condition that keeps most mission-critical database applications running on historically reliable SMP servers with only a single server clustered for fail-over, says Enck.
But the major database vendors, including IBM, Oracle, Microsoft, and Sybase, continue to push their ability to cluster better than the others.
Earlier this month at OpenWorld, Oracle showcased its RAC (Real Application Clusters), which is essentially the Redwood Shores, Calif., company's scale-out approach. Unlike Microsoft's approach to clustering, where application workloads are carefully distributed across all the server nodes, Oracle's RAC hosts the application in a single server and assigns additional servers computing tasks through a shared cache memory.
In doing this, RAC enables users to plug in a new system for increased scalability without having to reconfigure the workload across the entire set of machines, according to Ken Jacobs, vice president of data server product management at Oracle -- also known as "Dr. DBA" to the Oracle devout.
"There's really only one real-world clustering solution for out-of-the-box applications, and that's sharing cache across the configuration," Jacobs says. "It's a deep technical and philosophical argument. 'Shared nothing' only works if you're willing, philosophically, to limit yourself to using applications that are intended and customized for the shared-nothing clustered approach."
But neither Oracle's nor Microsoft's approach to clustering is the end-all solution. "There is no panacea for scalability clustering," says Jeff Ressler, lead product manager for Microsoft's SQL Server.
Ressler says that in the majority of scenarios, the scale-up clustering approach works best for customers. But companies in transaction-heavy environments that tend to spike -- such as e-commerce companies -- find that scaling out enables them to be more flexible with their systems: They can allocate systems to the Web site when they expect more action than usual, and then reallocate those systems when the peak period ends.
Compaq is also working with Oracle to bring simplicity to clustering servers for databases, and the company already offers preconfigured, clustered server hardware based on Oracle's Certified Configurations software, says Mel Lewandowski, marketing director for Compaq's high-performance division.
"One of the things we are working on with [Oracle] is to be able to increase the database size by adding a node," says Lewandowski. "Today, you buy bigger and bigger SMP systems. But one of the advantages [of clustering] will soon be that I can simply add on another node." This ability to scale out a server cluster with smaller server nodes adds a flexibility and a nimbleness to clustering than can spare customers the expense of adding large, expensive SMP servers every time they need to add capacity.
"That nimbleness is getting increasingly attractive. Rather than buying a giant SMP cluster, they buy a couple of nodes when the Christmas orders rush in," Lewandowski notes.
According to Dan Vlamis, president of Vlamis Software, a consulting shop in Liberty, Mo., that specializes in Oracle software, clustering's challenge for the future will be to build on companies' interest in the technology and turn it into implementation.
"Customers are interested that clustering capability is there. Do they have immediate needs for it? Probably not. But they're interested in something they can grow with, and they want to know it's there, even if they don't want it right away," Vlamis adds.
Going beyond clustering
Still largely the domain of academia and computer scientists, grid computing takes clustering one step further by linking multiple, geographically dispersed servers and enabling users to plug in and access software, services, and computer resources as if the grid were one virtual supercomputer.
Although issues such as security and manageability still stand in the way of grid computing as an enterprise business architecture, companies such as Sun, IBM, Compaq, and Hewlett-Packard have each had significant grid computing wins within scientific and academic environments.
IBM recently announced a massive computing grid centered at the University of Pennsylvania. The U.-Penn grid will connect the servers and databases of hospitals around the world to share information on mammogram procedures and research, explains Dave Turek, vice president of high-performance computing at IBM.
"Part of the issue is the raw volume of data each of the hospitals engaged in collecting mammographic data will generate -- up to 7TB of data a year -- so five or six hospitals alone will generate the equivalent amount of data in the Library of Congress," Turek adds.
Participating hospitals will have access to all of the data available on the grid without having to individually shoulder the massive infrastructure burden of all available data, Turek says.
"The volume of data by itself is awe-inspiring. As a result [of the grid], the data will be kept regionalized so you don't have this tremendous infrastructure burden," Turek continues.
In addition to the need for increased security and ease of management in distributed grid computing networks, another potential barrier could turn out to be the availability of IT talent, which Brent Sleeper, a partner at San Francisco-based research and consulting firm The Stencil Group, calls a requirement for the successful implementation of grid computing.
"All of this places much more emphasis on high-level architectural skills than on a checklist of programming languages on a resume," Sleeper says.
"It also continues the trend toward many of the programming skills that we've seen get a premium in the last few years: object-oriented development models, emphasis on containers like the J2EE [Java 2, Enterprise Edition] framework, and so on. And security, authentication, reliability -- all of these things that we've often swept under the rug of our internal systems -- must be addressed in any kind of distributed programming model," Sleeper notes.