Analysts discuss key scalability and availability issues. Roundtable participants: Dr Kevin McIsaac, senior research analyst, server infrastructure strategies, META GROUP; Peter Steggles, senior analyst, systems, IDEAS INTERNATIONAL; Peter Hind, program manager, InTEP Forum, IDC AUSTRALIA.
1) What is your definition of systems or solutions scalability, and do you see this as a key issue for IT planners and managers?
McIsaac: Academics define a highly scalable platform as one capable of extracting near-linear work increments from each additional component; for example, a second processor delivers 2x the throughput and a third delivers 3x. We define a scalable solution as one that can absorb additional resources to support more users on a workload without affecting response time or significantly increasing the system's administration requirements.
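One standard way to quantify why real platforms fall short of that near-linear ideal is Amdahl's law, which the panel does not name but which makes the definition concrete. The sketch below is illustrative only, and the serial fractions are assumed figures:

```python
def amdahl_speedup(n_cpus: int, serial_fraction: float) -> float:
    """Amdahl's law: speedup from n CPUs when serial_fraction of the
    workload cannot be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

# With no serial work, scaling is perfectly linear: a second CPU gives 2x.
print(amdahl_speedup(2, 0.0))   # 2.0
# With 10 per cent serial work, a fourth CPU yields well under 4x.
print(amdahl_speedup(4, 0.1))   # ~3.08
```

The second call shows why "second processor gets 2x, third gets 3x" is an upper bound rather than an expectation.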
While scalability is a significant issue, we believe IT organisations must make infrastructure agility their primary design goal and value discipline. To make this transition, IT shops must focus on three requirements for infrastructure agility: architecture, technology, and skill. Focusing on just one or two of these requirements may work in the short term, but will quickly lead to the new infrastructure becoming legacy infrastructure: that is, infrastructure that significantly resists modification, evolution, or integration.
2) How can organisations be caught short in their scalability readiness? (Can you provide any examples? You could consider servers, hosts, network architectures, e-commerce applications, e-mail hosts, database applications, mission-critical applications, operating systems.)
McIsaac: Storage growth will be a major challenge for the next five years. E-business, e-mail and rich media (graphics, audio and video) are driving storage capacity growth of 100 to 150 per cent. The crucial question is how to operationally manage that data growth with minimal increase in staffing. Backup and recovery (B&R) services is another area that will see significant development during the next 24 months, as application recovery time requirements approach zero.
Users should avoid a "one size fits all" approach and an over-reliance on traditional scale-up architectures. A robust, scalable server infrastructure requires a best-of-breed approach. IT groups should examine a "strength in numbers" scale-out approach for the network, firewall, application-server and Web-server layers.
With IT budgets remaining tight through 2002, organisations should consider using multiple small Web and application servers to extend scalability at low cost. This approach is cheaper and avoids both over-sizing and under-sizing the infrastructure.
3) What recommendations would you make to IT managers so they could ensure sufficient, cost-effective scalability in key IT infrastructure and applications?
McIsaac: Scalability and availability must be approached from two dimensions - scale up and scale out. Despite some vendor marketing hype, we continue to recommend scale up for back-end DBMS OLTP servers, where a single-instance database of record must be maintained. However, a scale-out approach is practical for middle-tier application and Web servers. Users should note that scale out is primarily an application architecture issue: that is, distributed functionality and presentation services with minimal state maintenance.
IT organisations are implementing scale out via server farms with large numbers of smaller (one- to two-CPU) systems. We believe this is primarily due to several factors - immature (for Unix) or non-existent (for NT/Win2000) partitioning tools, non-linear server pricing, and human nature. It is simply easier to "rack and stack" numerous Web and application servers, the capacity upgrade options are more granular, and a larger server typically costs two to three times more than a comparable number of CPUs spread across multiple smaller systems.
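The non-linear pricing point is easy to make concrete with a back-of-the-envelope comparison. The list prices below are hypothetical, chosen only to illustrate the two-to-three-times premium described above:

```python
# Hypothetical list prices, for illustration only.
SMALL_2CPU_PRICE = 10_000   # one 2-CPU "rack and stack" box
LARGE_8CPU_PRICE = 90_000   # one 8-CPU scale-up server

# Four small boxes deliver the same eight CPUs as the big server.
scale_out_cost = 4 * SMALL_2CPU_PRICE
premium = LARGE_8CPU_PRICE / scale_out_cost
print(f"Scale-up premium: {premium:.2f}x")  # 2.25x with these figures
```

With these assumed figures the large server carries a 2.25x premium, squarely in the two-to-three-times range; real quotes will vary by vendor and configuration.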
4) What are the main IT system availability challenges that organisations face? (With regard to infrastructure such as host servers, networks, storage, operating systems, or services such as telecommunications.)
McIsaac: Selecting a high availability (HA) solution is like choosing insurance: the premium is weighed against the possible loss of life or property. Selecting a cost-effective HA solution means weighing the cost of the HA solution's hardware and software against the cost of losing the application for some period. The definition of HA is crucial in determining what problem is being solved. Availability is always a user-defined term. If the application is performing poorly, is it unavailable? If mobile users cannot connect because the phone company's cable was accidentally cut, is the application unavailable? If a table in the database is currently offline, is the entire application unavailable?
Platform vendors are quick to provide availability numbers; many have been pointing their IT customers at five nines (99.999%) as a meaningful way of defining customer availability. The five-nines goal generally relates to platform hardware with meagre amounts of system software; however, users at the end of complicated network and multi-layer application suites are lucky to see three nines.
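The gap between the five nines vendors quote and the three nines users actually see is easiest to grasp in minutes of downtime. A minimal sketch of the arithmetic:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Unscheduled downtime allowed per year at a given availability level."""
    return (1.0 - availability) * MINUTES_PER_YEAR

print(downtime_minutes_per_year(0.99999))  # five nines: ~5.3 minutes/year
print(downtime_minutes_per_year(0.999))    # three nines: ~526 minutes (~8.8 hours)/year
```

A hundredfold difference in permitted downtime separates the hardware brochure from the end-user experience.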
Exploding data growth will eliminate basic recovery as an option (due to unacceptably long time frames), forcing IT organisations to rely on replication and clustering. DBMS vendors will provide improved point-in-time recovery options (such as Oracle's Flashback Query) to eliminate the need for full restore.
Although technology's price/performance has improved, the decision-making process remains predominantly a business decision, not a technology one. Best-practice IT organisations should assist business owners to make appropriate HA decisions based on business reality.
5) What analysis do you recommend organisations complete prior to investing in appropriate system and solutions availability measures?
McIsaac: The continuous availability and rapid recovery that high availability demands require considerable effort on the part of operations groups. Success rides on system design, automation, and application design. In most cases, operations groups have experience with automation but very little with either system or application design. To make the leap, they must add new skills via training or hiring. This transition reflects Meta Group's belief that operations groups need Level 1 and 2 technical support skills in their centres of excellence.
The high availability process drives failure-proof, rapid application recovery plus availability management. Continuous operations drive non-disruptive changes, non-disruptive maintenance, and online application maintenance. All these items will require operations groups to partner, as equals, with application developers as well as technical support and engineering. Just as important, operations and its customer-advocacy group must also build more complete quality-assurance-style scripts for analysing applications.
6) How should acceptable availability performance levels be determined, and managed (such as service level agreements)?
McIsaac: IT has traditionally defined availability using averages, like the famous "five nines"; however, this measure provides little real value to customers. For example, 99.5 per cent availability translates into about 50 minutes a week of service unavailability. If this occurred in one hit, would it be acceptable to the business?
More meaningful parameters to describe availability performance are mean time to failure (MTTF) and mean time to repair (MTTR). We believe successful IT groups will move to these approaches during the next three to five years as smarter customers become painfully aware of the "availability average" hoax.
Instead of providing 99.5 per cent average availability, operations teams can offer 20 days between failures and 20 minutes time to repair at a predefined cost. This kind of performance contract is easier to turn into business-speak and has real operational teeth.
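The two contract styles connect through the standard steady-state formula, availability = MTTF / (MTTF + MTTR). A quick check of the example above, using that textbook formula (not one the panel states):

```python
MINUTES_PER_DAY = 24 * 60

def availability(mttf_minutes: float, mttr_minutes: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_minutes / (mttf_minutes + mttr_minutes)

# 20 days between failures, 20 minutes to repair:
a = availability(20 * MINUTES_PER_DAY, 20)
print(f"{a * 100:.3f} per cent")  # 99.931 per cent
```

The MTTF/MTTR contract implies an even better average than 99.5 per cent, yet it also caps any single outage at around 20 minutes, which is exactly the operational teeth the averages-based figure lacks.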
Moving to mean-time availability will force IT to rethink recovery and production transition procedures, driving operations to a more valuable service model and a thorough housecleaning of 20-plus years of availability mythology and sacred cows.