SAN FRANCISCO (04/28/2000) - Horror stories of system administrators' mistakes, overlooked system aspects, and misconfiguration abound. Sometimes these factors combine to create more downtime than any one does individually; sometimes, in an attempt to correct a problem, administrators make additional mistakes that cause still more downtime. This month, I'll offer you advice on how to keep your Sun Microsystems Inc. server from failing. From the sublime to the ridiculous, all of these rules are gleaned directly from real-life experiences.
In each case, had the rule been known and followed, there would have been less downtime -- or none at all.
For the purposes of this column, a cluster is defined as two or more general-purpose systems that monitor one or more services via a connected network and storage facility. Clusters keep services available by moving them from failed to working members of the cluster.
Although clusters are the most obvious way to maximize uptime and achieve high availability (H/A), you can further increase your system's uptime by adhering to the following H/A guidelines. Even within a cluster, these guidelines can improve facility availability, reduce the number of failures, and promote faster recovery from those that do occur. The fewer failures, the fewer times a cluster will be forced to resort to failover services (i.e., those services that help a system recover from an error). These guidelines can also help track down vulnerabilities within a facility. As always, your contributions of new guidelines are welcome.
Guideline 1: Choose an appropriate level of availability
Two main factors determine a sysadmin's choice of an appropriate level of availability for a given facility: need and cost. More availability costs more money. Determine how much downtime you can tolerate, including the frequency of downtimes and the minimum time allowed to recover from failure. The result yields the number of single points of failure (SPOF) your facility can have, and the level of H/A required to meet the goal.
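To make the downtime question concrete, here is a minimal Python sketch of the arithmetic. The function names and the sample targets are my own illustrations, not figures from any particular site: the sketch converts an availability percentage into a yearly downtime budget, then checks an expected outage pattern against it.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct):
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100.0)


def meets_target(availability_pct, outages_per_year, hours_per_outage):
    """True if the expected outages fit inside the yearly downtime budget."""
    expected = outages_per_year * hours_per_outage
    return expected <= allowed_downtime_hours(availability_pct)


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show how fast the budget shrinks.
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% allows {hours:.2f} hours of downtime per year")
```

Both the frequency of outages and the time to recover from each one feed into the check, which is why the guideline asks you to pin down both numbers.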
To determine which components should be clustered and which merely need H/A, follow these simple guidelines. Generally, H/A techniques are used where clustering cannot be. For example, clustering cannot be done for the network infrastructure, or for systems that don't maintain their state on disk (as disk is used to move the state of a service within a cluster). H/A techniques are also used for components that are important, but not critical. (If the facility will still function if a component fails, it's important; but if performance will be degraded, service temporarily interrupted, or a user state reset, it's critical.)
Guideline 2: Remove SPOFs
Single points of failure are common in most facilities, and as such are difficult to completely remove. There are machine aspects (single CPU, network interface, etc.), network aspects (single network cable, single switch, etc.), and facility aspects (single Web server, single Internet link, single power connector or power grid, etc.) to the problem. As a general rule, the number of SPOFs in each portion of your system should be more or less comparable to the number in other portions. For example, having machines with zero SPOFs connected to networks with many means that you've spent too much money on systems and too little on networking. You should decide on the level of fault tolerance required by your application, and make sure that level is achieved at the machine, network, and facility levels.
Note that SPOFs can be complicated to remove. However, by following the data-flow paths and drilling down into the functionality of individual components, potential problems can be spotted.
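Following the data-flow paths can be partly mechanized. The sketch below is one possible approach, not a tool Sun ships: it treats the facility as a list of links between named components, removes each component in turn, and reports those whose loss cuts the client off from the service. The component names are invented for illustration.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search over undirected links, skipping one removed node."""
    if removed in (start, goal):
        return False
    neighbors = {}
    for a, b in links:
        if removed in (a, b):
            continue  # a link attached to the removed component is unusable
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start, goal):
    """Components whose removal disconnects start from goal."""
    nodes = {n for link in links for n in link} - {start, goal}
    return sorted(n for n in nodes
                  if not reachable(links, start, goal, removed=n))


# Hypothetical facility: one router feeding two switches and one Web server.
LINKS = [
    ("internet", "router"),
    ("router", "switch1"),
    ("router", "switch2"),
    ("switch1", "web1"),
    ("switch2", "web1"),
]

if __name__ == "__main__":
    print(single_points_of_failure(LINKS, "internet", "web1"))
```

A model like this only finds what you put into it. Cables, power feeds, and shared conduits have to be listed as components before they can show up as SPOFs.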
Sun computers don't automatically use a second network path if the first path fails. This can be enabled on enterprise servers by using alternate pathing, network trunking, the NAFO feature in Sun Cluster, or even by writing some simple scripts that periodically run on the machine. None of these, however, are enabled by default.
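As a rough idea of what such a periodically run script might decide, here is a minimal Python sketch. It is an assumption-laden outline rather than a working failover tool: the interface names are examples, and the `is_alive` probe (for instance, a ping sent through a given interface) and the actual step of moving traffic to the new path are site-specific and left out.

```python
def choose_path(current, paths, is_alive):
    """Keep the current path while it works; otherwise pick the first working one.

    is_alive is a caller-supplied probe that returns True when a path passes
    traffic. Returns None when every path is down.
    """
    if is_alive(current):
        return current
    for path in paths:
        if path != current and is_alive(path):
            return path
    return None


if __name__ == "__main__":
    # Hypothetical interface names and a faked probe, for illustration only.
    paths = ["hme0", "qfe0"]
    link_state = {"hme0": False, "qfe0": True}
    print(choose_path("hme0", paths, lambda p: link_state[p]))
```

Staying on the current path while it still works avoids flapping between interfaces on every run.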
Consider a pair of load balancers running with one active load balancer and one passive load balancer. The second load balancer becomes active if the first becomes inactive due to failure. Now consider what happens if the cable between the upper (active) load balancer and the lower Web server fails. The Web server still has a connection to the load balancers, but the working connection is to the lower balancer, which is passive and won't pass packets. Therefore, one half of the Web server capacity would be lost in this configuration because of a simple wire or Ethernet interface problem. To correct this, insert a pair of switches to cross-connect each Web server to each load balancer with a reliable path.
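The load balancer scenario can be checked with a small model. The Python sketch below is a simplification with invented component names: it assumes that the passive balancer passes no packets and that Web servers do not forward traffic for each other, then computes what fraction of the Web servers the active balancer can still reach.

```python
def has_path(links, start, goal, blocked):
    """Depth-first search that never passes through a blocked component."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for a, b in links:
            if node in (a, b):
                nxt = b if node == a else a
                if nxt not in seen and nxt not in blocked:
                    seen.add(nxt)
                    stack.append(nxt)
    return False


def serving_fraction(links, active_lb, passive_lb, web_servers):
    """Fraction of Web servers still reachable from the active load balancer."""
    reachable = 0
    for server in web_servers:
        # The passive balancer and the other Web servers never relay packets.
        blocked = {passive_lb} | (set(web_servers) - {server})
        if has_path(links, active_lb, server, blocked):
            reachable += 1
    return reachable / len(web_servers)


WEB = ["web_upper", "web_lower"]

# Each Web server cabled directly to each load balancer.
DIRECT = [("lb_upper", "web_upper"), ("lb_upper", "web_lower"),
          ("lb_lower", "web_upper"), ("lb_lower", "web_lower")]

# Every device cabled to both switches instead.
SWITCHED = [(dev, sw) for dev in ("lb_upper", "lb_lower", "web_upper", "web_lower")
            for sw in ("switch_a", "switch_b")]

if __name__ == "__main__":
    broken = [l for l in DIRECT if l != ("lb_upper", "web_lower")]
    print(serving_fraction(broken, "lb_upper", "lb_lower", WEB))
```

In the direct-cabled layout, losing the one cable leaves the lower Web server attached only to the passive balancer. In the switched layout, no single cable failure removes any server from service.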
Much of the rest of this article will show ways you can reduce SPOFs and increase individual system resilience.
Workstations are SPOFs; they simply aren't resilient to failure. Uptime can be increased via external storage or by mirroring a system disk on a workstation with two bays.
Workgroup server SPOFs
Sun workgroup servers are more reliable because they have fewer SPOFs. Mirrored disks are the first step, because they can be hot swapped, which reduces repair downtime. When properly configured, the systems also have N+1 power, which means that a single power unit can fail and still leave the system functional.
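The N+1 condition is easy to check on paper. As a minimal sketch, with hypothetical wattages rather than real Sun specifications, the function below tests the worst case: whether the remaining supplies can carry the load after the largest one fails.

```python
def survives_single_supply_failure(supply_watts, load_watts):
    """True if the load is still covered after the largest supply fails."""
    if len(supply_watts) < 2:
        return False  # a lone supply is a SPOF by definition
    return sum(supply_watts) - max(supply_watts) >= load_watts


if __name__ == "__main__":
    # Hypothetical figures: three 300 W supplies feeding a 550 W load.
    print(survives_single_supply_failure([300, 300, 300], 550))
```

Adding boards or disks later raises the load, so a system that was N+1 at installation may no longer be after an upgrade.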
During power surges, either all or none of the power supplies tend to be affected, and only some of Sun's computers and storage -- E250, E10000, A3500, A1000, and D1000 -- can be dual-attached to power outlets for plug redundancy.
Enterprise server SPOFs
Sun enterprise servers can be configured to provide very few SPOFs. They include N+1 power and will recover from system card, I/O card, CPU, memory, or I/O controller failure, typically by crashing, marking the component as down, and rebooting with that component offline. All of those components must be configured as redundant for that RAS (reliability, availability, serviceability) feature.
The system needs dual system cards to recover from a single failure, and dual I/O paths for recovery from an I/O problem. For example, it should have two QFE controllers on two I/O cards, each connecting to each production network.
Likewise, storage should be dual-attached using redundant I/O cards and controllers. Dynamic reconfiguration and alternate pathing (DR/AP) features should be enabled and configured so that alternate paths can be automatically invoked. The only unavoidable single point of failure in enterprise servers (except for the E10000) is the backplane. If that goes, the system is down. In the E10000, the only SPOF is the control board, but two of them can be configured into the system with manual failover if one fails.
Network and power SPOFs
Many a redundant facility has failed due to supply-chain problems. Consider, for example, a site with dual T1 links to the Internet, each from a separate T1 provider, with BGP used to manage traffic on the links, the cables ending in separate routers connected to a pair of redundant switches, and so on. Both T1s probably follow the same path outside, where they're strung along the same set of telephone poles. From beginning to end, the data and power chains of your facility need to be appropriately redundant and separate.
Guideline 3: Redundant storage
To automatically survive a single disk failure, all storage must be RAID configured; manual recovery on boot disks can be done via disk duplication.
Unfortunately, RAID only solves the disk problem, not disk bus, controller, cache, or disk array power problems. To confront those issues, two or more devices are needed for each storage component. Mirrored boot disks should be in separate devices with separate power, separate I/O cards and boards, and separate bus cables. The new Sun D130 will be perfect for this use (it's one rack unit in height, holding up to three Ultra SCSI disks). In the meantime, D1000s or multipacks are the best alternatives. For nonboot storage, duplication is the watchword. Some sites with admirable levels of paranoia will take A3500 arrays, stripe the data within the array, and mirror it to a second array. This provides a pricey but reliable storage facility. Avoid daisy-chaining storage if uptime is a goal.
Hot spare disks are used to rebuild a RAID set when a RAID member fails. Hot spares can be configured at the hardware RAID level (via RM6 for Sun RAID arrays), and at the volume management level (Solstice DiskSuite or Veritas Volume Manager are typically used). Be sure that you understand how to use the hot spare with your system before configuring it. For instance, with Veritas Volume Manager, a hot spare is only used within its disk group.
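Because a spare only helps the disk group it belongs to, it's worth auditing which groups have none. The Python sketch below works on a hand-built inventory. The dictionary layout and disk names are hypothetical, and nothing here queries the volume manager itself.

```python
def groups_without_spare(disk_groups):
    """Names of disk groups that hold data disks but have no hot spare.

    disk_groups maps a group name to {"data": [...], "spares": [...]}.
    """
    return sorted(name for name, group in disk_groups.items()
                  if group["data"] and not group["spares"])


if __name__ == "__main__":
    # Hypothetical inventory, typed in by hand for illustration.
    inventory = {
        "rootdg": {"data": ["c0t0d0", "c1t0d0"], "spares": []},
        "datadg": {"data": ["c2t0d0", "c3t0d0"], "spares": ["c2t5d0"]},
    }
    print(groups_without_spare(inventory))
```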
Another storage issue is recovery from disk failure. For RM6-managed hardware, the RM6 recovery guru gives systematic advice on how to repair a problem.
Trying to repair a problem manually without the guru can lead to serious complications. Likewise, A5X00 storage must be managed by the luxadm command.
Don't change disks at random without alerting the system via the appropriate facility. The added complication is that VxVM is sometimes layered on top of RM6 or A5X00 storage. In these cases, follow the documents or get help from the vendors. Doing steps in the incorrect order (recovering in RM6 before fixing things in VxVM, for instance) can cause more damage than the original problem.
Finally, be sure to ground all storage (and systems for that matter) according to the installation documents. This is especially important in clusters, where storage is dual-attached to systems. Storage failures have been traced to poor grounding in these circumstances.
Veritas Volume Manager
To increase RAS and decrease SPOFs, a few aspects of Veritas Volume Manager deserve attention. Dirty region logging and RAID 5 logging are options when VxVM takes care of your RAID management. These features use minimal disk space and maintain a bitmap of the RAID system's state, which allows for efficient problem recovery. They are also crucial to data integrity, so use them or risk losing data. Log devices should be on disks separate from the data they're logging. For example, you could mirror a pair of disks and log to a separate disk, the remainder of which could be used in another mirrored pair.
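To see why a small bitmap speeds recovery, consider this toy Python model of a dirty region log. It illustrates the general idea only, not VxVM's actual on-disk format or algorithm, and the class and method names are my own. A region is marked dirty before a write starts and cleared when the write completes on all mirrors, so after a crash only the regions still marked need resynchronizing.

```python
class DirtyRegionLog:
    """Toy bitmap tracking which regions of a mirrored volume have writes in flight."""

    def __init__(self, volume_size, region_size):
        self.region_size = region_size
        region_count = (volume_size + region_size - 1) // region_size
        self.dirty = [False] * region_count

    def _regions(self, offset, length):
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        return range(first, last + 1)

    def begin_write(self, offset, length):
        # Mark regions dirty before any mirror is written.
        for region in self._regions(offset, length):
            self.dirty[region] = True

    def end_write(self, offset, length):
        # Simplification: clear immediately, ignoring overlapping writes.
        for region in self._regions(offset, length):
            self.dirty[region] = False

    def regions_to_resync(self):
        """Regions that must be copied between mirrors after a crash."""
        return [i for i, is_dirty in enumerate(self.dirty) if is_dirty]


if __name__ == "__main__":
    log = DirtyRegionLog(volume_size=1000, region_size=100)
    log.begin_write(900, 50)
    log.end_write(900, 50)      # this write completed
    log.begin_write(150, 100)   # crash happens before this one completes
    print(log.regions_to_resync())
```

Without the log, recovery would have to compare or copy the whole volume. With it, only the regions that were being written at the time of the crash are touched.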
One other Veritas issue: VxVM's dynamic multipathing feature (DMP) can conflict with Sun's DR and AP features, resulting in disks that logically disappear from a system even though they're still physically attached. These problems seem to be resolved in newer VxVM releases (2.6 and beyond). Still, it's worth checking with Veritas before doing a new install, especially if you're having disk access problems on a current system.
Guideline 4: Patches
Patches, which include firmware updates, are the bane of system administration, both because of the time and effort it takes to manage them and because of the risks inherent in installing them in production environments. To reduce the risks, implement and test them in a staging environment that's as similar to production as possible. For instance, if production uses enterprise servers, so should staging. Likewise, storage types should match. Most sites seem to have fire-and-forget patch methodologies, so install patches before production, and only update them if a problem is found.
Guideline 5: Monitoring
At the risk of stating the obvious, I'll say that all aspects of your facility must be monitored for failure. Clusters have been known to fail because no one noticed when the first node crashed and failover occurred. Eventually, the second node crashed and the entire cluster went down.
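The failure described above is preventable with even a trivial check. Here is a minimal Python sketch that assumes some other mechanism has already collected each node's up or down state. The node names and alert wording are invented. The point is that a cluster reduced to its last working node is flagged before that node also fails.

```python
def cluster_alerts(node_up):
    """Return alert strings for a cluster, given {node_name: is_up}."""
    down = sorted(name for name, up in node_up.items() if not up)
    up_count = len(node_up) - len(down)
    alerts = [f"node {name} is down" for name in down]
    if up_count == 0:
        alerts.append("cluster is down")
    elif down and up_count == 1:
        alerts.append("cluster has no remaining failover capacity")
    return alerts


if __name__ == "__main__":
    # Hypothetical two-node cluster after the first node has crashed.
    for alert in cluster_alerts({"nodeA": False, "nodeB": True}):
        print(alert)
```

Whatever produces the alerts, they have to reach a person. A failover that nobody hears about is the first half of the outage.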
Your first step is to determine how much reliability is required in a facility.
A little can be cheap, perhaps even free, and is usually easy to implement.
Taking those easy steps will give you more reliability at a low cost. Ensuring higher levels of reliability will require real money, time, planning, and complexity; you should only take these steps if an expert deems them appropriate for your environment. All such measures should be implemented to the same degree on all aspects of the facility, in order to assure that your company's time and money are well spent.
About the author
Peter Baer Galvin is the chief technologist for Corporate Technologies, a systems integrator and VAR. Before that, Galvin was the systems manager for Brown University's computer science department. He has written articles for Byte and other magazines, and he previously wrote Pete's Wicked World, the security column for SunWorld. Galvin is the coauthor of Operating Systems Concepts and Applied Operating Systems Concepts. As a consultant and trainer, Galvin has taught tutorials on security and system administration and has given talks at many conferences and institutions.