FRAMINGHAM (07/03/2000) - A clustering configuration is like a Hummer. In both cases, the hardware has to be bulletproof, the software reliable and the combination robust. But in neither case should you operate this heavy machinery unless you absolutely know what you are doing.
Clustering links individual servers physically, and coordinates communication so they can perform common tasks. Should a server stop functioning, a process called failover automatically shifts its workload to another server, providing continuous service. In addition to failover, clustering can also offer load balancing that enables the workload to be distributed across a network of linked servers.
Microsoft Corp. partners with several hardware manufacturers to offer turnkey cluster server systems based on Windows 2000 Advanced Server's high-availability features. We used a Compaq Computer Corp. cluster for our tests.
The hardware, software and cabling were pre-installed by Compaq technicians and then shipped to West Virginia University's Advanced Network Applications Lab.
Once on site, Compaq engineers set up and configured the cluster server.
The purpose of our test was to determine the ease of operation, reliability and manageability of this cluster server package. We tested rolling upgrades, network load balancing, failure detection/recovery and failover/failback.
There were things we really liked about Win 2000 Cluster Services. The software was easy to install and made adding or removing servers from the cluster an easy task. The management console was intuitive and offered good information on the state of the cluster. However, we found some areas where improvement is definitely needed. Many failover and restore operations require manual intervention. Alerts concerning failed hardware are not automatically presented to the administrator. And when a drive failure occurred, there were too many dialog boxes to dismiss before service could be restored.
The unit tested was a sizable seven-foot, 19-inch rack-mounted system with six ProLiant 1850R servers and a separate 1850R running as a domain controller.
Each 1850R had a 600-MHz Pentium III processor. Three of the ProLiants were running as individual Web servers.
The rack was large but handsome, painted cream with a rounded see-through front door that lets you view each system's status lights. The front and rear doors have locks to keep unauthorized hands out.
Inside the rear door, the wiring was strapped to individual fold-out arms.
These arms normally are folded over and latched, so the wiring is protected and out of harm's way. However, when you want to work on the back of any server, you need to move the latch up and pull out the arm.
The cluster server uses what Compaq refers to as the Distributed Internet Server Array (DISA). The DISA architecture consists of a core application stack that includes load balancing, application services such as Web servers, data resources, security and management.
Under the clustering infrastructure used by Windows 2000, the clients access the cluster services through a series of IP-based servers that handle Network Load Balancing (NLB). The NLB software directs the client to a server in the cluster that can accept the session. This prevents any one server from being overloaded by client requests or sessions.
Win 2000 uses a shared-nothing environment where each cluster node has its own memory and disk storage. At any instant, only one node is managing each disk.
Nodes communicate across a common link that is separate from the connection used to reach the RAID or mirrored arrays. If a server fails to respond to the heartbeats generated by another node and sent across this link, the shared-nothing architecture automatically transfers ownership of resources - such as disk drives and IP addresses - from the failed server to another server.
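The heartbeat-and-failover logic can be sketched as a simple timeout monitor. This toy Python model (the names and the five-second timeout are our own assumptions, not Microsoft's implementation) declares a node dead when its heartbeats stop and hands its resources to a survivor:

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a node is declared failed (our assumption)

class Node:
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.monotonic()
        self.resources = []            # disks and IP addresses this node owns (shared-nothing)

    def beat(self):
        self.last_heartbeat = time.monotonic()

def detect_and_failover(nodes):
    """Move resources off any node whose heartbeat has gone silent."""
    now = time.monotonic()
    alive = [n for n in nodes if now - n.last_heartbeat < HEARTBEAT_TIMEOUT]
    dead = [n for n in nodes if n not in alive]
    for failed in dead:
        if alive and failed.resources:
            survivor = alive[0]        # simplest policy: first surviving node takes over
            survivor.resources += failed.resources
            failed.resources = []
    return dead

# Example: node2 falls silent, so node1 inherits its disk and IP address.
node1, node2 = Node("node1"), Node("node2")
node2.resources = ["quorum-disk", "192.168.1.10"]
node2.last_heartbeat -= 10             # simulate 10 seconds without a heartbeat
failed = detect_and_failover([node1, node2])
```

The real cluster service is far more involved - it arbitrates ownership through the quorum disk rather than a first-survivor rule - but the timeout-then-transfer shape is the same.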
One of the biggest problems facing server administrators is how to manage and schedule maintenance - particularly when this involves upgrades to the system that result in server downtime.
One of the more interesting clustering features of Win 2000 is the ability to perform a rolling upgrade, which lets cluster nodes be upgraded one node at a time so that the services and resources offered by the cluster remain continuously available. This allows administrators to apply a new service pack or upgrade the operating system without disrupting services to users.
We simulated a rolling upgrade by doing a complete reinstallation of Win 2000 and SQL Server 7.0 on one of the cluster nodes. We found that some preliminary steps were needed to prepare the cluster for this procedure. The node being upgraded could be removed from the cluster only after its cluster service had been stopped. Once we did this, we were able to evict the node from the cluster. We then formatted the drive and reinstalled the operating system and cluster services on the "upgraded" node. Finally, by using the Add/Remove Programs control panel to add the clustering application, we got the rebuilt node to join the existing cluster. Throughout the upgrade, user access to the cluster server remained fully operational.
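The rolling-upgrade procedure we followed amounts to a simple loop: remove a node, rebuild it, rejoin it. This Python sketch (the function and node names are ours, purely illustrative) shows why the cluster keeps serving throughout - at every step, the other nodes remain members:

```python
def rolling_upgrade(cluster, rebuild):
    """Upgrade a cluster one node at a time so service is never interrupted.

    A toy model of the Win 2000 procedure: stop the cluster service on a
    node, evict it, rebuild it, then rejoin it to the cluster.
    """
    for node in list(cluster):         # snapshot, since the list mutates below
        cluster.remove(node)           # stop the cluster service, then evict the node
        rebuild(node)                  # reinstall the OS, SQL and cluster services
        cluster.append(node)           # the rebuilt node rejoins the cluster

cluster = ["node1", "node2", "node3"]
rebuilt = []
rolling_upgrade(cluster, rebuilt.append)
```

At no point does the cluster list go empty, which is the property that keeps users connected during the upgrade.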
While the operating system upgrade went smoothly, we encountered some difficulties when we attempted to reinstall SQL Server 7.0. The installer insisted that it needed two files it could not locate. Ironically, both files (kernel32.dll and advapi32.dll) were already on the installation CD. We then ran the SQL 7.0 failover wizard, which sets up how SQL is to be handled in the event of a system failure. After a few false starts figuring out how to assign resource ownership for the shared drives and services, we were able to get the SQL group created and back into the cluster.
Network Load Balancing
Microsoft's NLB software distributes IP traffic to multiple servers, each running within the cluster. NLB creates a single virtual IP network address for all the servers operating in the cluster. From the client's point of view, the cluster appears to be a single server. In theory, each client's request is distributed among the various Web servers.
However, every time we went to the virtual IP address of the NLB cluster, we wound up on the same Web server. This occurred even when we hit the virtual IP address from several Web browsers simultaneously.
Initially, we thought we had encountered a problem with NLB, but it turned out the Web browser was caching the page locally even though we had told it to check for new versions with every visit to the page.
We finally got around this by using Telnet to go to the virtual IP address.
That produced a random distribution of hits to each of the three NLB servers and proved that NLB was indeed working properly.
Failure Detection & Recovery
One of the goals of our testing was to determine whether Win 2000 could recover from failures we generated in both the software and the hardware of the cluster server. Mission-critical applications and data should never be offline for more than a minute. Failures should trigger recovery processes, automatically restarting applications or entire server workloads on a surviving machine in the cluster. This process, from detection through recovery, should typically take no more than a minute or two. In our testing, we found Win 2000 could take considerably longer to recover (up to five minutes, depending on the failure).

We tested this capability by destroying, deleting or renaming files in the Windows NT system folder and tampering with the registry. We were impressed by Win 2000's ability to rebuild itself from any damage we tried to inflict on the cluster. Understandably, we could cause failures by manually stopping certain services, such as the cluster services. However, we found it hard to do any serious file damage because Windows File Protection prevented the replacement of essential system files.
We then simulated failures in both the mirrored and RAID arrays. First, we failed the mirror array by pulling out one of the mirrored drives. Although the system continued to function normally, we were disappointed that no drive failure warning was generated and displayed on the console screen.
We found a similar situation with the RAID array. When we pulled out one of the drives in the RAID array, the system stayed up and continued to function normally. However, again there was no notification from Advanced Server that a failure had occurred.
Although the system provided for our testing had redundant drives and servers, they were all connected through a single Fibre Channel hub. This is a common oversight and a potentially serious one because it creates a single point of failure. And fail it did - right out of the box, so to speak. As soon as it was installed and turned on, the Fibre Channel hub went belly up, rendering the entire cluster server useless until a replacement hub could be shipped in and installed by a Compaq technician the next day. This failure underscored how critical redundancy is within a cluster server. When it comes to mission-critical applications, nothing in the system should ever be a single point of failure.
Compaq includes an application in this bundle called the Compaq Array Utility (CAU), which lets you examine the mirrored drives or the RAID array in either their logical or physical configurations. Running the CAU correctly pointed out the failed drives in both the mirrored array and the RAID array.
However, the CAU is not normally run continuously and therefore has no way to automatically alert the user when a drive failure occurs. The Windows Event Viewer also indicates failed drives, but as in the case of the CAU you must specifically open the Event Viewer to see the failure.
Because these tools are not normally active, they create a potentially dangerous situation. Should a drive failure occur in either the mirrored drives or in a drive in the RAID cluster, the server administrator is not likely to be aware of it. A second failure in either a mirrored drive or a second drive in the RAID array will cause everything to fail. In our system, simulating a second drive failure did indeed bring everything to a screeching halt.
Although SNMP support did not come preinstalled on our system, you can install and enable SNMP traps on the drives and other hardware in the system. Although we did not test this, SNMP should trap a drive failure and report it to an SNMP management console.
Failover and failback
Win 2000 provides a flexible system in which you can declare a single node, multiple nodes or no node as the preferred owner of each cluster service. The preferred owner designates which cluster node controls a given service, such as SQL or Cluster Services. You set this on a per-service basis and can also manually move disk services - such as cluster groups and SQL - between nodes at will. In addition, you can set how you want failback to occur: immediately, or at a specified whole hour. We were disappointed that failback had to be set in one-hour increments; it was not possible, for example, to set failback to occur exactly 30 minutes or one hour after restoration of the failed system. (Restoration occurs by either repairing the problem or replacing hardware. Failback occurs when something that was previously offline returns to service.)

We tested failover and failback by setting Cluster Node 2 as the preferred owner of both SQL and Cluster Services, with failback set to immediate for both. After doing this, we verified that we could manually move SQL and Cluster Services between Cluster Node 2 and Cluster Node 1. After moving both services back to Cluster Node 2, we failed it by pulling its Fibre Channel connection, which disconnected Cluster Node 2 from the disk arrays. SQL and the cluster services automatically moved to Cluster Node 1, as expected.
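The preferred-owner and failback settings can be modeled in a few lines of Python. This sketch (the class and method names are ours, purely illustrative) shows immediate failback returning a group to its preferred owner once that node is restored:

```python
class Group:
    """A cluster resource group (e.g. SQL) with a preferred owner and a
    failback policy, loosely modeled on the Win 2000 settings we used."""

    def __init__(self, name, preferred_owner, failback="immediate"):
        self.name = name
        self.preferred_owner = preferred_owner
        self.failback = failback             # "immediate" or a whole hour (0-23)
        self.current_owner = preferred_owner

    def fail_over(self, survivor):
        """The owning node failed; a surviving node takes the group."""
        self.current_owner = survivor

    def node_restored(self, node, hour=None):
        """Return the group to its preferred owner when the policy allows."""
        if node == self.preferred_owner and (
                self.failback == "immediate" or self.failback == hour):
            self.current_owner = node

# Cluster Node 2 is the preferred owner with immediate failback, as in our test.
sql = Group("SQL", preferred_owner="node2", failback="immediate")
sql.fail_over("node1")        # node2 loses its Fibre Channel connection
sql.node_restored("node2")    # connection restored: SQL fails back at once
```

The one-hour granularity we complained about falls out of the model: the policy can match a wall-clock hour, but nothing finer.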
Next, we restored Cluster Node 2 by plugging the Fibre Channel connection back in. Before we could bring Cluster Node 2 back online, however, we had to clear a large number of dialog boxes (eight to 10) and manually restart the cluster services. At this point, SQL automatically moved back to Cluster Node 2, but Cluster Services remained on Cluster Node 1. When we tried to move Cluster Services back to Cluster Node 2, they went offline and then came back online on Cluster Node 1 once again. In other words, Cluster Services could no longer be moved, either manually or automatically. Any attempt to examine the cluster using the CAU also failed because ownership was now split between Cluster Node 1 (Cluster Services) and Cluster Node 2 (SQL). When we then failed Cluster Node 1 by pulling its Fibre Channel connection, we immediately received a message that the cluster database "was no longer available."
In other words, we had experienced a complete and total failure that rendered the entire system inoperable.
The solution to this situation requires a fair amount of manual intervention.
Because its Fibre Channel connection had been pulled, Cluster Node 2 could no longer see the disk array even after the connection was restored; a restored node does not automatically rescan its disks, as you might expect. The workaround is to rescan the disks manually by going to Computer Management: Storage: Disk Management and selecting 'Rescan Disks.' Because a reboot forces a rescan of the disks, rebooting the previously failed cluster node also works. A manual rescan takes only about 11 seconds, however, so it is much faster to rescan than to reboot the node.
As we mentioned earlier, the CAU is used for installing, configuring and testing the RAID array and mirrored drives. Our evaluation system had two services, SQL and cluster services. It is normal procedure to balance the various services between cluster nodes. However, when we did this, any attempt to run the CAU failed.
We asked both Microsoft and Compaq to address this problem. Compaq pointed to Microsoft's shared-nothing environment as the root of the CAU failure. Shared-nothing requires all disk services to run on a single cluster node, with the other node running in hot standby. If you expect the CAU to function, you cannot balance services between cluster nodes.
At last report, Microsoft and Compaq engineers were working to resolve this issue. In the meantime, the work-around is to move all disk-based services to the same cluster node prior to running the CAU. You can then do whatever disk administration or maintenance is necessary with the CAU. When you are done, quit the CAU and redistribute the load. Microsoft claims running the CAU is not an everyday occurrence, but does admit this manual process is a "necessary evil" of the shared-nothing environment.
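The workaround amounts to a consolidate-run-redistribute routine. This Python sketch (the names are ours, not Compaq's or Microsoft's) parks every disk-based group on one node, runs the maintenance step, then restores the original balance:

```python
def run_cau_safely(ownership, maintenance_node, run_cau):
    """Work around the shared-nothing/CAU conflict.

    ownership maps each disk-based group to the node that owns it. Park
    every group on one node, run the array utility, then put ownership
    back the way it was.
    """
    original = dict(ownership)                  # remember the normal balance
    for group in ownership:
        ownership[group] = maintenance_node     # consolidate on one node
    run_cau()                                   # safe to run the CAU now
    ownership.update(original)                  # redistribute the load

ownership = {"SQL": "node2", "Cluster Group": "node1"}
during = {}
run_cau_safely(ownership, "node1", lambda: during.update(ownership))
```

During the maintenance callback, every group sits on node1; afterward the original split is restored - exactly the manual procedure the vendors describe.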
Win 2000 ships with a "Getting Started" manual that covers what's new in Windows 2000 Advanced Server. It also contains notes on planning an installation, running the setup utility, upgrading and installing on cluster nodes, system recovery and troubleshooting. More information is needed to run a system this complex. Fortunately, the Microsoft Deployment Planning Guide has detailed information on deploying server clusters and load balancing and is available on Microsoft's Web site.
Although the package is flawed in places, Microsoft and Compaq have made a decent start at providing a turnkey solution aimed at running nonstop, mission-critical Web, application and database servers. Windows 2000 Advanced Server appears to be the Hummer of Windows operating systems, and our tests showed it can compete with established clustered server environments such as UNIX, Linux and NetWare.
Jeffrey Fritz serves as the principal network engineer for West Virginia University, where he is responsible for the development of advanced networking technology. He is the author of Remote LAN Access: A Guide for Networkers and the Rest of Us. He can be reached at firstname.lastname@example.org. The author would like to thank Ed Norman, Floyd Roberts and Matt Glotfelty of West Virginia University's Network Services Department for their assistance with this review.