Always on

During the 1980s and early 90s a company called Tandem built servers that had every component duplicated. It was expensive and required complex technology but the company claimed it would run at six 9s (99.9999 per cent) availability and you could fire a gun through it without it failing.

Six 9s availability translates to a downtime of only about 32 seconds a year - something that a lot of companies are still striving to achieve.
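
The conversion from an availability percentage to a downtime budget is simple arithmetic. The sketch below assumes a 365-day year and treats a month as one twelfth of a year; the function name is illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored


def downtime_seconds(availability_pct: float,
                     period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Permitted downtime, in seconds, over a period at a given availability."""
    return (1 - availability_pct / 100) * period_seconds


# Three 9s through six 9s, per year and per month (month = year / 12).
for pct in (99.9, 99.99, 99.999, 99.9999):
    per_year = downtime_seconds(pct)
    per_month = per_year / 12
    print(f"{pct}% -> {per_year:,.1f} s/year, {per_month:,.1f} s/month")
```

On these assumptions six 9s allows about 31.5 seconds a year, five 9s about 5.3 minutes a year (roughly 26 seconds a month) and four 9s about 52.6 minutes a year (roughly 4.4 minutes a month).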

Tandem no longer exists, having been taken over by Compaq, and companies no longer go to such extremes of duplication within the same box. The old machines were incredibly fault-tolerant, but the trend has been away from that design because it relied on complex proprietary technology. Nowadays companies rely instead on multiple, low-cost servers. Intel recently dubbed the approach 'macro computing', which gives the impression that, having come up with the name, it invented the idea. It didn't, but it is promoting the idea because downtime is expensive and can cost companies in many different ways.

Often the biggest cost is not measurable in direct monetary loss, but in customer dissatisfaction. For a bank, the loss could be multiplied by an unhappy customer taking his or her entire family away; and in organisations such as Queensland Health it could even become a life-threatening issue.

System availability depends on two key factors - the mean time between failures (MTBF) of any component or system and the mean time to repair (MTTR) it.
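
The usual steady-state relationship between these two figures is availability = MTBF / (MTBF + MTTR). The sketch below applies it to made-up numbers; the 10,000-hour and 4-hour values are assumptions for illustration, not figures from any vendor.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
base = availability(mtbf_hours=10_000, mttr_hours=4)
longer_mtbf = availability(mtbf_hours=20_000, mttr_hours=4)    # fail half as often
faster_repair = availability(mtbf_hours=10_000, mttr_hours=2)  # repair twice as fast

print(f"base:          {base:.6%}")
print(f"longer MTBF:   {longer_mtbf:.6%}")
print(f"faster repair: {faster_repair:.6%}")
```

Only the ratio of the two numbers matters, so doubling the time between failures and halving the repair time give the same improvement. Which is cheaper to achieve depends on the system.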

Hostworks managing director Marty Gauvin says that given those two numbers you can work out a percentage availability. "If I want to achieve 99.999 per cent availability it means I need fewer than four or five minutes a month of downtime. That could be one outage of four minutes or eight of 30 seconds.

"First, I am trying to increase my mean time between failure and secondly, I am attempting to reduce my mean time to repair.

"Two terms that are used almost interchangeably in work on availability are fault tolerance and redundancy. Fault tolerance simply means that the system is tolerant to a fault so a piece can fail and nothing goes wrong. Redundancy means that you have extra versions or fractions of versions of a system, but it does not necessarily imply that the failure is invisible," Gauvin says.

"What we aim for is fault tolerance across all layers, and our system is very much built upon layers. Each layer is fault tolerant so you can walk through our data centre turning off boxes, and providing you do not turn off two of the same, nothing will happen. We then have a level of redundancy that is appropriate to the rate of failure of that equipment.

"The main things that are left for us in terms of fault tolerance are the kinds of ways of implementing it because some of them are quite expensive.

"There are still fairly big parts of our system where it can only be done by having a main working server that is doing all the work and an identical one sitting next to it that is doing nothing until the main one fails. That is inefficient. It is better if I can take a site that is spread over many Web servers at the same time. All of them are working but if any of them fail the remainder pick up the load of the failed one."
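
The efficiency argument can be put in numbers. The sketch below is a simplified model, assuming identical servers and evenly spread load; it shows how hard each server can be worked while still leaving the survivors enough capacity to absorb a failure. The function name is hypothetical.

```python
def max_safe_utilisation(servers: int, failures_tolerated: int = 1) -> float:
    """Highest average load per server (as a fraction of its capacity) at which
    the remaining servers can still carry everything after the given number
    of failures. Assumes identical servers and evenly balanced load."""
    if not 0 <= failures_tolerated < servers:
        raise ValueError("failures_tolerated must be between 0 and servers - 1")
    return (servers - failures_tolerated) / servers


# A main server plus an idle standby is equivalent to the two-server case.
for n in (2, 5, 10):
    print(f"{n} servers: each can run at up to {max_safe_utilisation(n):.0%}")
```

With a main server and an idle standby, half the hardware is held in reserve. In a pool of ten load-sharing servers that must survive one failure, only a tenth is.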

Because Hostworks hosts some of Australia's popular Web sites, availability can be critical and Gauvin has a complex formula for calculating downtime.

"You need a statistical analysis of the performance of a system, so you model it discretely. You work out how each hard drive is going to perform and its rate of failure, then you do the same with the memory chip and the other components, put them all together into a server and combine them mathematically to decide what its availability should be. Then add on top of that the maintenance agreement, which may modify the figure, then you put lots of servers together. Your estimate of downtime is gradually built up over some relatively laborious maths to build a theoretical model of the availability of the system."

Internal service-level agreements can play a key role in availability but they can also be expensive. The higher the service level, the higher the maintenance price, but Gauvin has managed to lower his reliance on service-level agreements by having critical components serviced by multiple suppliers.

"For example, with our primary and secondary level of network connectivity, one has a service level of 99.75 and the other has a level of 99.8. If I had only one of them, that level would be unacceptable for the kind of business I am running, but by having both of them we actually achieve about 99.995. So I am taking the service level they have given me and building it up by bringing in other vendors and overlapping."

At the other end of the scale, pharmaceutical company Eli Lilly doesn't require anywhere near five 9s. In fact, depending on the time of the month, its system can be down for a day before it starts to cause major problems.

"We don't have the same downtime stress that other organisation have because our system is only working during normal business hours," says the company's associate director of IT services David Christie.

"The cost of downtime varies during the month. We have times during the month when it would be terrible if we could not ship orders, but at other times of the month if something goes down for a day it is not a big issue because we can catch it up later."
