A truly reliable network has to handle both the volume of traffic and the failure of equipment. The unfortunate truth is that you can have redundancy or load-balanced capacity, but rarely both at once.
Here is a piece of marketing puffery that has caught out a lot of experienced people:
“Load balancing with two links gives you twice the capacity and if one of them fails you're still working because they are redundant”.
The problem with this statement is that one solution is typically used to solve two problems, but the circumstances behind the problems are almost mutually exclusive.
The mutual exclusivity is easily demonstrated by considering the simplest example: adding a second link to an existing network with one link, and using a load balancer to address both capacity and reliability issues by sharing work across the two resources.
If we are truly running out of capacity then running on the single link won’t work unless we shed load in the case of failure.
The situation is complicated by the characteristics of load balancing.
On the capacity side rarely do one and one equal two. There is always an overhead.
One particularly nasty and illustrative example was a load balancer for ISDN that I used 15 years ago. This device would push packets alternately down each of the B channels; the cost was paid by the routers at the source and destination, which had to buffer and re-sequence the streams. The end routers needed both more memory and faster CPUs because of the load balancer’s strategy.
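To make the cost of re-sequencing concrete, here is a minimal sketch of what a receiving router has to do when packets sent alternately down two channels arrive out of order. It is an illustration only, not how that ISDN device or its routers were implemented; the function name `resequence` and the use of plain integers as sequence numbers are assumptions made for the example.

```python
import heapq


def resequence(arrival_order):
    """Deliver packets in sequence order, buffering any that arrive early.

    arrival_order: sequence numbers (starting at 0) in the order they arrive.
    Returns (delivered, peak_buffered), where peak_buffered is the largest
    number of packets held back while waiting for a gap to fill.
    """
    pending = []          # min-heap of packets that arrived ahead of their turn
    next_expected = 0     # next sequence number that may be passed on
    delivered = []
    peak_buffered = 0
    for seq in arrival_order:
        heapq.heappush(pending, seq)
        # Release every packet that is now in order.
        while pending and pending[0] == next_expected:
            delivered.append(heapq.heappop(pending))
            next_expected += 1
        peak_buffered = max(peak_buffered, len(pending))
    return delivered, peak_buffered
```

Every packet held in `pending` is memory the router must provide, and every heap operation is CPU time it must spend, which is where the extra cost in the end routers comes from.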
On the reliability side, if you have two links and one has failed, all the traffic has to fit down the surviving link. You can only count on the non-failed resources.
Unfortunately, IP protocols degrade rapidly as congestion increases, so the naive inference that the network should keep working as long as normal traffic is less than the total capacity of the load-balanced links minus the failed link is dangerously false. Growing Internet demand only exacerbates the problem by increasing to fill the link capacity: both roads and Internet links tend to suffer from induced demand.
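The failure case can be sketched as a simple check. The function below is hypothetical, and the 70% utilisation ceiling is an assumed figure chosen purely for illustration; the right headroom depends on your traffic and should be measured, not taken from this example.

```python
def survives_failure(link_capacities_mbps, peak_traffic_mbps, max_utilisation=0.7):
    """Return True if peak traffic still fits after the largest link fails.

    max_utilisation is an assumed safety margin: the surviving links are
    treated as full once they reach this fraction of their raw capacity,
    because performance degrades well before a link is 100% loaded.
    """
    if len(link_capacities_mbps) < 2:
        return False  # no redundancy at all
    # Worst case: the largest link is the one that fails.
    surviving = sum(link_capacities_mbps) - max(link_capacities_mbps)
    return peak_traffic_mbps <= surviving * max_utilisation
```

With two 100 Mbit/s links, `survives_failure([100, 100], 90)` returns `False` even though 90 is less than the 100 Mbit/s that remains, which is exactly the gap the naive inference ignores.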
When considering network requirements it is necessary to assess capacity separately from the reliability provided by redundancy. That capacity can be considered to consist of essential traffic, which must work in a crisis, and optional traffic, which we can shed when things are broken. It is critical that the business agrees on the classification. After designing your capacity and factoring in growth, redundancy is added to the system to provide the reliability required. In the simple case, this typically nearly doubles the cost of the network required. If the cost is too high there are only three ways to proceed: revisit the classification of the traffic, alter the reliability requirement, or think outside the box.
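For the simple two-link case, the arithmetic can be sketched as follows. This is a rough model under stated assumptions: two identical links, compound annual growth, and no allowance for load-balancing overhead or congestion headroom. The function and parameter names are invented for the example.

```python
def per_link_capacity(essential_mbps, optional_mbps, shed_optional_on_failure,
                      annual_growth=0.0, years=0):
    """Capacity each of two identical links needs, in Mbit/s.

    Normal operation: both links together must carry all traffic.
    Failure: one link alone must carry whatever we refuse to shed.
    """
    growth = (1.0 + annual_growth) ** years
    essential = essential_mbps * growth
    total = (essential_mbps + optional_mbps) * growth
    if shed_optional_on_failure:
        # One link must carry the essential traffic on its own,
        # and two links must carry everything between them.
        return max(total / 2, essential)
    # Nothing may be shed, so each link must carry the full load alone.
    return total
```

With 40 Mbit/s of essential and 60 Mbit/s of optional traffic, refusing to shed anything needs two 100 Mbit/s links, double the capacity in normal use. Agreeing to shed the optional traffic brings that down to two 50 Mbit/s links, which is why the classification matters so much to the cost.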
Outside-the-box solutions include moving equipment to the data rather than the data to the equipment. This could lead to positioning equipment in datacenters for lower-cost bandwidth or developing a disaster recovery site for critical services. Both alternatives may have long-term advantages over just increasing link capacity. Other outside-the-box solutions include reducing data transfer requirements through compression or caching.
Maurice Castro is a member of SAGE-AU, a not-for-profit professional organisation for IT operations staff and system administrators. For more information go to http://www.sage-au.org.au/