Ten years ago, the WAN was the exclusive domain of frame-relay communication and leased lines. Today, a WAN may use anything from IPSec connections and cable modems to MPLS (multiprotocol label switching) tunnelled over multi-megabit networks. The methods may have changed, but the challenge remains the same: how do you make a WAN seem like one big LAN?
Simply throwing more bandwidth at the problem won't solve it. MPLS can go a long way towards improving WAN performance, but the root cause of the problem lies well below the MPLS level.
Other forces conspire to rob your WAN of performance and responsiveness: latency, congestion, chatty applications, and traffic contention all affect how the WAN responds at any given time. These are the dirty secrets of WAN performance that are usually
Size doesn't matter
In the world of the WAN, the size (that is, bandwidth) of the link often makes little difference in overall performance, particularly when the link is a long one ("long" being more than a few hundred kilometres). Part of the problem is that TCP and other protocols weren't intended to function beyond the local-network edge.
"The reason why long-distance networks don't work is that the protocols weren't designed to do that," says Dick Pierce, CEO of Orbital Data, which sells WAN-optimization appliances. "They work pretty well on a local basis, and in some cases even short distances. But wide-area networks don't. The whole history of how this market segment [WAN optimization] developed was on that basis."
The problem is that the protocols' efficiency suffers as latency increases. Latency is based on the speed of light and the overall length of the WAN link, something we have little control over. Don't think speed of light is a factor? Just experience the latency in a satellite link. (A few years back, one could have argued that routers and switches added significant latency to WAN links, but most backbone equipment today works in the sub-millisecond range.)
Latency affects network protocols in various ways. TCP, for example, uses ACK (acknowledgement) packets to help provide reliability. By receiving an ACK from the receiving endpoint, the sending system knows the packet made it without any errors. But on high-latency links, waiting for ACKs chokes throughput.
Thus, latency is one of the biggest -- if not the biggest -- killers of WAN performance, both in response time and overall throughput. Long fat networks (LFNs) run at T1 speeds and higher, but suffer greatly from the inherent latency of the link. In the US, for instance, the average round-trip time (RTT) for most terrestrial links is about 150 ms, with satellite links averaging about 800 ms. Global links vary greatly, but it isn't uncommon to see RTTs of 200 ms to 400 ms or higher. And increasing the bandwidth doesn't help.
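A back-of-the-envelope sketch shows why extra bandwidth does not help. A sender that must wait for ACKs can have at most one window of data in flight per round trip, so throughput is capped at the window size divided by the RTT, whatever the link speed. The 65,535-byte window below is an assumption (the classic TCP maximum without window scaling); the RTT values are the averages quoted above.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on TCP throughput: one full window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000.0)


WINDOW_BYTES = 65535  # assumed: largest TCP window without window scaling

for label, rtt_ms in [("US terrestrial", 150), ("satellite", 800)]:
    mbps = max_tcp_throughput_bps(WINDOW_BYTES, rtt_ms) / 1e6
    print(f"{label} ({rtt_ms} ms RTT): at most {mbps:.2f} Mb/s")
```

Under these assumptions the ceiling is roughly 3.5 Mb/s on the terrestrial link and under 1 Mb/s over satellite, no matter how fat the pipe is.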
In fact, due to latency, LFNs are largely underutilized. "The reason people built long-distance pipes that turned out to be empty was they were trying to get predictable application performance by overprovisioning," Orbital's Pierce says. "Yet the inherent design of the networks -- that they weren't designed for long distance -- was the problem."
Congestion also affects WAN performance, of course. Congestion occurs when no bandwidth-allocation policy has been applied to traffic on the WAN. Traffic flows can be bursty, such as when one user tries to retrieve a large e-mail attachment while another user logs in to a CRM portal. With no bandwidth management, the download can bring the smaller link to a grinding halt.
PG Narayanan, CEO of Allot Communications, believes that much of the congestion problem can be solved by applying QoS to the traffic. "The problem most of these networks have, though, is temporary . . . that second, or that minute it's congested, you can get away with just prioritizing applications. So what you can do is put a gigabit box at the central site to prioritize those applications, the critical applications, on a temporary basis, and you can avoid the congestion, and all other times you're OK anyway," Narayanan says.
Prioritizing application flows is an important part of managing your WAN traffic, but it isn't going to solve TCP's inherent limitations when latency creeps in. On shorter links where latency isn't an issue, simply pre-allocating your bandwidth will help keep important packets moving, regardless of what else is in the pipe. But on LFNs, latency, not congestion, is the culprit.
Talk, talk, talk
From the end-user point of view, latency gets less tolerable as the back-and-forth communication required for some action increases. And layer 7 protocols -- where applications live -- are chatty, requiring an absurd number of round-trips to complete a single task. Much like TCP, protocols such as CIFS and MAPI (messaging application programming interface) were designed to run inside the LAN, not over the WAN.
The chattiness reaches a crescendo when users map drive letters over the WAN using CIFS (used in Windows networks). Any user who has had to open, edit, and save a Microsoft Word or Excel document from a remote file server knows how long this simple task can take, even over a fat WAN connection. By the same token, users of Microsoft Outlook and Exchange 2000 suffer when they open an e-mail with an attachment over a WAN link. The message appears to be in their inbox, but in reality it is still on the server waiting to be retrieved.
Microsoft Exchange Server 2003 was designed to mask this problem by downloading messages and attachments in the background (cached Exchange mode). Although this is great for the end user, it adds traffic to the WAN. For example, Outlook now downloads all attachments to your inbox, regardless of whether you were ever going to open them. This places an additional, often unnecessary, load on the WAN link.
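The cost of chattiness can be sketched with simple arithmetic: the time to finish a task is roughly the number of round trips multiplied by the RTT, plus the time to push the bytes through the link. The round-trip count, file size, and link speeds below are illustrative assumptions, not measurements of CIFS or MAPI.

```python
def task_time_s(round_trips: int, rtt_ms: float,
                payload_bytes: int, link_bps: float) -> float:
    """Rough task time: time spent waiting on round trips plus raw transfer time."""
    waiting = round_trips * rtt_ms / 1000.0
    transfer = payload_bytes * 8 / link_bps
    return waiting + transfer


# Assumed scenario: opening a 1 MB document costs 1,000 request/response exchanges.
ROUND_TRIPS = 1000
FILE_BYTES = 1_000_000

lan = task_time_s(ROUND_TRIPS, rtt_ms=1, payload_bytes=FILE_BYTES, link_bps=100e6)
wan = task_time_s(ROUND_TRIPS, rtt_ms=150, payload_bytes=FILE_BYTES, link_bps=45e6)
print(f"LAN: {lan:.2f} s, WAN: {wan:.2f} s")
```

In this scenario nearly all of the WAN time is spent waiting on round trips, so only reducing the number of exchanges, not adding bandwidth, makes a real difference.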
Out with the old
Traditionally, WAN performance was attacked at the packet level. Back in 1998, Expand Networks was one of the leaders in WAN compression. Liad Ofek, vice president of technical services at Expand Networks, says that, at the time, the goal was to "squeeze as much data as possible" into existing links.
Expand used a series of compression algorithms to reduce the number of packets on the wire. Other vendors, most notably Packeteer, also used highly advanced compression schemes and began adding QoS to further allocate and manage WAN traffic flows.
File-caching provides yet another way to reduce traffic by storing a copy of recently accessed files on an appliance near requesting users. As with a browser cache, files and objects are kept closer to the remote user, helping to overcome latency and prevent excessive, redundant requests over the WAN. This is typically a "full file" cache and not made up of smaller data segments. Full-file caching isn't nearly as effective as newer segment-caching methods, because the chance of a second or third user requesting the same file is slim. Also, if the file on the file server is renamed or changed, then it won't match the file already in cache and must be transferred again anyway.
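The difference between full-file and segment caching can be illustrated with a minimal sketch. It keys each fixed-size segment by its hash, so a renamed or lightly edited file still hits the cache. The class name, the 4 KB segment size, and the use of SHA-256 are assumptions chosen for illustration; this is not any vendor's actual algorithm.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed segment size, for illustration only


class SegmentCache:
    """Toy segment cache: stores segments by content hash, not by file name."""

    def __init__(self) -> None:
        self._segments: dict[bytes, bytes] = {}

    def bytes_to_transfer(self, data: bytes) -> int:
        """Return how many bytes must cross the WAN; cache any new segments."""
        sent = 0
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            digest = hashlib.sha256(segment).digest()
            if digest not in self._segments:
                self._segments[digest] = segment
                sent += len(segment)
        return sent


cache = SegmentCache()
# 16 KB of test data in which every 4 KB segment is distinct.
original = b"".join(i.to_bytes(4, "big") for i in range(4096))
edited = original[:-1] + b"!"  # same file with its last byte changed

print(cache.bytes_to_transfer(original))  # first request: whole file
print(cache.bytes_to_transfer(original))  # "renamed" copy: nothing to send
print(cache.bytes_to_transfer(edited))    # edited copy: one segment only
```

Fixed-size segments are a simplification: inserting a byte near the start of a file shifts every later segment and defeats this naive scheme.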
In with the new
In recent years, TCP acceleration has taken centre stage as one way to improve performance by reducing ACKs and playing games with the TCP window size. Vendors such as Swan Labs, Peribit (now owned by Juniper Networks), Expand Networks, and Riverbed Technology have all developed solutions based on improving TCP's performance.
One of the most effective methods is to handle TCP ACKs locally, using an appliance. The appliance bundles multiple ACKs into a single request, thereby reducing the delays caused by high latency. The application requesting the data receives an ACK just as it expects to, except that the ACK comes from the local WAN appliance and not from the far side of the WAN.
The next step beyond TCP tricks is application-specific acceleration. Some WAN optimization vendors use plug-ins in their appliances to help improve application response. Applications such as DNS, Exchange, FTP, Citrix, Notes, and CIFS/NFS can all benefit from reduced chatter on the wire. The plug-ins work much like the TCP ACK optimization in that they handle redundant requests locally instead of sending each one.
There is no quilt
The WAN optimization and acceleration space is heading towards a convergence of sorts. In the past some vendors specialized in a single technology solution, but now they are adding other technologies to solve additional pieces of the WAN problem. Orbital's Pierce sees the multiple approaches to solving WAN problems as "patches, in the context of patches and a quilt. In the end, it's about the quilt; it's not about the patches themselves. Customers buy patches today because there is no quilt". The trend is for vendors to move away from "point" solutions to a more comprehensive managed system.
Several WAN appliances include compression and TCP acceleration along with file-caching and application-specific acceleration. But not all vendors agree that such consolidation is wise. "I think more customers are more worried about just the visibility into the network," says Allot's Narayanan. "They want a good traffic-management company with the ability to decode any application layer properly, not falsify it."
Other vendors, such as Swan Labs, Riverbed, Disksites, and Juniper Networks, are banking on single-box solutions. Tom Tansy, vice president of marketing at Swan Labs, sees a further consolidation of technology. He believes many customers are suffering from a "box proliferation problem" and will want to roll out a single appliance instead of many disparate solutions.
Either way, when it comes to speeding up WANs, everyone agrees that more bandwidth alone is not the answer. As long as TCP remains unchanged (and for now it has to) and the speed of light governs latency, boosting WAN performance will require tricks at the protocol level, combined with traffic-flow prioritization and application-specific packet reduction. WAN acceleration solutions will continue to evolve to include multiple techniques for getting the most out of your link, at least until we find a way to send data faster than the speed of light.
Far-flung file serving
WAFS (wide-area file services) appliances combine WAN optimization with file-caching in an effort to remove file servers from remote offices. Sometimes called wide-area file-server replacements, they also rely heavily on application-specific acceleration, especially for CIFS and MAPI.
The goal of WAFS is to provide for a completely serverless office. In the not so distant future, instead of deploying file and mail servers to remote offices, IT managers will simply ship a preconfigured rack-mount appliance, possibly preloaded with cached data, to each remote site as a file-server replacement.
One of the things that will separate the real WAFS players from the pretenders is how the appliance handles file-locking. For instance, a user at a remote office opens an Excel spreadsheet over the WAN. The WAFS appliance keeps a copy of the file in its local disk cache. While the user works on the sheet, all saves are done locally on the appliance and not sent over the WAN -- greatly improving application response. When the user closes the file, the WAFS appliance then sends the updated document to the file server.
During this editing process, the WAFS box -- if it does its job correctly -- will have issued a file-lock on the open file on the server, and then released the lock on the final save and close. This gets interesting when the WAN goes down while the file is still open.
The remote user will keep on working on the locally cached copy. But what about the locked file on the file server? What happens if another user opens the same file before the first user gets a chance to save his file? A good WAFS vendor should be able to answer such questions, but there's no perfect response.
One application that takes great advantage of WAFS's advanced caching is distributed backup. Whereas backups from remote offices to the datacenter used to take hours to complete, they now take mere minutes thanks to file-differencing algorithms in the WAFS appliances. So instead of sending the entire file set over the WAN, only the changes are sent, greatly reducing the time needed to complete the backup.
For WAFS to succeed, it has to smoothly marry all WAN acceleration techniques into a comprehensive solution. But where it really shines is in cache-differencing and local file access. On high-latency links, this alone makes it worth the cost of installation.
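The idea behind differencing can be sketched as follows: compare the new version of a file with the previously backed-up version block by block, ship only the blocks that differ, and patch the old copy at the datacenter. The function names and the 4 KB block size are hypothetical, and real appliances use more sophisticated algorithms than this.

```python
def changed_blocks(old: bytes, new: bytes, block_size: int = 4096):
    """Return (offset, block) pairs for every block of `new` that differs from `old`."""
    changes = []
    for offset in range(0, len(new), block_size):
        block = new[offset:offset + block_size]
        if old[offset:offset + block_size] != block:
            changes.append((offset, block))
    return changes


def apply_changes(old: bytes, changes, new_length: int) -> bytes:
    """Rebuild the new version from the old copy plus the shipped blocks."""
    # Trim or zero-pad the old copy to the new length, then overwrite changed blocks.
    rebuilt = bytearray(old[:new_length].ljust(new_length, b"\0"))
    for offset, block in changes:
        rebuilt[offset:offset + len(block)] = block
    return bytes(rebuilt)


# Example: a 10,000-byte file with one byte edited and three bytes appended.
old_version = bytes(10_000)
new_version = bytearray(old_version)
new_version[5000] = 1
new_version = bytes(new_version) + b"xyz"

delta = changed_blocks(old_version, new_version)
shipped = sum(len(block) for _, block in delta)
assert apply_changes(old_version, delta, len(new_version)) == new_version
print(f"shipped {shipped} of {len(new_version)} bytes")
```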
Some WAN things you just can't control
In the quest for ultimate performance, IT folks tweak TCP stacks, strip out unnecessary services, and manage traffic flows. But where the WAN is concerned, some things are in the hands of the ISP rather than IT. That's a sad fact of life: as link speeds and round-trip time (latency) increase, overall throughput degrades tremendously if router queues aren't sized accordingly.
Work by researchers Curtis Villamizar and Cheng Song has found that TCP performance does not depend on link speed alone, but on many factors such as the product of the link speed and round-trip time, also known as the BDP (bandwidth-delay product). An interesting application of BDP is in determining the proper queue size (in packets) for a backbone router. This value is found by dividing the BDP, expressed in bits, by the MTU (maximum transmission unit) size converted to bits.
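As a worked sketch of that calculation: the BDP in bits is the link speed multiplied by the RTT, and the queue size in packets is the BDP divided by the MTU expressed in bits. The link speeds and the 150 ms RTT below are example inputs, and rounding up to a whole packet is an assumption.

```python
import math


def bdp_bits(link_bps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product in bits: link speed times round-trip time."""
    return link_bps * rtt_ms / 1000.0


def queue_size_packets(link_bps: float, rtt_ms: float, mtu_bytes: int = 1500) -> int:
    """Queue size in packets: BDP (bits) divided by MTU (bits), rounded up."""
    return math.ceil(bdp_bits(link_bps, rtt_ms) / (mtu_bytes * 8))


for label, bps in [("45 Mb/s", 45_000_000), ("1 Gb/s", 1_000_000_000)]:
    print(f"{label} at 150 ms RTT: {queue_size_packets(bps, 150)} packets")
```

With a 1,500-byte MTU, this gives 563 packets for the 45 Mb/s link and 12,500 packets for the gigabit link at the same RTT, which is why a single preset buffer size cannot suit every link.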
Many routers come preset with buffer levels that have not been set for optimum efficiency. If the buffer size is too small, the risk of congestion and TCP retransmission arises -- a major cause of performance drop off. Too large a buffer and latency is increased. Both situations cause TCP to slow down, greatly reducing a WAN link's efficiency and throughput. (Like Goldilocks said: not too big, not too small, but just right.)
Ethernet's default MTU of 1,500 bytes is fine for speeds of 10Mb/s or even 100Mb/s, but orders of magnitude too small when you reach gigabit and greater speeds. One way to help improve your WAN performance is to use a larger MTU size. But again, you have no control over the MTU settings used by your provider. If you use jumbo frames and your ISP uses standard sizing, your packets will be fragmented and repackaged at the smaller value as determined by the router, increasing the load on the router and further degrading performance.
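The fragmentation cost can be estimated with a short sketch. It assumes IPv4 with a 20-byte header and no options, and that the router is allowed to fragment; the 9,000-byte jumbo size is a common value used here only as an example.

```python
import math

IPV4_HEADER_BYTES = 20  # assumed: IPv4 header without options


def fragment_count(packet_bytes: int, path_mtu: int) -> int:
    """Number of IPv4 fragments needed to carry one packet across a smaller MTU."""
    payload = packet_bytes - IPV4_HEADER_BYTES
    # Every fragment repeats the IP header, and fragment payloads
    # (except the last) must be a multiple of 8 bytes.
    per_fragment = (path_mtu - IPV4_HEADER_BYTES) // 8 * 8
    return math.ceil(payload / per_fragment)


print(fragment_count(9000, 1500))  # jumbo packet crossing a standard-MTU hop
print(fragment_count(1500, 1500))  # standard packet: no fragmentation needed
```

Under these assumptions a single 9,000-byte packet becomes seven fragments, each of which the router must build and forward separately.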
Problem is, unless you maintain your own backbone, there's little chance that your provider will tweak their routers to suit your link speed and latency. While you could beg and plead with your provider to modify their queue size for your deployment, because your traffic is mixed in with other traffic they manage, optimal settings for your installation may adversely affect another user. So you end up with a semi-arbitrary queue size that falls somewhere in the middle.
New to WANs? This grab bag of terms will help avoid confusion.
- ACK: Acknowledgement packet used by TCP/IP to let the sending system know the packet arrived intact
- Bandwidth-delay product: A number expressed in either bytes or packets that is the bandwidth multiplied by the latency; used to determine proper router queue sizing, with BDP = RTT * C, where RTT is the average round-trip time and C is the bandwidth of the link in kilobits per second
- CIFS: Common Internet File System, the file-sharing protocol for Windows networks
- Latency: The time, usually measured in milliseconds, that a packet takes to travel from one end of the WAN to the other
- MPLS: Multiprotocol Label Switching, a standard for routing traffic over an IP network so that all packets follow the same path
- MTU: Maximum transmission unit, the largest size of a packet or frame
- QoS: Quality of Service, a method of defining a certain level of performance for distinct IP traffic flows
- RTT: Average round-trip time, expressed in milliseconds