Can one rogue switch buckle AT&T's ATM network?

The answer is yes.

Earlier this week a Lucent Technologies Inc. CBX 500 Multiservice WAN Switch started a network management message firestorm that overloaded 7 percent of all switches on AT&T Corp.'s ATM network for about 4 hours.

"AT&T has a lot of ATM customers and when 7 percent of the network is affected most if not all see some incidental impact," says Dale McHenry, product vice president for data services at the carrier. While some users experienced network slowdowns, others were completely shut out.

According to one user who asked to remain anonymous, the network failure was significant. "The network was hosed," she says. The user's company was forced to shut down all of its ATM interfaces because of problems sending Open Shortest Path First traffic over ATM.

"We had an extreme number of network management messages coming from one switch," McHenry says. This CBX 500 is one of the larger switches AT&T has deployed at its network management center in New Jersey, so it carries a heavy load of network management traffic.

The SONET ring the switch was connected to experienced a fiber cut early the same day, McHenry says. AT&T believes this was one of the events that triggered the switch to malfunction, but it may not have been the root cause.

The switch started sending out messages notifying the other ATM switches on the network that trunks were available and then unavailable, over and over, he explains. This is called a "thrashing SONET ring." The switch eventually exhausted its CPU power and memory and took itself out of commission. The other ATM switches on the network then tried to reroute the first switch's traffic and became overloaded in turn.

AT&T has spent a lot of time modeling this type of outage and the company has contingency plans in place, McHenry says. The first step the carrier took was to remove several redundant trunks from the network. This simplifies the network so the switches that are still functioning are monitoring fewer trunks. "We have a plan on the shelf with safe trunk routes identified," he says.

AT&T reestablished the majority of its switches within the first couple of hours and the last one by the 4-hour mark of the outage.

Communications from AT&T after the outage didn't please everyone.

"Our company [network operations center] did not receive the AT&T all-clear call until 6:30 a.m. [the following day]," one customer says. "It was likely that the network was restored earlier, but from a customer standpoint, we were not back on their network until morning."

All of AT&T's ATM customers should be covered by the company's standard service-level agreement, says Lisa Pierce, telecom analyst at Giga Information Group Inc. The carrier guarantees 99.99 percent network availability, which is equivalent to 43 minutes of allowable down time per month. The company also guarantees traffic will be delivered roundtrip within 120 milliseconds. There are standard cell delivery SLAs that range from 99.95 percent to 99.99 percent.
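The downtime arithmetic behind an availability guarantee is easy to check. The short sketch below assumes a 30-day month (43,200 minutes); the function name and constants are illustrative and do not come from AT&T's SLA. Under that assumption, 99.99 percent availability allows roughly 4.3 minutes of downtime a month, while 43 minutes corresponds to 99.9 percent. The sketch also converts an outage of about 4 hours, as reported here, into a monthly availability figure for a customer who was shut out for the whole period.

```python
# Assumption: a 30-day month, i.e. 43,200 minutes. Real SLAs may define
# the measurement period differently.
MINUTES_PER_MONTH = 30 * 24 * 60


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_MONTH):
    """Return the downtime (in minutes) permitted by an availability guarantee."""
    return period_minutes * (100 - availability_percent) / 100


def availability_percent(outage_minutes, period_minutes=MINUTES_PER_MONTH):
    """Return the availability (in percent) for a period with the given outage."""
    return 100 * (1 - outage_minutes / period_minutes)


if __name__ == "__main__":
    # Downtime budget for several availability levels.
    for pct in (99.9, 99.95, 99.99):
        budget = allowed_downtime_minutes(pct)
        print(f"{pct}% availability -> {budget:.2f} minutes of downtime per month")

    # An outage of about 4 hours, for a customer affected the whole time.
    outage = 4 * 60
    print(f"{outage}-minute outage -> {availability_percent(outage):.3f}% availability")
```

By this reckoning, a customer who was down for the full 4 hours would fall short of either availability level for the month.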

"To prevent this type of event again we're looking at simplifying the trunk structure, which is a key fix as such events occur," McHenry says. AT&T has also removed some of the network management load on switches. And the company is looking at redirecting traffic to better balance the traffic load on each ATM switch. McHenry says this should not affect customer traffic.
