Watching the WAN

FRAMINGHAM (04/03/2000) - Reynolds Metals decided it needed a service-level monitoring tool shortly after it migrated from dedicated leased lines to frame relay, and users and applications began contending for bandwidth.

When people started complaining about slow response times or unavailable mainframe connections, network managers had no idea what the problem was, says Deborah Shashaty, a communications specialist at the fabricated aluminum manufacturer in Richmond, Virginia.

So Reynolds Metals turned to Visual Networks' Visual UpTime WAN management tool, which uses intelligent agents sitting on DSU/CSUs to capture and analyze the company's frame relay traffic. The agents can measure latency and throughput, break down bandwidth usage by Layer 4 protocols and determine whether a problem originates on the edge routers or the carrier circuit.

Reynolds Metals is far from alone. Service-level monitoring tools are hot in corporate America - particularly among dot-coms, online brokerages and other companies that stand to lose tens or hundreds of thousands of dollars for each hour of network downtime, says Rich Ptak, vice president of systems and applications management at Hurwitz Consulting in Boston.

With so much at stake, corporate IT executives want to manage carrier circuits as intrinsic parts of the enterprise network. They want to obtain current, detailed information about availability, performance and latency, and actively partner with carriers and service providers in capacity planning and proactive problem management.

One way that businesses attempt to manage carrier circuits is by enforcing service-level agreements (SLAs). An SLA is a contract in which carriers or service providers guarantee minimum service levels and spell out penalties for shortfalls. However, companies are increasingly finding SLAs to be blunt instruments at best.

"Carriers don't have the time or technological means to support much granularity in their SLAs," says Mary Jander, senior analyst at Enterprise Management Associates in Boulder, Colorado.

SLAs typically guarantee availability as the percentage of time a carrier's circuits are up over a specified time period. This gives the service provider far too much slack, IT executives have learned. If the SLA specifies 99 percent uptime averaged over a month, for instance, circuits can be down for seven hours before the carrier can be accused of SLA noncompliance. "If that 1 percent happens at year-end closing time, you're in trouble," says John Morency, executive vice president of Sage Research in Natick, Mass.
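The arithmetic behind that figure is easy to check: a 30-day month has 720 hours, and 1 percent of that is 7.2 hours. The following is a minimal sketch in Python, assuming a 30-day measurement period; the function name is illustrative and not taken from any carrier contract or product mentioned here.

```python
def allowed_downtime_hours(uptime_percent: float, period_days: float) -> float:
    """Hours a circuit can be down in the period without breaching the uptime target."""
    period_hours = period_days * 24
    return period_hours * (1 - uptime_percent / 100)


# Compare a few uptime targets over an assumed 30-day month.
for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target, period_days=30)
    print(f"{target}% uptime over 30 days allows {hours:.2f} hours of downtime")
```

Note that the calculation says nothing about when the outage occurs, which is the weakness Morency describes.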

Furthermore, the standard service-level reports that carriers generate are much too general and historical to be useful, users complain.

In the reports AT&T and MCI WorldCom were giving Reynolds Metals, for example, "The information was averaged over 15-minute intervals and was very after-the-fact: There was no way to see what was happening at any given time," Shashaty says. Moreover, standard carrier reports don't break down bandwidth usage by protocol. "We needed [to put] our own eyes into the network," she says.
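To see why a 15-minute average can hide a problem, consider a circuit that is saturated for three minutes and lightly used for the rest of the interval. The sketch below uses made-up per-minute utilization figures, not data from Reynolds Metals or its carriers, and the 95 percent saturation cutoff is likewise an assumption.

```python
from statistics import mean

# Hypothetical per-minute utilization (percent) of one circuit over 15 minutes.
per_minute_utilization = [20, 22, 18, 25, 21, 100, 100, 100, 19, 23, 20, 22, 18, 21, 20]


def summarize(samples):
    """Return the interval average and the worst single sample."""
    return mean(samples), max(samples)


average, peak = summarize(per_minute_utilization)
# Count the minutes at or above an assumed 95 percent saturation cutoff.
saturated_minutes = sum(1 for sample in per_minute_utilization if sample >= 95)

print(f"15-minute average: {average:.1f}%")
print(f"Peak minute: {peak}%")
print(f"Minutes at or above 95%: {saturated_minutes}")
```

The average looks unremarkable even though users would have seen a fully loaded link for three minutes, which is the kind of event a finer-grained monitor is meant to expose.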

Inside information

The need to monitor performance is driving a growing demand for WAN service-level monitoring tools that IT departments can use on their own. Most Fortune 500 firms are evaluating such products today, and 40 percent to 50 percent have implemented them, according to Hurwitz's Ptak.

Upwards of 60 vendors provide WAN service-level management tools in the following categories:

* Software agents residing on DSU/CSUs that conduct real-time monitoring of performance, bandwidth usage and latency across ATM, frame relay and T-1 connections.

* Desktop agents that measure end-to-end performance and latency between the client and a server.

* Probes that periodically poll SNMP and Remote Monitoring (RMON) Management Information Bases (MIB) on network devices.

* Platforms that collect, store and generate reports on information from a variety of management agents and systems.

These various types of products allow network executives to monitor service levels in all key areas covered by an SLA, including service availability, throughput and latency. Some tools also break down bandwidth usage by specific protocols, from Layer 3 and Layer 4 protocols such as TCP/IP up to application protocols such as HTTP.
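As an illustration of what an SNMP-polling probe does with the data it collects, link utilization can be derived from two readings of an interface's octet counter (ifInOctets in the standard interfaces MIB) taken a known interval apart. The sketch below is an assumption-laden example, not any vendor's implementation: the readings are invented, no SNMP library is used, and it assumes a 32-bit counter that wraps at most once between polls.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit octet counter wraps back to zero


def utilization_percent(prev_octets, curr_octets, interval_seconds, link_bps):
    """Average inbound utilization between two counter readings, in percent."""
    if interval_seconds <= 0 or link_bps <= 0:
        raise ValueError("interval_seconds and link_bps must be positive")
    # Modular subtraction handles a single counter wrap between polls.
    delta_octets = (curr_octets - prev_octets) % COUNTER32_MODULUS
    return delta_octets * 8 * 100 / (interval_seconds * link_bps)


# Hypothetical readings taken 60 seconds apart on a T-1 (1,544,000 bit/s).
print(utilization_percent(1_000_000, 6_790_000, 60, 1_544_000))
```

Polling more often shortens the interval being averaged, at the cost of more management traffic on the link being measured.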

Such information plays a critical role in pinpointing and resolving problems on the WAN, users attest. "Visual UpTime lets us see if the pipe is fully used, what type of traffic is going over it, and the top talkers," Reynolds Metals' Shashaty says. For example, when users report slow response over a network, net managers can check bandwidth usage. "We might discover that someone is doing a big FTP file transfer, so we know it's a problem with our traffic, and not the frame relay link," Shashaty notes.
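The kind of breakdown Shashaty describes amounts to totaling traffic by protocol and by host. A minimal sketch follows, assuming flow records are already available as simple tuples; the addresses, protocols and byte counts are invented for illustration and do not reflect Visual UpTime's data format.

```python
from collections import Counter

# Hypothetical flow records: (source host, protocol, bytes transferred).
flows = [
    ("10.1.1.15", "FTP", 48_000_000),
    ("10.1.1.22", "HTTP", 3_500_000),
    ("10.1.1.15", "HTTP", 1_200_000),
    ("10.1.1.40", "SNA", 900_000),
    ("10.1.1.22", "SMTP", 400_000),
]


def usage_by(records, field_index):
    """Total the byte counts, grouped by the chosen field of each record."""
    totals = Counter()
    for record in records:
        totals[record[field_index]] += record[2]
    return totals


by_protocol = usage_by(flows, 1)
by_host = usage_by(flows, 0)
total_bytes = sum(by_protocol.values())

# Share of bandwidth per protocol, largest first.
for protocol, byte_count in by_protocol.most_common():
    print(f"{protocol:5} {100 * byte_count / total_bytes:5.1f}%")

# The two hosts sending the most traffic.
print("Top talkers:", by_host.most_common(2))
```

In this invented sample a single FTP transfer dominates the link, the situation in which the problem lies with the company's own traffic and not the carrier's circuit.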

If the problem is on the enterprise network side, Reynolds' network staffers can fix it themselves. If not, they share their data with the carrier. In either case, the data eliminates finger-pointing and speeds troubleshooting.

Network executives increasingly want to do more than monitor traffic flowing between border routers across the WAN, however. They want to be able to manage WAN circuits as one element in the enterprise network; in other words, to measure end-to-end network performance from the client to the server and zero in on any problems.

Until recently, the industry was too fragmented to make this goal practical, and network managers were struggling to correlate and interpret information generated by a variety of tools and agents.

Just ask Todd Spears, a network analyst at First Union National Bank in Charlotte, N.C. Until recently, the bank was using the following products to manage service levels across the enterprise:

* Lucent's VitalNet (formerly Enterprise Pro), which monitored and analyzed network traffic for capacity planning purposes.

* NetScout Systems' NetScout Manager Plus, which performed trend analysis of protocol-specific traffic for capacity planning.

* Concord Communications' Network Health, which monitored permanent virtual circuit utilization and other threshold events on frame relay access devices.

* Paradyne's FrameSaver SLV DSU/CSU agents, which performed real-time monitoring of traffic on frame relay links.

* Network Associates' RouterPM, which polled various router MIBs and generated error, exception and utilization reports for problem resolution.

"We wanted one tool that would allow us to put all that information together and get the state of the world," Spears says. In particular, IT wanted to be able to map how network events affect service levels and application performance, and to identify the sources of problems.

In the past year, service-level management vendors have begun to address this need, according to Hurwitz's Ptak. "We're moving toward the ability to use a single vendor's product to get information across all pieces of the network environment and then associate it with the business service or application," he says.

A growing number of service-level monitoring platforms combine versatile probes and agents with a database and reporting infrastructure, enabling users to gather data from a variety of sources and "normalize" it so it can be analyzed, sliced, diced and presented in reports. Such products include Concord's Network Health, DeskTalk's Trend, InfoVista's VistaViews, Lucent's VitalNet and NextPoint's S3.

First Union is now implementing InfoVista's VistaViews. The bank chose the product because of its customization features - "You can poll routers once an hour or once a minute" - and the broad range of sources it monitors, Spears says. In addition to SNMP and RMON MIBs, VistaViews can gather data from proprietary sources such as Cisco's Service Assurance Agent, from any management tool that generates flat files, and from any Open Database Connectivity-compliant database.

Able agents

The next step for the bank is to deploy end-to-end service-level management.

There are now several desktop agents on the market that can monitor end-to-end performance and latency for specific protocols and applications such as Microsoft's Exchange and SAP R/3. First Union is evaluating First Sense's Enterprise and Lucent's VitalAgent desktop agent.

There are two types of agents to choose from: passive and active.

Passive agents are installed on the client and monitor whatever traffic the user generates. For example, Lucent's VitalAgent sits on desktops and monitors specific types of transactions, such as an HTTP request or a SQL database query. Server software collects the data and determines the source of the problem. First Sense's Enterprise is a similar type of passive agent.

An active agent measures response time by simulating application transactions, generally at regular intervals. For example, Response Networks' ResponseAgents query servers, measure response time, and then perform pings and other basic tests to pinpoint sources of problems. The tests are initiated by middleware entities called Domain Controllers. Users view the collected data on the Response Service Explorer console. All three products are components of the Response Center Suite, which costs $50,000 and up.
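The general idea of an active agent can be sketched in a few lines: run a synthetic transaction on a schedule, time it, and flag slow or failed attempts. The example below is a generic illustration under stated assumptions, not a description of how ResponseAgents work. The function names, the one-second threshold and the use of a plain TCP connection as the stand-in transaction are all hypothetical.

```python
import socket
import time


def tcp_connect(host, port, timeout=3.0):
    """A stand-in transaction: open and close a TCP connection to the server."""
    with socket.create_connection((host, port), timeout=timeout):
        pass


def measure(transaction):
    """Run one synthetic transaction; return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        transaction()
        succeeded = True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        succeeded = False
    return succeeded, time.perf_counter() - start


def run_probe(transaction, samples=3, interval_seconds=60.0, threshold_seconds=1.0):
    """Repeat the transaction at a fixed interval and flag slow or failed runs."""
    results = []
    for index in range(samples):
        succeeded, elapsed = measure(transaction)
        alert = (not succeeded) or elapsed > threshold_seconds
        results.append({"ok": succeeded, "seconds": elapsed, "alert": alert})
        if index < samples - 1:
            time.sleep(interval_seconds)
    return results
```

A probe against a real server might be started with `run_probe(lambda: tcp_connect("server.example.com", 80))`, where the host name is a placeholder. Because the transaction is synthetic, the probe keeps producing measurements whether or not anyone is at a desktop.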

Both active and passive agents have potential drawbacks. Because passive agents must wait for the user to generate specific traffic, they don't work when users aren't at their desktops. For example, it would be difficult to use the agents over the weekend to test whether you adequately fixed a network problem that surfaced Friday afternoon.

Active agents, on the other hand, depend on accurate information about the precise applications a computer runs in order to function effectively. This forces IT to perform lots of upfront discovery work and then check back regularly to see what's changed.

Some companies are waiting for the agent technology to mature before plunging in. "We've chosen to wait it out and probably jump over the current technology," says Bob Uhl, director of network technologies for Ernst & Young in New York. Once the professional services firm has finished moving most of its desktops to browser-based software, it may be possible to set up client-based response-time reporting through applets, he adds.

As an application service provider (ASP), Equant has a strong interest in monitoring customer service levels all the way to the desktop - but doesn't expect to accomplish this in a hurry. "The service-level management market is very fragmented. I don't think this will be a one-tool decision," says Anita Folk, a spokeswoman for the Atlanta company.

There are also logistical challenges associated with implementing software on all those desktops. Privacy is a concern when a company needs to install software on the desktops of partners or customers. And if you're an ASP, there are some serious scalability issues. "We deal with multiple customers. Just how many desktops are we talking about putting agents on?" Folk asks. "And do we ask users to standardize their applications so we can monitor them?"

In addition to potential difficulties with the agents, businesses - particularly ASPs such as Equant - are wondering how easy it will be to gather and correlate the data the agents deliver. "We'll need some kind of server engine," Folk says. Make that a very scalable engine.

Collective coordination

Many companies want a WAN management platform that not only collects client response-time data from client agents, but also correlates it with network performance and availability data generated by RMON and SNMP probes, DSU/CSU agents, and other service-level monitoring tools.

There have been some promising developments.

Concord, for example, recently acquired Empire, which sells active monitoring agents, and First Sense, which sells the passive agent Enterprise. Concord has promised to integrate the tools into its Network Health suite, although it hasn't yet announced a time frame.

Lucent's VitalSuite 7.0 provides a single infrastructure for collecting and reporting on data from VitalAgent, the desktop client agent, and VitalNet, the SNMP-based WAN and LAN monitoring tool.

Visual Networks is working to integrate Visual UpTime with two products the vendor recently acquired: Avesta's Trinity, which correlates service-level alerts and other events to determine the source of a problem, and Inverse's IP Insight, a client-based agent that monitors latency primarily over access lines.

Meanwhile, a working group within the Internet Engineering Task Force is developing an Application Performance Measurement MIB. The MIB will provide standardized definitions for key information associated with measuring end-to-end application performance over a network, says Steve Waldbusser, chief strategist at Lucent's VitalSoft division in Sunnyvale, Calif. Network managers will be able to gather data from different vendors' agents, then merge it with other SNMP-based data into reports. The standard is scheduled to become stable enough for vendor implementation in about a year, Waldbusser says.

As service-level monitoring tools become more powerful and widely used, the question arises as to whether the information they provide will pit corporate network managers against their carrier counterparts. Will customers use such tools to try to catch carriers breaching SLAs?

Not necessarily.

While companies are definitely using service-level monitoring tools to check if carriers are meeting SLA metrics, several people emphasized that they see little advantage in treating their carriers as adversaries.

"We could run Visual UpTime reports and see if we come up with the same numbers our carriers did, but it's a cumbersome process," Reynolds Metals' Shashaty says. Besides, she points out that the penalties a carrier pays for breaching the SLA don't even come close to matching the business cost of downtime.

"Anyway, we don't want our money back, we want quality of service," she says.

A more fruitful way to use such tools, some IT executives suggest, is working collaboratively with carriers to deliver better service. Shashaty notes, "We use our tools to help carriers meet high levels of availability and fast restoration, rather than waste a lot of time proving they don't."

Chuck Williams, First Union's senior vendor relationship manager, shares the same goal. "I envision us going to monthly review meetings and providing metrics, accurate information we could use to determine whether they're fulfilling their SLAs and to identify the source of a breakdown quicker," he says. "We just want to take some of the management responsibility so we can give useful information back to the carrier and be true partners."

Horwitt is a freelance writer and consultant in Waban, Massachusetts. She can be reached at ehorwitt@world.std.com.
