Working to a New Standard
- 01 May, 2000 12:01
FRAMINGHAM (05/01/2000) - On a workday morning at a command center monitoring and controlling information systems for a global corporation, you'd expect hustle-bustle, barked commands, some flashing lights and maybe even the peep, peep, peep of an alarm.
Instead, the softly lit amphitheater of the Operations Control Center (OCC) at the Prudential Insurance Company of America Inc. is nearly empty. Three or four technicians sit in generously proportioned chairs, looking at big monitors and speaking quietly into their telephone headsets. Across the room, two or three men stand, coffee mugs in hand, apparently discussing something they're watching on a monitor.
This handful of engineers in the Roseland, New Jersey, OCC is monitoring and controlling Prudential's call centers, three data centers, 15 OS/390s and nearly 75,000 desktops at 1,466 sites worldwide.
Five years ago, information technology managers at Newark, New Jersey-based Prudential saw their systems mushrooming and projected that growth into a future in which they would be overwhelmed and the business units that depended on them would grind to a halt.
"I figured out that to manage systems just for one of the data centers, you'd need to watch 18 separate consoles," says Kenneth Tyminski, vice president of information services at Prudential's Corporate Technology Services. "An alarm could flash across the screen, and unless you happened to be watching that monitor at that moment, you'd miss it."
What was needed was a way to see the status of all systems in real time, in a format that was easy to read and that offered great detail, all on a single console, he says.
Tyminski and Arun Kant, vice president of enterprise systems management at Prudential's Corporate Technology Services, saw a way to build such a porthole.
They worked with a small software developer, Accessible Software Inc. in Parsippany, New Jersey, which this year was bought by IBM subsidiary Tivoli Systems Inc. in Austin, Texas.
Their plan included integrating Common Information Model (CIM) into Accessible's monitoring-data correlation and report software, Access 1.
CIM? What's That?
CIM was barely a concept at the time, and no hardware or software yet supported it. "At the time, people were saying, 'What's CIM?' " Tyminski says. "We had to do something about managing our systems right away; we couldn't wait for CIM," he says. "But we looked at [how Simple Network Management Protocol (SNMP) standards had changed network management], and we wanted to be ready to take advantage of CIM when it became available."
Nearly all network hardware and software support SNMP standards. SNMP agents on network devices collect data on device activity and report it to network management systems in the structure defined by a Management Information Base (MIB). Use of the standard lets devices from different vendors interoperate out of the box.
CIM extends that interoperability and control. The standard is a set of schemata for describing management data for applications, devices, services and the relationships among them as an object.
If CIM were as commonly supported as SNMP, systems managers could add applications, devices and services and change relationships to reflect business processes, and their systems would automatically recognize and be able to manage each as an object.
In the past five years, engineers at Prudential, Accessible and enterprise management software vendor Tivoli have worked to evolve Prudential's management systems with an eye to standards of the future. For example, CIM relies on a common repository of definitions used by all CIM objects. A change to a definition in the repository is automatically reflected in each object throughout the system. Prudential designed its systems architecture on that model.
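The repository model described above can be illustrated with a minimal Python sketch. It is not CIM itself and not Prudential's or Tivoli's code: the class names, the `poll_interval_seconds` property and the router objects are all invented for illustration. The point is only that objects look up a shared definition rather than copying it, so one change in the repository is seen by every object.

```python
from dataclasses import dataclass


class DefinitionRepository:
    """Hypothetical shared store of class definitions (not a real CIM API)."""

    def __init__(self):
        self._definitions = {}

    def define(self, class_name, **properties):
        # Create or update the single shared definition for a class.
        self._definitions.setdefault(class_name, {}).update(properties)

    def lookup(self, class_name):
        return self._definitions[class_name]


@dataclass
class ManagedObject:
    """A managed object that refers to its definition instead of copying it."""
    name: str
    class_name: str
    repository: DefinitionRepository

    def describe(self):
        # Resolved at call time, so repository changes show up immediately.
        return dict(self.repository.lookup(self.class_name))


repo = DefinitionRepository()
repo.define("Router", poll_interval_seconds=60)  # invented property
routers = [ManagedObject(f"router-{i}", "Router", repo) for i in range(3)]

# One change in the repository is visible through every object.
repo.define("Router", poll_interval_seconds=30)
intervals = [obj.describe()["poll_interval_seconds"] for obj in routers]
```

Because no object holds its own copy of the definition, there is nothing to synchronize after a change, which is the property the article attributes to the CIM repository.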
Even before CIM was issued as a standard by the Distributed Management Task Force (DMTF), vendors such as Cisco Systems Inc. in San Jose built some CIM functionality into network hardware. Some software vendors, including Tivoli in its Enterprise framework software, which Prudential uses, and Microsoft Corp. in Windows 2000 and Windows Management Instrumentation, have built on CIM standards.
"With CIM, new applications, new devices [that are CIM-enabled] will just snap into our" monitoring system, helping to make operating the OCC "even more efficient and proactive," Tyminski says.
"It's already making a difference," he says. "We collect all this information on the infrastructure, the applications, the network. CIM allows us to use the information in a different way."
He adds: "Business units used to just want to know whether their applications were up or were they down, are they available or not? Now they want to know what's happening with them. If it's slow, what transactions are slowing it down, how is it being fixed, and when is it going to be available again?"
Not a lot of time was devoted to planning. The development "process depends on a feedback loop. We do something, then see how it's working and ... tweak the process," he says.
During the "tweaking," when the system wasn't yet working, getting high-level executive support was even more critical than initial buy-in, Tyminski says.
Implementation of the new system began in 1998 when the OCC went live. System outages were cut by 31 percent last year. Downtime - planned or resulting from unavailable applications, for example - was reduced by 38 percent, and the mean time to repair problems in the systems infrastructure was cut by 55 percent, Tyminski says.
"But the best part is we've been able to do it proactively," he says. "You have to remember that it's not just about the tools. It's about the right process and a lot of people making it work. Once you have that, you fill in with the tools to support the process."
More Work, Fewer People
Tyminski leads a tour of the OCC, pointing out technicians in charge of NT servers, Web servers, DB2 and CICS applications. And at the far back, overseeing it all, sit the problem managers.
When a situation can't be resolved quickly, a problem manager takes over, coordinating personnel from various engineering groups.
On a recent Friday morning, two problem managers were on duty, although only one sat monitoring a screen.
The OCC is staffed around the clock by 107 people, deployed over three shifts, Kant says. "Three years ago, we probably had about 160. Now, at peak times, there are maybe 35 on duty," he says. Usually, there are only about 10 or 15 people on duty at a time.
What makes it possible to manage such large systems with such a small staff is the Access 1 software, renamed Tivoli Manager for OS/390. "It's misnamed," Tyminski says of the product, because although "it may have started out for OS/390, we manage Windows, Unix, our call centers, all kinds of things with it."
Kant and Tyminski walk through a demonstration on a live screen.
Each business unit's icons are linked to a series of icons representing all the applications, devices, services and networks supporting that line of business.
A red arrow pops up on an icon, and Kant clicks to drill down to see where the problem is. The red arrow is on the NT server icon. Another click brings up another layer, another red arrow. One more click exposes another layer but no red arrow, indicating that the problem lies in the layer above, which is a CPU in Scottsdale. A window at the bottom of the screen reads - in plain English, not in the language and syntax of the monitoring application - that a network connection is down.
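The drill-down Kant demonstrates can be sketched as a walk down a tree of components. This is an illustrative Python sketch, not the Tivoli product's logic: the `Node` class, the `drill_down` function and every component name other than the NT servers and the Scottsdale CPU are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    failed: bool = False  # True if this component itself is faulty
    children: list = field(default_factory=list)

    def is_red(self):
        # A node shows a red arrow if it, or anything beneath it, has failed.
        return self.failed or any(child.is_red() for child in self.children)


def drill_down(node):
    """Follow red arrows downward and return the path to the deepest red node."""
    if not node.is_red():
        return []
    path = [node.name]
    while True:
        red_children = [child for child in node.children if child.is_red()]
        if not red_children:
            # No red arrow in the next layer: the problem is at this node.
            return path
        node = red_children[0]
        path.append(node.name)


# Invented hierarchy loosely following the demonstration.
cpu = Node("CPU (Scottsdale)", failed=True,
           children=[Node("disk"), Node("memory")])
nt_servers = Node("NT servers", children=[Node("CPU (other site)"), cpu])
business_unit = Node("Business unit",
                     children=[Node("Web servers"), nt_servers])

path = drill_down(business_unit)
```

Here `path` lists the icons an operator would click through, ending at the component whose own layer shows no further red arrow.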
Converting alert messages into English was one of the most difficult hurdles, Kant says. Not only were similar events reported differently in different monitoring applications and different platforms, he says, but language familiar to the IT engineers was unfamiliar to the systems management engineers.
"We had to make the languages for error messages and conditions the same whether it was for a Cisco device or a laptop or anything on the network," Kant says.
Kant's group has written about half the rules, mapping definitions from each application to a definition in a common repository, with help from Accessible (now Tivoli) to write the rest.
"We had to come up with a common library so that all fault analyses would come up the same, and we wouldn't have to write different rules, set different thresholds for each [application]," he says.
The Directory-Enabled Networking standard for CIM, expected this year from the DMTF, works on just such a library, or repository, as the one Prudential built.
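The common library Kant describes, in which differently worded alerts map to one shared definition and one plain-English message, might look something like the following Python sketch. The source names and native alert codes are hypothetical and are not real output from Cisco, Windows NT or OS/390 tools; only the message "A network connection is down" comes from the demonstration.

```python
# Common definitions: one plain-English message per kind of event.
COMMON_EVENTS = {
    "LINK_DOWN": "A network connection is down.",
    "CPU_HIGH": "Processor utilization is above the threshold.",
}

# Mapping rules: (source, native code) -> common event.
# All sources and codes below are invented for illustration.
MAPPING_RULES = {
    ("router", "IF-3-DOWN"): "LINK_DOWN",
    ("nt_server", "NIC_DISCONNECTED"): "LINK_DOWN",
    ("mainframe", "CPU90"): "CPU_HIGH",
}


def normalize(source, native_code):
    """Translate a tool-specific alert into a common event and its message."""
    event = MAPPING_RULES.get((source, native_code))
    if event is None:
        # Unknown alerts are surfaced rather than silently dropped.
        return ("UNMAPPED", f"Unrecognized alert {native_code!r} from {source}.")
    return (event, COMMON_EVENTS[event])


router_alert = normalize("router", "IF-3-DOWN")
server_alert = normalize("nt_server", "NIC_DISCONNECTED")
```

Because both alerts resolve to the same common event, a single rule or threshold written against `LINK_DOWN` covers every source, which is the saving Kant points to.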
"CIM is going to be indispensable in the e-business environment," Kant says.
For example, "you can go to our Web site and do several things. You can see information on different mutual funds or you can see how your mutual funds are doing," he says.
"Now when [users try to] access information and there's a failure, they get a message saying, 'Page not found.' They call the help desk, and that [error message] is not very helpful in tracking down the failure," he says. "Because all we know is there's been a content failure; we don't know which application is responsible. Or it may not be an application failure. There may have been a lot of hits all at once, and the server is timed out. Or a load balancer may have gone bad or routed [the request] to the wrong content."
The arrow is still red, and Tyminski clears his throat and raises his voice slightly. "Are we aware of this connection down in Scottsdale?" he asks of those in the OCC.
A voice from the NT server bank answers, "Yeah, we know about it; we're working on it now."
Kant double-clicks to show where the online runbooks - which include notes such as code changes - for each application will be linked by next year. The red arrow disappears.
The two men beam. "That's the way it's supposed to work," Kant says.
"This isn't easy," Tyminski says of the work that went into developing the OCC and the underlying technology.
"It takes a lot of perseverance because you have to commit to an ongoing process. [The process] doesn't ever end," he says. "Come back in six months, and we'll be managing further out into the environment. We're in a constant state of evolution."
Prudential: In Control
Prudential's OCC manages:
-- 1,466 sites worldwide
-- 15 OS/390s with more than 10,000 MIPS of processing power and a 50-terabyte direct-access storage device
-- 4,095 Windows NT and Unix servers
-- 74,854 networked NT desktops and Windows 95 laptops
Network:
-- 1,100 routers
-- 500 switches
-- 900 hubs
-- 3,800 LAN segments
-- 1,700 WAN links
-- 7,500 token ring or Ethernet adapter interfaces
What's CIM?
Common Information Model is a set of schemata for describing management data for applications, devices, services and the relationships among them.
Now a Web-based Enterprise Management standard approved by the Distributed Management Task Force, CIM development began in 1996 as a joint effort of BMC Software Inc., Cisco Systems, Compaq Computer Corp., Intel Corp. and Microsoft.