BOSTON (06/12/2000) - The Washington state legislature has a data management problem. Its constituent hot line has topped 2 terabytes (TB) of online storage and keeps growing.
The politicians see the hot-line data as critical to their future, says Kevin Hayward, a database administrator at the state capital in Olympia. "If you're not communicating with the people who have elected you, they're not going to re-elect you," he says.
Hayward is confronting what's quickly becoming the most complex problem facing information technology: managing amounts of data that are growing faster than expected.
Delta Air Lines Inc. in Atlanta, for example, put more than 80TB online in less than one year. And Critical Path Software Inc. in Portland, Oregon, created the same amount of information in half that time. Adding more disk storage systems with many more servers is too expensive, and using proprietary storage-area network (SAN) products could prove risky if your vendor of choice doesn't prevail in the SAN standard contest.
The storage problem is only going to get worse because of e-commerce, says Richard Winter, president of Winter Corp. in Waltham, Massachusetts. "Web shoppers' 'clickstream' data is creating an immense amount of information," says Winter. Web sites need to collect and analyze everything - what people looked at, compared with, visited repeatedly and ordered or dropped from a shopping cart, he says.
Although e-commerce CIOs don't have many low-cost, streamlined alternatives to the data management problem today, the future looks a bit brighter because of work that's being done by researchers at Lawrence Berkeley National Laboratory's National Energy Research Scientific Computing Center (NERSC) in Berkeley, California. They have developed a way to use tape systems that operate as if all the tape data resides on disks.
STACS of Data
Arie Shoshani, head of the scientific data management group at NERSC, has been working for years with data-intensive applications, such as those used in high-energy physics. And he knows that moving large files from tape to disk takes time. For example, a typical 1TB application running in the NERSC supercomputer center could take as long as 30 hours just to load data. When you're exploring the fundamentals of the universe, you can expect to wait a while, Shoshani says. But when some programs demand to search 300TB and beyond, waiting months for data to be searched - not even processed - is too long even when charting the moments after the Big Bang.
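To see why the wait stretches into months, the quoted figures can be scaled up directly. The short Python sketch below is only a back-of-the-envelope check, not anything NERSC runs: it assumes decimal units (1TB = 10^12 bytes) and that the 30-hour-per-terabyte load time holds constant across the whole 300TB.

```python
# Back-of-the-envelope scaling of the load time quoted above.
# Assumptions (not from the article): decimal units and a constant transfer rate.
TB = 10**12                      # bytes in one terabyte (decimal)

load_hours_per_tb = 30           # "as long as 30 hours" for a 1TB application
dataset_tb = 300                 # size of the larger searches mentioned

# Transfer rate implied by loading 1TB in 30 hours, in megabytes per second
rate_mb_per_s = TB / (load_hours_per_tb * 3600) / 10**6

# Time to move the whole data set at that same rate
total_hours = load_hours_per_tb * dataset_tb
total_days = total_hours / 24

print(f"implied rate: {rate_mb_per_s:.1f} MB/s")
print(f"{dataset_tb}TB at that rate: {total_hours} hours, about {total_days:.0f} days")
```

Under those assumptions the full scan takes roughly a year of continuous transfer, which is consistent with the "waiting months" complaint even if several drives work in parallel.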
Adding the necessary online disk storage systems isn't practical because of the high costs. "Disk prices are coming down, but tape system costs are going down at roughly the same rate, and there is still a 10-to-1 ratio in favor of tape," Shoshani says.
That ratio helped inspire him and fellow researchers to seek solutions for efficiently managing data on tape. They created the Storage Access Coordination System (STACS) by working closely with physicists, climate modelers and scientists as they developed their data-hungry applications.
"Most systems store data in the order in which they are received," Shoshani notes. "But that may not be the best order for analyzing the data for the science involved."
In one instance, scientists captured the results of millions of particle collisions, called "events," which are created in an accelerator. When they need to analyze these events, physicists typically want only a subset of the millions of events. Searching all 300TB of available data would require reading 10,000 30GB tapes - a daunting prospect when all they want is a small subset.
That's where STACS comes in. It handles the queries the application makes of the stored data. The system minimizes the number of files and tapes that have to be read by using a specialized index of the millions of events. It optimizes retrieval by grouping queries that request the same data. It also schedules bundles of files that will need to be processed at the same time or in parallel.
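The general idea of an event index plus query grouping can be illustrated with a toy example. The Python sketch below is not the STACS implementation: the index contents, the tape and file names, and the functions `files_for_query` and `plan_reads` are all hypothetical, chosen only to show how an index lets a scheduler read each file once and skip tapes that no query needs.

```python
from collections import defaultdict

# Hypothetical index: event id -> (tape, file) holding that event.
# A real index would be built from event properties, not hand-written.
EVENT_INDEX = {
    1: ("tape_A", "f1"), 2: ("tape_A", "f1"), 3: ("tape_A", "f2"),
    4: ("tape_B", "f3"), 5: ("tape_B", "f3"), 6: ("tape_C", "f4"),
}

def files_for_query(event_ids):
    """Return the set of (tape, file) pairs that hold the requested events."""
    return {EVENT_INDEX[e] for e in event_ids}

def plan_reads(queries):
    """Group queries by file so each file is read once, ordered by tape."""
    wanted_by = defaultdict(set)          # (tape, file) -> names of queries
    for name, event_ids in queries.items():
        for location in files_for_query(event_ids):
            wanted_by[location].add(name)
    # Sorting by (tape, file) keeps reads from the same tape next to each other
    return [(tape, f, sorted(qs)) for (tape, f), qs in sorted(wanted_by.items())]

queries = {"q1": [1, 2, 4], "q2": [2, 3], "q3": [5]}
plan = plan_reads(queries)
for tape, f, qs in plan:
    print(tape, f, qs)
```

In this toy run, three queries touching five events need only three file reads from two tapes, and `tape_C` is never mounted because no query asks for event 6.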
STACS' inventors designed the specialized index to understand how data in the files - properties of "events" in the particle physics case - relate to requested queries.
By deriving advance information on all the files needed for the query, STACS can grab files before the query processing. This makes applications seem as swift as if the files were in disk cache when they were needed.
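Fetching ahead of the query is a familiar producer-consumer pattern. The Python sketch below is a minimal illustration under stated assumptions, not how STACS is built: `stage_from_tape` is a stand-in for a slow tape read, the two-slot queue plays the role of a small disk cache, and the file names are made up.

```python
import queue
import threading

def stage_from_tape(file_name):
    """Hypothetical stand-in for a slow tape read; returns placeholder contents."""
    return f"contents of {file_name}"

def prefetch(file_names, cache):
    """Stage files in the order the query will need them."""
    for name in file_names:
        cache.put((name, stage_from_tape(name)))   # blocks while the cache is full
    cache.put(None)                                 # sentinel: nothing left to stage

def run_query(file_names, cache_slots=2):
    """Process files while a background thread keeps the cache filled."""
    cache = queue.Queue(maxsize=cache_slots)        # bounded "disk cache"
    worker = threading.Thread(target=prefetch, args=(file_names, cache), daemon=True)
    worker.start()
    processed = []
    while (item := cache.get()) is not None:
        name, data = item
        processed.append(name)                      # placeholder for real analysis
    worker.join()
    return processed

result = run_query(["f1", "f2", "f3"])
print(result)
```

Because the list of files is known before processing starts, the staging thread can stay a step or two ahead, so the analysis loop rarely has to wait on tape.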
Shoshani says business-intelligence users will need something like STACS if they continue amassing data at current rates.
But it's doubtful that even the most data-rich Web site can compare in storage needs with the physics community's next big assignment: the Atlas Project, a high-energy physics accelerator that will begin producing up to two petabytes of data per year in 2005.
Future computers may use only very large-capacity disks to handle even the largest jobs. Until then, Shoshani says, there's tape.