The Ultimate Archives

Four centuries from now, if a historian wants to read Al Gore's or George W. Bush's inaugural address from January 2001, he or she should be able to find it in a snap in the online electronic records archive now being developed by the U.S. National Archives and Records Administration.

"The goal is to preserve digital information for at least 400 years," say researchers from the San Diego Supercomputer Center, who have provided much of the scientific brainpower behind the project.

As the federal government shifts more of its work from paper to electronic documents, the National Archives must radically rethink long-term preservation of records. Computers and formats rapidly become obsolete, rendering documents created just a few years ago unreadable. The problem is how to make documents readable centuries from now, when computers beyond imagining today are likely to be in use.

"It has been described as the archival equivalent of the first moon shot," said John Carlin, archivist of the United States.

Carlin and other archives officials are confident they will have a pilot version of the electronic records archive in operation by 2004 or 2005, at an estimated cost of $130 million.

The Migration Problem

Until recently, the Archives' attempt to build such an electronic archive seemed like a technically impossible dream. In theory, obsolescence can be overcome by migrating electronic data to more modern systems. But at the present pace of evolution, software used to manage archival collections changes every three to five years. Combine that rapid rate of obsolescence with the explosive growth in the number of electronic records, and mass migration becomes impractical in reality.

"The time needed to migrate to new technology may exceed the lifetime of the hardware and software systems that are being used," eight scientists from the San Diego Supercomputer Center wrote in a technical paper describing the new electronic archive.

The migration problem is further complicated by archival rules of order.

Official records must remain authentic. That means their contents can't change, and in most instances, neither should their appearance. Paper records always look the same, but electronic records can look very different - or become incapable of being viewed at all - if the software needed to display them properly no longer exists.

That's already a problem for documents created a decade or so ago in formats that are no longer used. "Electronic records are only as good as they are authentic," said Reynolds Cahoon, assistant archivist of the United States and head of the effort to create an electronic archive. "If they aren't authentic, everything is for naught."

Records exist in thousands of formats, and the challenge of keeping up with new ones as they come out and old ones as they are discarded quickly becomes insurmountable. So the archivists concluded that the best way to solve it was to avoid dealing with formats altogether.

Finding the Right Language

Carlin dramatized the solution in March when, while presenting the U.S. Congress with his 2001 budget request, he announced that two years of work by computer scientists had led to "a major technological breakthrough" in storage technology for electronic records.

Researchers, he said, had developed methods for storing electronic records that promise to preserve them for hundreds of years and keep them readable despite the obsolescence of the software and hardware used to create them. Three years ago, scientists would have said it couldn't be done, Carlin said.

"But now they have demonstrated it to us and given us confidence that in three to five years we will be able to deal with the massive volume of federal records in various formats and from various generations of technology," he said.

Working with the San Diego Supercomputer Center, Georgia Tech Research Institute and several other government agencies, the Archives has discovered a method that promises to permit storing records "totally independent of their software and hardware," Cahoon said.

A process called "persistent object preservation" appears capable of stripping the display characteristics of any electronic document - whether text, spreadsheet, photo or map - and storing it in a format that will allow it to be called up by whatever software is being used in the future.

The format of choice is Extensible Markup Language or XML, a standard language for transmitting data from one computer to another. "Tags" within XML documents tell the receiving computer how to read and format the data.

Here is how the electronic archive would work: An incoming electronic document would be converted into an XML document. This involves identifying the components of the document using XML document type definitions, replacing proprietary or nonstandard formats with XML tags and preserving information about the document's appearance.
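As a rough illustration of that conversion step, the sketch below wraps a simple e-mail record in XML using Python's standard library. The tag names (`email`, `sender`, `recipient`, `appearance`, `property`), the function name and the sample addresses are assumptions made for this example; the article does not describe the archive's actual document type definitions.

```python
import xml.etree.ElementTree as ET

def to_xml_record(sender, recipients, subject, body, appearance):
    """Wrap an e-mail record in XML tags (tag names are hypothetical)."""
    record = ET.Element("email")
    ET.SubElement(record, "sender").text = sender
    for address in recipients:
        ET.SubElement(record, "recipient").text = address
    ET.SubElement(record, "subject").text = subject
    ET.SubElement(record, "body").text = body
    # Appearance is stored as plain data next to the content,
    # rather than being locked inside a proprietary file format.
    style = ET.SubElement(record, "appearance")
    for name, value in appearance.items():
        ET.SubElement(style, "property", name=name).text = value
    return record

# Made-up sample record, for illustration only.
record = to_xml_record(
    sender="alice@example.gov",
    recipients=["bob@example.gov", "carol@example.gov"],
    subject="Quarterly report",
    body="The report is attached.",
    appearance={"font": "Courier", "size": "12pt"},
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The point of the sketch is that both the content and the description of how it looked end up as tagged text, which any future XML reader could parse without the original e-mail program.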

XML tags will also make it easier for search engines to locate documents after they are stored. For example, e-mail messages in XML could be searched by the names of senders and receivers, while omitting names mentioned in the message's text. Document type definitions will also make it possible to link related documents in groups or collections of records, a key requirement in archiving.
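The difference between a tag-based search and a plain text search can be sketched in a few lines of Python. The tiny archive, its tag names and the people in it are invented for this example and do not come from the Archives project.

```python
import xml.etree.ElementTree as ET

# A made-up two-message archive; tag names are assumptions.
ARCHIVE = """<archive>
  <email id="1"><sender>alice</sender><recipient>bob</recipient>
    <body>The budget request is ready.</body></email>
  <email id="2"><sender>bob</sender><recipient>carol</recipient>
    <body>alice asked about the pilot.</body></email>
</archive>"""

def find_by_sender(root, name):
    """Match only messages whose <sender> tag equals the name."""
    return [e.get("id") for e in root.iter("email")
            if e.findtext("sender") == name]

def find_by_text(root, name):
    """Match any message in which the name appears anywhere."""
    return [e.get("id") for e in root.iter("email")
            if name in "".join(e.itertext())]

root = ET.fromstring(ARCHIVE)
sender_hits = find_by_sender(root, "alice")
text_hits = find_by_text(root, "alice")
print(sender_hits)  # only message 1, which alice sent
print(text_hits)    # messages 1 and 2, since message 2 mentions alice
```

Because the sender is marked with its own tag, the first search can ignore the message in which the name is merely mentioned in the body.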

Once converted to XML and tagged, documents would be stored in a "container," which in turn is stored in a "repository." For now, the container is a 100-gigabyte tape cartridge, but that is likely to change as new storage technology is developed. The physical repository is a robotic storage warehouse - or multiple warehouses scattered nationwide and linked electronically.

Presiding over the repository is a computerized "storage resource broker," which functions as middleware between the repository and applications used to store and retrieve records. The storage broker retrieves records and uses document type definitions to reassemble collections of records, wherever they are in the archive.
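The relationship among containers, the repository and the broker might be pictured with the toy model below. It is only a sketch under simplifying assumptions: the class names, method names and in-memory dictionaries are hypothetical, and the real storage resource broker is certainly far more elaborate.

```python
from dataclasses import dataclass, field

@dataclass
class Container:
    """Stands in for one storage unit, such as a tape cartridge."""
    container_id: str
    # record_id -> (collection name, XML text)
    records: dict = field(default_factory=dict)

class StorageBroker:
    """Hypothetical middleware between applications and the repository."""

    def __init__(self):
        self.containers = {}  # the "repository": container_id -> Container
        self.index = {}       # record_id -> container_id

    def store(self, container_id, record_id, collection, xml_text):
        container = self.containers.setdefault(
            container_id, Container(container_id))
        container.records[record_id] = (collection, xml_text)
        self.index[record_id] = container_id

    def retrieve(self, record_id):
        """Return a record without the caller knowing where it is stored."""
        container = self.containers[self.index[record_id]]
        return container.records[record_id][1]

    def collection(self, name):
        """Reassemble a collection from records spread across containers."""
        return sorted(
            record_id
            for container in self.containers.values()
            for record_id, (coll, _) in container.records.items()
            if coll == name
        )

broker = StorageBroker()
broker.store("tape-A", "r1", "claims", "<record>first</record>")
broker.store("tape-B", "r2", "claims", "<record>second</record>")
broker.store("tape-B", "r3", "budget", "<record>third</record>")
print(broker.retrieve("r2"))
print(broker.collection("claims"))
```

Applications ask the broker for a record or a collection by name; which container holds each record stays hidden behind the broker, so the storage medium can change without changing the applications.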

Still a Theory

So far, a test version of the electronic archive has passed a number of hurdles, including one that involved taking in a million e-mail messages, converting them to XML documents, tagging them, storing them and calling them back up. The process took less than two days, Archives officials say.
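To put that test in perspective, a quick calculation gives the minimum average throughput it implies. The only inputs are the two figures quoted above; since the run took less than two days, the true rate was somewhat higher.

```python
MESSAGES = 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Upper bound on elapsed time: two full days.
max_seconds = 2 * SECONDS_PER_DAY

# Lower bound on the average processing rate.
min_rate = MESSAGES / max_seconds
print(f"at least {min_rate:.1f} messages per second")  # at least 5.8 messages per second
```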

"We can prototype the concept and make it work," Cahoon said. "But we are nowhere near ready to assemble" an archive as large or complex as the national electronic archive will have to be.

Even when the electronic archive is up and running, work on it won't be finished, he noted. "You can't just build this once; it's never done. Parts will become obsolete, so you have to constantly evolve. It's designed so any piece of the system can be exchanged for new components" and still be compatible with the XML-based application of the other components.

But the burden of constant upgrading is also a benefit. As computing power increases, its price declines. The archivists are counting on that trend to make it economically possible to keep up with the swiftly rising volume of records that must be stored, Cahoon said.

The U.S. Department of Veterans Affairs is one of the agencies that could benefit early from the electronic archive project. On a daily basis, the VA needs access to veterans' records to process claims and determine eligibility for benefits. "We spend a good amount of time trying to track down records," said VA spokesman Steve Westerfeld. Determining eligibility often takes months.

"We're in favor of anything that allows easier access and enables us to get hold of records quicker and serve veterans better."

"The challenge we face as records move more and more to electronic is how access is going to be provided," Carlin said. The electronic archive is "on the cutting edge of research and technology. Nothing comparable has ever been done."
