A record is a record, whether it's a sheet of paper, an e-mail, an electronic document or a digital image.
"It's the content that drives retention, not the media it's written on," says Adam Jansen, a digital archivist for the state of Washington. And recent federal regulations are requiring more companies to save more content for longer periods of time.
While content may be king in theory, in practice, the media on which it's stored and the software that stores it present problems. As digital tapes and optical discs pile higher and higher in the cavernous rooms of off-site archive providers, businesses are finding them increasingly expensive to maintain.
The software that created the data has limited backward compatibility, so newer versions of a program may not be able to read data stored under older versions.
Moreover, the media on which the data is stored degrade relatively quickly. "Ten years is pushing it as far as media permanence goes," says Jansen.
Today, the only safe path to long-term archiving is repeated data migration from one medium and application to another throughout the data's life span, experts say.
But the storage industry is working on the problems from various angles.
One solution to the backward-compatibility problem is to convert data to common plain-text formats, such as ASCII or Unicode, which support all characters across all platforms, languages and programs. Using plain-text formats to store data enables virtually any software to read the files, but it can cause the loss of data structure and rich features such as graphics.
Another approach is to store long-term data as PDF files. PDFs have had backward-compatibility problems of their own, but the format's developer, Adobe Systems, has created an archival version of the format, called PDF/A, that addresses them.
To date, the most promising standard data-storage technologies are emerging in new XML-based formats, according to analysts and studies. XML is a self-describing markup language whose plain-text files are independent of any particular hardware, operating system or application.
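"Self-describing" means each value travels with its own label, so a future reader needs no external schema or original application to make sense of the file. A rough sketch using Python's standard library (the element names here are illustrative, not any real archival schema):

```python
# Build a small self-describing archival record.
# Element names are invented for illustration only.
import xml.etree.ElementTree as ET

record = ET.Element("record")
ET.SubElement(record, "title").text = "Quarterly budget report"
ET.SubElement(record, "creator").text = "Finance Department"
ET.SubElement(record, "created").text = "2005-06-30"
ET.SubElement(record, "retention").text = "permanent"

xml_bytes = ET.tostring(record, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Because the labels are stored alongside the data, any generic XML parser — decades from now — can recover both the values and their meaning.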
On the media side, the Storage Networking Industry Association (SNIA) is working toward solving what it calls the "100-year archive dilemma" through a standards effort for media. The goal is to store data in a format that will always be readable by a generic reader.
"Degrading media is not at all the issue. Rather, the real issue is long-term readers and compatibility -- the logical problem which we intend to address," says Michael Peterson, president of Strategic Research, and program director for the SNIA Data Management Forum.
Some businesses are postponing the long-term archival problem with large farms of disk arrays, which keep data online and accessible. Jim Damoulakis, chief technology officer at Framingham, Mass.-based consultancy GlassHouse Technologies, suggests that companies look into using an emerging class of inexpensive disk arrays as a storage medium. "At least you know the data is there and readable," he says. "A tape or optical media sitting in a vault can degrade."
The new disk arrays, sometimes called disk libraries, are based on relatively inexpensive ATA disks, formerly used only in PCs.
Peterson says that this is a temporary solution, however. "Long term, I am not sure that current disk interfaces won't have the same migration problem [as tape]," he says. "Whether it is tape or disk, you are going to have to migrate."
Meanwhile, users struggle on. Last October, for example, Jansen and his IT team completed a three-year project to create an open-systems-based archive management center for the state of Washington that will house records from 3,300 state and local agencies in perpetuity.
The center currently stores 5TB of data and is expected to grow to 25TB by the end of the year. It cost about US$1.5 million for management software and hardware, including servers, a storage-area network and tape drives. Washington spent US$1 million more on a joint development project with Microsoft, which is helping the state create what it hopes will become an open format.
"We want to avoid proprietary file formats to the extent it's possible," Jansen says.
He says that the most important part of any long-term archival system is centralizing the backup of data in order to standardize the storage method. At the heart of the state's archival system is the storage of metadata, the information that describes the data.
When documents are transmitted over the WAN to a central data center, information such as who created the document, what type of document it is, where it was created, when it was created and why it was created is captured and stored in a SQL database. That way, "20 years from now, you don't have to know that particular document, but you can perform a search based on the record type," Jansen says.
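Jansen doesn't describe the schema itself, so the table and column names below are only a minimal sketch of the idea (using SQLite in place of the state's SQL server):

```python
import sqlite3

# In-memory database for illustration; the state runs a SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE document_metadata (
        doc_id      INTEGER PRIMARY KEY,
        creator     TEXT,   -- who created the document
        record_type TEXT,   -- what type of document it is
        origin      TEXT,   -- where it was created
        created_at  TEXT,   -- when it was created
        purpose     TEXT    -- why it was created
    )
""")
conn.execute(
    "INSERT INTO document_metadata "
    "(creator, record_type, origin, created_at, purpose) "
    "VALUES (?, ?, ?, ?, ?)",
    ("J. Smith", "meeting-minutes", "city clerk", "2005-03-14", "public record"),
)

# Decades later, a researcher can search by record type alone,
# without knowing anything about the individual document.
rows = conn.execute(
    "SELECT creator, created_at FROM document_metadata WHERE record_type = ?",
    ("meeting-minutes",),
).fetchall()
print(rows)
```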
The state's system also notes which computer originated the data. "We capture the actual IP address, CPU type and Ethernet adapter. We get the digital fingerprint of that computer," says Jansen. This helps to prove the authenticity of data. In addition, the state computes an MD5 hash of every document, giving it a checksum that can later be used to verify that the data hasn't been altered.
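MD5 reduces a document to a fixed 128-bit digest; if even one byte of the stored copy changes, the digest changes, which is how the integrity check works. A minimal sketch (the document content is invented):

```python
import hashlib

document = b"Meeting minutes, 14 March 2005 ..."

# Digest recorded at ingest time.
original_digest = hashlib.md5(document).hexdigest()

# Years later, re-hash the stored copy and compare.
stored_copy = document
assert hashlib.md5(stored_copy).hexdigest() == original_digest

# Any alteration, however small, produces a different digest.
tampered = document + b"!"
assert hashlib.md5(tampered).hexdigest() != original_digest
```

(MD5 has since been shown vulnerable to deliberately engineered collisions, so newer fixity schemes favor SHA-2 hashes, but the record-and-compare logic is the same.)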
Most data is kept in a standard format: Word documents are turned into PDF files, and images are converted into TIFF files.
Jansen says he is considering using Microsoft's Office 12 and its new XML-based file format as a standard archiving format in the future.
And virtually everyone hopes that standard -- or another one -- will stick. Peterson sums up the 100-year dilemma this way: "There aren't what we'd call standards for long-term archiving -- only best practices."
Before you archive
As organizations struggle with the physical problems associated with archiving, many are also addressing the theoretical underpinnings. They are beefing up their policies around how they classify and store data, partly in response to regulations such as the Sarbanes-Oxley Act.
"Unquestionably, the foundation of any archiving system is strong records management skills," says Adam Jansen, a digital archivist for the state of Washington.
And while the development of products and standards will help companies as they deal with backward compatibility of software and degradation of media, records management is something they can begin to tackle today.
Any archival scheme should start with creating an audit trail to ensure the authenticity of the data, says Jim Damoulakis, CTO at GlassHouse Technologies. The plan should also include categorizing data according to its importance, which can dramatically affect the cost of the systems.
"Without an archiving strategy in place -- and that's common today -- your entire storage infrastructure will be eaten up over time with legacy data," he says. "Going through the exercise of doing some level of data identification and classification is a critical first step."
Mario Carlos, head of IT at Manila Electric Co in the Philippines, says he began to formulate a long-term preservation plan by prioritizing his data. His priorities are based on regulatory requirements, economic feasibility, operational ease, obsolescence, available technology and the difficulty of changing current operations.
To assist in records management, information classification management software and appliances have been emerging over the past year from vendors such as Kazeon Systems, StoredIQ, Arkivio, Index Engines and Scentric.
The technology scans unstructured file data and applies lexicons of keywords to identify likely target documents. For example, the engines can be set to identify data related to compliance with Securities and Exchange Commission regulations, or to earmark data for legal discovery.
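At its core, that kind of matching can be as simple as scanning each file's text against a per-category keyword list — the categories and keywords below are invented for illustration, and the commercial products ship far richer lexicons:

```python
# Illustrative lexicons only; real classification engines use
# much larger, vendor-curated keyword sets.
LEXICONS = {
    "sec-compliance": ["insider", "material disclosure", "10-K"],
    "legal-discovery": ["contract", "settlement", "litigation hold"],
}

def classify(text):
    """Return the set of categories whose keywords appear in the text."""
    lowered = text.lower()
    return {
        category
        for category, keywords in LEXICONS.items()
        if any(kw.lower() in lowered for kw in keywords)
    }

doc = "Draft settlement terms attached; place under litigation hold."
print(classify(doc))  # → {'legal-discovery'}
```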
- Categorize your data according to its importance.
- Create policies about which data to retain and how long to retain it.
- Establish a migration path for your data. Don't upgrade willy-nilly.
- Don't retain data longer than you must.
- Create strong vendor partnerships to defer costs and establish trust.