Keeping paper documents intact for decades has become routine for historians and archivists trying to keep a record of history. Saving digital information, however, is turning out to pose its own set of challenges.
Because digital files depend on their context — such as the operating systems in which they are stored and the applications that create and access them — users typically must keep all elements of that context to access old digital documents. It is not unusual, for example, for companies to keep old computer systems around just to access old files stored on them.
While software developers do offer some measure of backward compatibility for files created with older versions of their software, almost all software developers eventually stop supporting their earlier formats. If you never update your archived documents to your current software format using the appropriate filters, someday you will no longer be able to access those files.
Saving Digital Archives
Eastman Kodak is one company that is working on ways to save digital archives for future generations. Andrew Lawrence, marketing manager at Eastman Kodak’s commercial imaging group, told TechNewsWorld that the realm of digital documents is much more fluid than the world of traditional paper-based documents.
“Electronic records are composed of binary data,” said Lawrence. “To be complete, the e-record must include content, context and structure, the data for which is frequently distributed across multiple sources, tracked as metadata.”
In other words, for people to read this data, technology must first interpret it and lay it out for printing or for display on a screen such as a CRT. “For ongoing availability,” he said, “all of the distributed data and the metadata must be maintained intact.”
This process contrasts directly with paper documents and conventional microfilm records, which are completely self-contained and do not need elaborate translation technologies.
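As a rough illustration of Lawrence's point that a complete e-record needs content, context, structure and metadata, the following Python sketch models a record as four parts and reports which ones are missing. The class name, field names and example values are hypothetical, chosen only for this illustration; they do not describe any Kodak product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ERecord:
    """Hypothetical container for the parts of an electronic record."""

    content: bytes = b""                             # the binary data itself
    context: dict = field(default_factory=dict)      # e.g. creating application, operating system
    structure: dict = field(default_factory=dict)    # e.g. file format, layout information
    metadata: dict = field(default_factory=dict)     # tracking data that ties the parts together

    def missing_parts(self) -> list[str]:
        """Return the names of any parts that are empty."""
        parts = {
            "content": self.content,
            "context": self.context,
            "structure": self.structure,
            "metadata": self.metadata,
        }
        return [name for name, value in parts.items() if not value]

    def is_complete(self) -> bool:
        """A record is treated as complete only when every part is present."""
        return not self.missing_parts()


# Example: the bytes survived, but nothing says how to lay them out.
record = ERecord(content=b"quarterly report", context={"application": "ExampleWriter 2.0"})
print(record.missing_parts())
```

In this sketch a record with intact bytes but no structure or metadata counts as incomplete, which is the situation the article describes for orphaned digital files.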
Growing Problem
In addition, archived paper and microform documents have centuries of legal acceptance behind them, while digital documents do not. Only in the past several years have digital signatures become binding for contracts.
“With each passing day, the reservoir of digital documents grows,” said Lawrence. “Often, there is no associated hard-copy output to archive via conventional means.” The challenge, he said, is to preserve, unchanged, those documents that must survive for very long time periods. “Over time, the problem is that media decays and hardware and software platforms evolve, placing the electronically stored information at risk.”
Lawrence said a referenced archive of permanent document images offers a cost-effective, long-term solution for organizations. By rendering digital information to microfilm as uncoded images, organizations could create technology-proof repositories.
“Those pursuing strategies for archiving digital documents find themselves in a somewhat contradictory situation,” he said. “We want the dynamism of digital processing and manipulation, but we also want the trustworthiness of analog records.” The answer, he said, is to do both in an economical manner that is virtually transparent to the people using the digital archives.
Lawrence said the best approach to creating trustworthy digital archives is to follow a dual track. For short-term needs, users can maintain structured electronic archives in their native formats for a reasonable period of time. For long-term trustworthiness, Lawrence advises users to create a reference archive based on analog renditions of the documents. But even with a reference archive — on CD media or some other archival digital format — users still must deal with the issue of future accessibility.
Easy Conversion
Glenn Widener, director of internet technology at SwiftView, told TechNewsWorld that using easily convertible print formats is the way to go. Specifically, he said the Printer Control Language (PCL) format is a good way to preserve the actual document. PCL, a format invented by Hewlett-Packard for its LaserJet family of printers, is accepted by the majority of printers on the market today.
While Adobe promises to read all back versions of PDF forever, the real issue seems to be single-source formats. “I know of no code to parse and view PDF other than from Adobe,” Widener told TechNewsWorld. “Even big companies like Adobe can disappear.”
According to Widener, there are many PCL viewers available because PCL is easier to read than PDF. “Many PCL viewers can view 15 to 20 years back,” he said. “There will always be commercial tools readily available to read it.”
Archiving Hardware
Dan Schonfeld, director of products for Artesia, a digital asset-management firm, said that many organizations focus on preserving the digital files themselves but not the viewers, readers and players required to do anything with those files.
He pointed out that Artesia’s software lets users archive viewers, readers and players along with files. “Because we can store any type of media, we can actually store applications as well as the media files themselves,” Schonfeld told TechNewsWorld.
Schonfeld’s technology highlights the fact that the archive problem is not just an issue with applications like viewers and readers. Schonfeld said that, at some point, organizations would need specific versions of an operating system to be able to retrieve certain media files. “You could archive the data CDs, but suppose you even required specific configuration settings?” he asked. “So you’d actually need to archive the hardware as well.”
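One way to picture what Schonfeld describes is a manifest that records, for each archived file, the viewer and operating system it depends on, along with a checksum to detect later decay of the stored bytes. The Python sketch below is an assumption-laden illustration, not Artesia's software: `ArchiveEntry`, `make_entry` and `is_unchanged` are hypothetical names, and the viewer and operating system values are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ArchiveEntry:
    """Hypothetical manifest entry tying a file to what is needed to read it."""

    file_name: str
    sha256: str            # fingerprint of the file contents at archive time
    viewer: str            # application needed to view or render the file
    viewer_version: str
    operating_system: str  # platform the viewer is known to run on
    notes: str = ""        # e.g. special configuration settings


def make_entry(file_name, data, viewer, viewer_version, operating_system, notes=""):
    """Build a manifest entry, computing the checksum from the file bytes."""
    digest = hashlib.sha256(data).hexdigest()
    return ArchiveEntry(file_name, digest, viewer, viewer_version, operating_system, notes)


def is_unchanged(entry, data):
    """Return True if the bytes still match the checksum recorded in the entry."""
    return hashlib.sha256(data).hexdigest() == entry.sha256


def to_json(entries):
    """Serialize the manifest as plain JSON text so it stays easy to read later."""
    return json.dumps([asdict(e) for e in entries], indent=2)


# Example with placeholder values.
entry = make_entry("report.pcl", b"example bytes", "ExampleViewer", "1.0", "ExampleOS 5")
print(to_json([entry]))
print(is_unchanged(entry, b"example bytes"))
```

The manifest is stored as plain text on purpose: a record of dependencies is only useful if it can be read without the dependencies it lists.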
Schonfeld said that organizations need to keep track of which applications are used to view or read media files. “That’s actually the hard part,” he said. “Based on a mass of bytes, it can be tough to determine the exact version of software required to view or render a file.”
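To give a sense of what identifying a file "based on a mass of bytes" involves, here is a minimal Python sketch that inspects the first bytes of a file. It relies on two conventions: PDF files begin with a `%PDF-` header that carries a version number, and PCL print jobs commonly begin with an escape sequence such as the printer reset (`ESC E`) or the Universal Exit Language command. The function name `sniff_format` is hypothetical, and the check is deliberately crude: it can suggest the format and format version, but not the exact application version needed to render the file, which is the hard part Schonfeld refers to.

```python
import re

# A PDF header looks like "%PDF-1.4"; the digits are the format version.
PDF_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")
ESC = b"\x1b"


def sniff_format(data: bytes) -> str:
    """Guess a file's format from its leading bytes (illustrative only)."""
    match = PDF_HEADER.match(data)
    if match:
        return "PDF " + match.group(1).decode("ascii")
    # PCL streams commonly open with the Universal Exit Language
    # command or a printer reset; treat either as a hint, not proof.
    if data.startswith(ESC + b"%-12345X") or data.startswith(ESC + b"E"):
        return "PCL (likely)"
    return "unknown"


print(sniff_format(b"%PDF-1.4\n%..."))   # PDF 1.4
print(sniff_format(ESC + b"E" + b"..."))  # PCL (likely)
print(sniff_format(b"plain text"))        # unknown
```

Files with no recognizable header fall through to "unknown", which is why recording the required software at archive time is more reliable than guessing afterward.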
An OCR / microfilm marriage is the solution to the endless obsolescence problem. Electronic documents can be captured on computer output microfilm using OCR-friendly characters. Retrieval is accomplished by scanning the documents on the microfilm rolls and using OCR to convert them to the technology of the day.
It is not exactly true to say that PDF can only be read by Acrobat. The open-source software Ghostview reads PDF, as does the commercial Quickview program by Inso. There are probably others unknown to me, too. The PDF specification has been published for a long time, so anyone can write a parser if they need or want to.