The older bits of the world’s accumulated knowledge, bound together in volumes of printed books and magazines, are slowly disappearing. Out-of-print renditions often disappear forever. Libraries with limited shelf space often replace seldom-used titles with newer tomes. A far smaller portion of printed matter makes it to page-scanning processes for preservation in digital form.
In the race to build a universal digital library, many important books and documents are being left behind: special edition books, religious books, historical documents, and books found in small local libraries or in private collections. Left undigitized, the information inside them will fade as the paper deteriorates.
Despite the best efforts of organizations intent on creating exhaustive digital libraries of all human knowledge, their projects are still too fragmented to produce a reliable, universal, digital repository of all printed goods. Often, corporate decisions and budgetary considerations mean books are left behind.
Google’s Book Project and Project Gutenberg are two of the better-known efforts to convert the printed page to a digitally viewable form. Usually, large libraries and university research directors form alliances to take on the challenge of digitizing their own collections. Their ‘leave no book behind’ mentality is filtering down to smaller businesses with limited revenue, driven by improvements in scanning and storage technologies. This is creating a balance of power, so to speak, that allows those without the reach and capital of Google to join in the digitization movement.
“People are doing this with scanners of all kinds. Hardware is getting cheaper and better. Nowadays, a lot of it is done with digital cameras. They have high enough resolution today to give very good results. It’s almost like going back to the microfiche days,” John Sarnowski, director of The ResCarta Foundation and director of Imaging Products for Northern Micrographics, told TechNewsWorld.
Flawed System
Northern Micrographics is a service bureau that converts paper and film into electronic format. The company has been digitizing printed pages since the early 1980s. In that time span, Sarnowski has observed a big misconception about how the process works. Contrary to popular belief, the job of converting from physical page to digital screen does not end with the scanning or camera image.
“Shooting the pages is only one-fifth of the job,” he explained. “There is a lot of technology built around getting the page numbers and the physical layout to match the original printing. The rest of the job involves getting the metadata right.”
That involves a detailed process of making the pages match. For instance, the pages of the digital version cannot simply be listed under the obscure sequence numbers that the scanner or camera usually assigns to its files, such as 00001.scn. Page inserts, titles, author, and other data have to be coordinated in the finished digital product.
That’s been the problem since day one, and it’s what the technology has to overcome, according to Sarnowski.
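The page-matching step described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the process Northern Micrographics uses: the `PageRecord` type, the `label_pages` helper, and the sample book layout (a title page, two front-matter pages, one unnumbered plate) are all hypothetical. The idea is simply to pair each sequentially named scan file with the label a reader would find on the printed page.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    scan_file: str   # name assigned at capture time, e.g. "00001.scn"
    page_label: str  # label printed on the page, or a descriptive tag


def to_roman(n: int) -> str:
    """Return a lowercase roman numeral, as used for front-matter pages."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def label_pages(scan_files, front_matter_count, inserts=None):
    """Pair each scan file with its printed page label.

    scan_files: file names in the order the pages were shot.
    front_matter_count: how many pages are numbered i, ii, iii, ...
    inserts: {zero-based position: label} for unnumbered pages
             (title pages, plates); these do not advance any counter.
    """
    inserts = inserts or {}
    records = []
    roman = 0
    arabic = 0
    for position, name in enumerate(scan_files):
        if position in inserts:
            label = inserts[position]
        elif roman < front_matter_count:
            roman += 1
            label = to_roman(roman)
        else:
            arabic += 1
            label = str(arabic)
        records.append(PageRecord(name, label))
    return records


# Hypothetical six-image book: title page, two front-matter pages,
# two body pages, and one unnumbered plate between the body pages.
scans = [f"{i:05d}.scn" for i in range(1, 7)]
book = label_pages(scans, front_matter_count=2, inserts={0: "title", 4: "plate"})
for record in book:
    print(record.scan_file, "->", record.page_label)
```

Even in this toy case, the file 00004.scn turns out to be printed page 1, which is why the sequence numbers alone are of little use to a reader. Real projects also have to attach titles, authors and other bibliographic data to the same records.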
Your Way or Mine
Another part of the problem in digitizing printed books into electronic media is the end-user format. There is no standard protocol for viewing digitized conversions so that anybody with access can read them.
For example, in 1994, Sarnowski’s company got involved with Cornell University and the University of Michigan on one of the earliest digital conversion projects in the United States. Optical Character Recognition (OCR) initially cost US$14 per page, but as the technology got better, the cost dropped. At the end of the three-and-a-half-year project, it was down to a few cents per page.
When the company asked school officials how they wanted the data back, the officials responded, “How would you like to send it back?” said Sarnowski.
“There were no standards then. There still aren’t. The library people at Cornell didn’t know how to extract the data out of their database system so we could integrate the digital pages. We had to work out all of those details,” he explained.
Web No Solution
The Internet is not a true solution to providing universal access to a digital book library, either. Standardization does not always exist on the Web.
To see the problem, think of the digitizing process in terms of other technology. For instance, you can put a sound file in an MP3 player anywhere and it works, as there is only one standard. Not so with DVDs. Different parts of the world have regional codecs with their own file formats for video.
The same lack of universal standards plagues those working to create a universal digital book library.
“The big problem at every major research center, including Google, is there is no standard for dealing with digital pages. To this day, we still do not know how Google is storing the book data and what their format is,” Sarnowski said.
Starting From Digital
Some publishers start out in the digital form, so printed books do not have to be converted. While this approach does not solve the problem of saving books left behind, it at least does not add to that problem.
In the case of publishers such as Springer Science + Business Media, authors must now submit their manuscripts in Microsoft Word or a similar software file format. The company publishes all of its collections in both PDF (Portable Document Format) and XML (Extensible Markup Language).
“We did digitize all of our journal collections all the way back to the 1840s. We sent the physical pages to a vendor who made them available digitally through a scanning process. Somebody was inserting the metadata during that process,” George Scotti, global marketing director at Springer, told TechNewsWorld.
Saving Specialties
Springer does not worry about intellectual property theft involving its easy-to-get digital library offerings. The collection is not mainstream reading. Still, it is available on Amazon’s Kindle e-book reader and other such devices.
Springer specializes in publishing scientific research. Since researchers already do most of their work online, the company’s customers are usually familiar with the electronic format, according to Scotti.
“We have a very liberal DRM (digital rights management) policy. Once you buy the content, you can do whatever you want with it. We’ve only had a few cases where it was a problem putting it on a Web site. But it’s not causing us a great deal of concern,” Scotti said.
Everybody Else
Another solution in the digital mix, offered by Atiz.com, could be ideal for small companies and individual authors who want to preserve their printed pages digitally.
As long as the user owns the copyright, there is no legal entanglement, according to Atiz President Nick Warmock. The company’s biggest customers include academic libraries around the world, municipalities for deed registries, students and service bureaus.
Three of Atiz’s products give consumers and small organizations an inexpensive way to decide for themselves what to preserve digitally rather than going through outside services like Gutenberg and Google. In 2006, Warmock partnered with an associate who invented a way to have a mechanical arm turn the pages of books being scanned. The resulting BookDriveDIY (Do It Yourself) includes the cameras, mechanical setup and proprietary software. A related product released in 2007, BookSnap, targets students and others who want to digitize reams of notes. Atiz released BookDrive Pro in January of this year. The product prices range from $1,595 to $15,000.
“We envision one day having a searchable repository for all digitized content. But that hasn’t been worked out yet. The power of such a universal library would be incredible. We’d like to get involved in that project, but too many things would have to be worked out,” Warmock told TechNewsWorld.
Ultimate Solution?
The encumbrances blocking a single set of standards, and the financial costs associated with forming a universal digital library, may be solvable, according to Sarnowski. He heads the ResCarta Foundation, a nonprofit organization established to encourage the development and adoption of a single set of open community standards for digital document warehousing.
Northern Micrographics, partially in conjunction with the foundation, promotes an open source raster format. The company offers open source tools as free downloads in an effort to encourage the use of a standardized data format. The strategy includes working with metadata standards and the same standards the Library of Congress uses.
“We’re fighting for the long-term preservation of data. We’re fighting to stop the loss of original data. It’s been an uphill battle for five years to convince people at large institutions to adopt our system. We’re waging a guerrilla war. We’re saying, do it this way,” said Sarnowski.
Biggest Worry
The digital divide problem may not go away. In fact, Sarnowski worries, it could become worse. “Twenty years from now, when the next generation of storage comes along, we’re going to have to move all this stuff. If you only had a handful of standards, you could run them through a converter to make the move, but that isn’t the case.”