Tuesday, January 24, 2006
Brewster Kahle's Open Library Project pushes imaging envelope -- an alternative to Google
ORIGINAL URL:
http://chronicle.com/weekly/v52/i21/21a03401.htm
The Chronicle of Higher Education -- Information Technology
From the issue dated January 27, 2006
http://chronicle.com
Section: Information Technology / Volume 52, Issue 21, Page A34
Scribes of the Digital Era
A library-scanning project brings public-domain materials online and offers
an alternative to Google's model
By JEFFREY R. YOUNG
The Chronicle of Higher Education
San Francisco
Brewster Kahle is mobilizing an army of Internet-era scribes who are
fastidiously copying books page by page. Unlike the monks who slowly copied
ancient tomes by hand, though, these scribes make digital reproductions, and
they zip through hundreds of pages each hour.
Mr. Kahle, director of the nonprofit Internet Archive, is guiding a
mass-digitization project called the Open Content Alliance, which was announced
in October and is rapidly gaining partners. The alliance plans to take
carefully selected collections of out-of-copyright books from libraries around
the world and turn them into e-books that will be available free to scholars
and anyone else who wants to view them, print them, or even download them to
their own computers.
The project has the backing of Yahoo and Microsoft, and many see it
primarily as a response to the controversial book-scanning project led by
Google (http://print.google.com/googleprint/library.html). Google is
digitizing millions of books from five major libraries, and it says it hopes to
scan nearly every book held by one of those partners, the University of
Michigan at Ann Arbor. Because many of the library's holdings are still
protected by copyright, publishers have challenged the legality of Google's
project.
Although the Open Content Alliance has pledged not to scan copyrighted works
without permission, thereby avoiding that thorny legal issue, the project could
do as much to shake up the library world as Google's effort has. The alliance's
undertaking is more than just a mass-scanning project -- it is a new model for
cooperation among libraries hoping to build their own digital archives of
public-domain materials. Individual libraries have long worked on digitization
projects on their own, but the new alliance promises to pool the digital
content created by academic libraries. "It's a book-scanning initiative and a
vision for an open library," says Mr. Kahle.
Indeed, the alliance involves far more players than Google's project: So far
34 libraries, most of them at universities, have agreed to join and
contribute material. And the Open Content Alliance will make its digital books
more freely available, putting them online in a way that anyone, even companies
other than Yahoo and Microsoft, can index and search the files, or even
download the books for their own use.
One key to achieving the project's goal of scanning hundreds of thousands of
library books is to keep the price of scanning remarkably cheap -- with a charge
to participating libraries of about 10 cents per page -- by scanning the volumes
quickly and accurately. To do that, the project makes use of a specialized
document scanner developed by the Internet Archive and called, appropriately,
the Scribe.
The copying has already begun. In a building in the warehouse district here,
employees of the Internet Archive who operate the book-scanning machines are
working through an initial batch of books selected from the University of
California system. Two more scanning machines are in place at the University of
Toronto, where they run 15 hours a day. The project's leaders hope to have
scanners in more libraries by the end of the year. Each machine costs tens of
thousands of dollars, says Mr. Kahle.
One challenge for libraries, of course, is finding the money to scan large
quantities of books, even at 10 cents per page. Daniel Greenstein, executive
director of the California Digital Library, says he hopes that libraries can
contribute to the project by shifting some of the money they now spend on
digital-book subscriptions to scanning books and adding them to the shared
online collection. Several companies sell access to e-book collections, such as
the Chadwyck-Healey Literature Collections, from the ProQuest Information and
Learning Company.
"We're going to spend the money anyway," Mr. Greenstein says. "Let's spend
it more wisely." The alliance is also trying to entice companies and others to
donate money to the effort, touting the benefits of offering the world's
public-domain literature free to all online. "It will be remembered as one of
the great things that humans have ever done -- up there with the library of
Alexandria, Gutenberg press, and the man on the moon," Mr. Kahle said at a
kickoff event for the project in the fall.
Difficult Work
At the Internet Archive offices one afternoon, Mr. Kahle demonstrates his
book-scanning machine.
The device, about the size of a photo booth, is draped in heavy black cloth,
with a V-shaped stand in the middle to hold a book open. Two high-resolution
cameras are positioned at the top of the machine, one aimed at each page of the
book's spread. The book is pressed open by a V-shaped piece of glass, which the
machine's operator can raise or lower with a foot pedal. After each pair of
pages is scanned, the operator raises the glass, turns the page by hand, and
then lowers the glass back in place. A computer monitor at the back of the
machine shows the cameras' views of the book pages, and the operator can make
sure the text is lined up in the cameras' sights.
Working the machine is not easy. Putting the right amount of pressure on the
foot pedal, so the glass lifts just high enough to turn pages, can be
difficult at first. Mark Johnson, lead engineer for the Internet Archive, says
the employees who spend their days at the machines get into a rhythm that lets
them scan about 500 pages per hour. "They're amazing. If you watch the people
scanning, it's like an athletic sport."
Once the book pages are scanned, a computer attached to the device
automatically creates digital files that can be displayed and searched. The
high-resolution images include any illustrations and even margin notes that are
contained in the original volume. The machine then sends those digital files
to a server, where they are available on a Web site run by the Internet
Archive (http://www.openlibrary.org). Copies of the files will also be sent to
the library that lent the book for scanning.
Mr. Kahle says that the books will be given new life in digital form, and
that they can be displayed in a number of ways. The archive has developed an
on-screen interface that makes it easy to read and search each book. But online
users can also request a printed and bound reproduction of a book by paying a
small fee to a company that does the printing and binding. Soon it may be
possible to print the books in Braille or in large print. They could even be
downloaded to PDA's, cellphones, or other portable devices for reading on the
go.
Rick Prelinger, president of the Internet Archive's Board of Directors, says
that even though the materials scanned by the Open Content Alliance will be
free to view or download online, some companies will find ways to make money
with the digital files. "People will pay for enhanced services" such as
printing, he says. "I think the print-on-demand business is going to do very
well."
Let the Scanning Begin
The University of Toronto's libraries have been working with Mr. Kahle since
before the Open Content Alliance formed, and have scanned more books for the
project than any other participants. On the second floor of one of the
university's libraries, in a room that once housed a computer cluster, two of
the scanning machines are in use seven days a week, staffed by employees hired
by the Internet Archive.
Carole Moore, chief librarian at the university, says each machine scans
about 7,500 pages per day. Several thousand books by Canadian authors have been
scanned so far. The volumes were selected in coordination with six other
Canadian university libraries, and the national Library and Archives Canada.
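
The figures quoted in this article are consistent with one another: an operator rate of about 500 pages per hour, on machines that run 15 hours a day, gives the 7,500 pages per day that Ms. Moore cites. The short Python sketch below simply works through that arithmetic, along with the roughly 10-cents-per-page charge. The function names and the 300-page book are illustrative assumptions, not anything published by the Internet Archive.

```python
# Back-of-the-envelope check of the scanning figures quoted in the article.
# The function names and the 300-page example book are hypothetical.

PAGES_PER_HOUR = 500       # operator rate quoted by Mark Johnson
HOURS_PER_DAY = 15         # daily running time of the Toronto machines
COST_PER_PAGE_CENTS = 10   # approximate charge to participating libraries


def pages_per_day(pages_per_hour: int, hours_per_day: int) -> int:
    """Pages one machine scans in a day at a steady rate."""
    return pages_per_hour * hours_per_day


def scanning_cost_usd(pages: int, cents_per_page: int = COST_PER_PAGE_CENTS) -> float:
    """Cost in dollars of scanning the given number of pages."""
    return pages * cents_per_page / 100


if __name__ == "__main__":
    daily = pages_per_day(PAGES_PER_HOUR, HOURS_PER_DAY)
    print("Pages per machine per day:", daily)
    print("Cost of one machine-day (USD):", scanning_cost_usd(daily))
    print("Cost of a 300-page book (USD):", scanning_cost_usd(300))
```

At those rates, a single machine-day of scanning would cost a participating library about $750.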
Mr. Greenstein, of the California Digital Library, a project of the
University of California system, says he hopes to eventually place scanners at
the University of California system's two regional storage libraries --
warehouselike facilities that are closed to the public but whose books can be
requested through interlibrary loan. Ideally, those storage libraries could
routinely scan each book as it is first deposited, so that patrons could view
the books online instantly rather than have to wait for a printed copy to be
delivered. "We're looking at how much it would cost," says Mr. Greenstein.
Many of the libraries involved in the project have only recently joined and
are still deciding what materials they will contribute. "Every library has some
of those things that no one else has," says Shirley K. Baker, vice chancellor
for information technology and dean of university libraries at Washington
University in St. Louis, which recently joined the alliance. "We have probably
a couple thousand books that are in the public domain that we could digitize
and make publicly available."
Ms. Baker is also interested in digitizing films from the university's
collection to add to the shared online library, including raw footage from Eyes
on the Prize, a well-known documentary on the history of the civil-rights
movement in the U.S. The book-scanning machines won't be necessary for that, of
course, but the Internet Archive has experience digitizing and storing video
and audio files as well, and the archive plans to collect a range of materials
through the Open Content Alliance. "Within this calendar year, we hope to be
contributing at a relatively modest rate, but ramping up over the long run,"
says Ms. Baker.
Hard-to-Capture Materials
José-Marie Griffiths, dean of the School of Information and Library Science
at the University of North Carolina at Chapel Hill, says that her school has
joined the project to experiment with how to better scan manuscripts and
documents that are not in book form. "You can have whole documents, letters,
notes written on fragments of paper," says Ms. Griffiths. "Much of it is
handwritten" and therefore difficult for computers to translate into text form
for searching, she says. "The actual scanning and creating the ability to
search the content is much more challenging for nonprinted, nontypeset
materials."
Librarians from Chapel Hill plan to take a few boxes of such materials to
the Internet Archive soon, she says, to start trying to run them through the
scanners.
Google's book-scanning project, meanwhile, is more restricted, and its
leaders are far more secretive. Google officials have apparently developed a
high-speed book scanner of their own, though they refuse to divulge details of
how it works or say how fast it can scan books. Google also will not say how
many books it has scanned so far from its partner libraries or even describe
the types of books it has added. Such secrecy frustrates many librarians, who
are accustomed to using collections that are carefully delineated. "It is, I
think, important for people to know what they might be able to find," says
Ms. Baker, of Washington University.
Mr. Greenstein says that he has met with Google officials, and that they
seem more interested in grabbing a large quantity of materials than in
carefully selecting certain collections of works. "None of them are
interested in curation," he says, adding that their attitude is "the more of
it, the better." Google is also less open in the way it presents its books. For
those in its collection that are in the public domain, Google allows users to
see the full text, but there is no way to download the data or easily print the
whole book, features that are allowed by the Open Content Alliance.
When asked to respond to those criticisms, Google issued a statement
comparing its scanning project to that of the Open Content Alliance: "We
welcome efforts to make information accessible to the world. The OCA is focused
on collecting out-of-copyright works which constitute a minority of the world's
books -- a valuable minority, but certainly not complete." Google's plan to
scan copyrighted works without permission from their publishers, while the most
unique aspect of its project, is also the most controversial.
Google officials emphasize that only short snippets of copyrighted works
will be shown to users. Still, members of the Association of American
Publishers have filed a copyright-infringement lawsuit against Google in U.S.
District Court, asking the court to prohibit Google from reproducing their
works and to require Google to delete or destroy records already scanned.
Leaders of the Open Content Alliance say they will scan copyrighted books
only if publishers grant permission first. But participants in the Open Content
Alliance are also quick to credit Google with bringing more attention to book
scanning. "We're just providing another model," says Robin Chandler, director
of built content for the California Digital Library.
"Every generation of scholars looks at past events in a new way," she says,
adding that bringing old books into an easily searchable digital format will
help scholars revisit older works and better make comparisons with more recent
texts. "The idea that you can analyze texts over the centuries is very
exciting."
_________________________________________________________________
Copyright © 2006 by The Chronicle of Higher Education
----------------------------------------------------------------
The article above is copyrighted material, the use of which may not have been
specifically authorized by the copyright owner. The material is made available
in an effort to advance understanding of political, economic, democracy, First
Amendment, technology, journalism, community and justice issues, etc. We
believe this constitutes a 'fair use' as provided by Section 107 of U.S.
Copyright Law. In accordance with Title 17 U.S.C. Chapter 1, Section 107, the
material above is distributed without profit to those who have expressed a
prior interest in receiving the included information for research and
educational purposes. If you wish to use copyrighted material from this blog
for purposes beyond fair use, you must obtain permission from the copyright
owner.