
Google Library Project - The Fine Print

As has been reported quite widely, Google has begun a massive digitization project with five libraries:

The total number of books covered by existing agreements is said to be 15 million. Each book is estimated to cost $10 to scan. Stanford's scanning unit is said to be able to do 100,000 pages a day. Oxford's scanning unit is said to be able to do 10,000 books per week. If all five libraries scan at Oxford's speed, then by my math it will take a little over five years to scan them all. Similarly, the University of Michigan says the project will take six years.
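For anyone who wants to check the arithmetic, here is a rough sketch of one way to get that figure. It assumes all five libraries scan at Oxford's reported rate of 10,000 books per week and that they work 52 weeks a year; both are my assumptions for illustration, not anything Google or the libraries have stated. The names in the code are made up for this example.

```python
# Back-of-the-envelope estimate using the figures reported above.
TOTAL_BOOKS = 15_000_000             # books covered by existing agreements
COST_PER_BOOK_USD = 10               # estimated scanning cost per book
LIBRARIES = 5                        # libraries in the project
BOOKS_PER_WEEK_PER_LIBRARY = 10_000  # Oxford's reported rate, assumed for all five
WEEKS_PER_YEAR = 52                  # assumption: scanning runs year-round


def weeks_to_scan(total_books, libraries, books_per_week):
    """Weeks needed if every library scans in parallel at the same rate."""
    return total_books / (libraries * books_per_week)


weeks = weeks_to_scan(TOTAL_BOOKS, LIBRARIES, BOOKS_PER_WEEK_PER_LIBRARY)
years = weeks / WEEKS_PER_YEAR
total_cost = TOTAL_BOOKS * COST_PER_BOOK_USD

print(f"{weeks:.0f} weeks, about {years:.1f} years, ${total_cost:,} total")
```

Under those assumptions the estimate comes out a bit under six years, which sits between "a little over five" and Michigan's six. A slower rate at any one library would push it higher.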

Most agreements indicate that the hosting library will get a digital copy of their books, which apparently they will then host for their users. In addition, Google will throw all the books into its Google Print service.

Some books are already available through the service. For example, Books and Culture is an out-of-copyright book from 1896. Note that unlike a publisher-submitted book, you can easily link to or view any page: the cover, the University of Michigan bookplate, page 50, the U of M checkout slip, the back cover. You can also search the full text, which leads to a standard Google results page with links and snippets. Click on any of the links and the resulting page will highlight your search terms, just like Google Catalog.

Sadly, it seems the only thing not available is the full text of the books. However, it is pretty easy to get the underlying images of the pages (though not as easy as simply looking at the page, alas), so one could certainly OCR them oneself, although the result would likely not be as good as Google's work. Things look much worse for in-copyright books. For example, The Role of GATT in Relation to Trade and Development was only published in 1964 and is apparently in copyright. One can thus only get back practically useless snippets while the fat-cats at Google have the whole thing.

Fortunately, "Google is negotiating with various publishers to facilitate arrangements to make works more easily accessible while providing appropriate protections for copyright holders" for in-copyright library books. It will be interesting to see how much success they have. It's not clear how to search Google for just library books, or even just books, or to find out how many they have, but here are the handful I know about, all from U. of M. (books published after 1923 are copyrighted):

Do you hold the copyright on a book? Does your book have an ISBN? If you answered yes to both these questions, you don't have to wait for all this. You can simply sign up to Google Print, send Google a copy of your book, and they'll scan it in and OCR it for you for free! Then they'll send you checks with all the money your book makes through ads! So please do it! Please?

A closing thought. Much of the discussion around this endeavor has focused on its effect on the largely affluent and privileged children who go to the major universities from which the books are taken. Will they stop going to the library? Will they miss the smell of dead trees? Will they be able to do research more efficiently? With all due respect, this is the wrong group to think about. The real beneficiaries of this scanning should be the less fortunate people around the world who barely have access to a library, let alone a world-class one. Let us scan these books for them.

By Aaron Swartz of Google Weblog