Economics of Book Digitization

March 22nd, 2009 by jeff kaplan

Digitizing books still has some challenges, but I believe the economics of it are clear.  Nonetheless, some misunderstandings persist. I’d like to review some of the most basic facts about book digitization that I’ve learned over the past seven or so years.

Most attention is paid to the cost of scanning (photographing the pages and processing them), but I cannot emphasize enough that the greatest costs of building a digital library are those borne by the brick-and-mortar libraries.  Libraries spend billions each year building, curating, and maintaining their collections.  So, the real value, and costs, are in the books and the libraries. This aspect is too often overlooked and undervalued.

As for the cost of scanning books, let’s look at some numbers.

  • The Million Books Project in China cost around $6/book.
  • I estimate that Google’s library project costs well below $10/book, maybe as low as $5/book.
  • The Internet Archive scans books at a cost of 10 cents/page or $30/book. It is more expensive but you get superior quality–I may be biased, but check it out–and that cost also covers periodically reprocessing the books based on new techniques and technologies as well as perpetual storage.
  • All of these projects produce page images for reading, optical character recognition for searching, and access formats like pdf and on-screen viewing.

As for the number of books that have been scanned:

  • Google is now presenting 7 million scanned books, which I would estimate to represent a $35-70 million project. (They have likely scanned many more books than those they are presenting.)
  • China’s government has scanned 1.4 million books for $9 million.  They have told me they are going to scan another 3 million books starting this summer.
  • India’s government has scanned 600,000 – 1 million books, but I don’t have any indication of their costs.
  • The U.S. government has scanned probably fewer than 100,000 books. Clearly, the US government has a “scanning gap” relative to other governments.
  • U.S. funders such as the Sloan Foundation, Microsoft, and Yahoo together helped the Internet Archive and Kirtas to scan 600,000 books, for about $14 million.
  • There are now nearly 1.3 million public domain books from various projects on, which are full-text searchable on
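The totals in this list can be cross-checked against the per-book figures with a little arithmetic. The sketch below uses only the numbers quoted in this post; the dictionary keys and variable names are illustrative labels, not official project names.

```python
# Figures quoted in the post: (books scanned, total cost in US dollars).
projects = {
    "China (government)": (1_400_000, 9_000_000),
    "Internet Archive and Kirtas (funded)": (600_000, 14_000_000),
}

def cost_per_book(books, total_cost):
    """Return the implied average cost per book in dollars."""
    return total_cost / books

implied = {name: cost_per_book(b, c) for name, (b, c) in projects.items()}

# Google: 7 million books presented, at an estimated $5 to $10 per book.
google_books = 7_000_000
google_low = google_books * 5
google_high = google_books * 10

for name, value in implied.items():
    print(f"{name}: ${value:.2f}/book")
print(f"Google: ${google_low // 1_000_000}-{google_high // 1_000_000} million")
```

China's implied cost comes out at roughly $6.43/book, in line with the "around $6/book" figure, and the funded Internet Archive and Kirtas scanning works out to roughly $23.33/book, somewhat under the $30/book rate quoted earlier.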

So, putting these two sets of figures together, the #1 takeaway from my adventures in book digitization is that building a great library of digital books the size of Harvard or the Library of Congress would require a one-time cost of $300M, for the highest quality scans. $300 million is a small price tag in the scheme of things. As federal spending goes, it’s a drop in the bucket (remember the $231 million Bridge to Nowhere?).
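The $300 million figure follows from the Internet Archive rate of $30/book if such a library holds roughly 10 million books. That book count is an assumption in this sketch (it matches the 10 million-book target mentioned later in this post), not a stated holdings figure for Harvard or the Library of Congress.

```python
# Internet Archive rates quoted in the post, kept in whole cents and dollars
# to avoid floating-point rounding.
COST_PER_PAGE_CENTS = 10
COST_PER_BOOK_DOLLARS = 30

# Assumption: a Harvard- or Library-of-Congress-sized collection is taken
# to be about 10 million books.
BOOKS = 10_000_000

# The two quoted rates imply an average book length.
pages_per_book = (COST_PER_BOOK_DOLLARS * 100) // COST_PER_PAGE_CENTS

# One-time cost at the highest-quality rate.
total_cost = COST_PER_BOOK_DOLLARS * BOOKS

# For comparison, the same collection at the $5 to $10 per book estimates.
low_cost, high_cost = 5 * BOOKS, 10 * BOOKS

print(f"Implied pages per book: {pages_per_book}")
print(f"Cost at $30/book: ${total_cost // 1_000_000} million")
print(f"Cost at $5-10/book: ${low_cost // 1_000_000}-{high_cost // 1_000_000} million")
```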

The US library system costs $12 billion a year (with $3-4 billion of that going to publishers’ products). To give just one example, Cornell’s library has an annual budget of $55 million.

I believe that if just 100 top libraries in the US were to put 5% of their acquisition budgets into digitizing, we could have a 10 million-book digital library done in about 5 years.
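To see what this proposal would ask of each library, the arithmetic can be run backwards from the $300 million estimate. This is a rough sketch: it assumes the $30/book rate and an even split across libraries and years, and the acquisition budget it derives is an implied average, not a figure reported in this post.

```python
TARGET_COST = 300_000_000   # dollars, from the $30/book estimate for 10 million books
LIBRARIES = 100
YEARS = 5
SHARE_PERCENT = 5           # share of each acquisition budget put into digitizing

# Spending needed from each library each year (assumes an even split).
per_library_per_year = TARGET_COST // (LIBRARIES * YEARS)

# Average acquisition budget at which 5% covers that spending.
implied_acquisition_budget = per_library_per_year * 100 // SHARE_PERCENT

# Books scanned per year across all libraries at $30/book.
books_per_year = (per_library_per_year * LIBRARIES) // 30

print(f"Per library per year: ${per_library_per_year:,}")
print(f"Implied average acquisition budget: ${implied_acquisition_budget:,}")
print(f"Books scanned per year: {books_per_year:,}")
```

Under these assumptions each library would contribute about $600,000 a year, which corresponds to an average acquisition budget of about $12 million.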

We now have over 3 million books in the growing public digital libraries. This is an alternative to the private single-access digital library Google is building.

We can build something great if we keep focused on the dream–a library and publishing system that enables communities to thrive through the meaningful sharing of works.



Does Richard Sarnoff Think the Google Settlement Is Anti-Competitive?

February 24th, 2009 by mary murrell

According to Ars Technica, Richard Sarnoff, the chairman of the Association of American Publishers, seems to have admitted in a public presentation at Princeton University that the Google Book Settlement is anti-competitive. The piece reports that …

Sarnoff said that the publishers he represents didn’t set out to create a monopoly in the markets for book search engines or online book sales. But he didn’t deny that the settlement could have that effect. After all, he noted, “copyright itself is a monopoly.”…

Sarnoff said that the structure of the registry will be “tough to replicate for [Google’s] competitors.”

and, finally,

Sarnoff also speculated that … [l]egal hurdles may make it infeasible for any other firms to build a search engine comparable to Google Book Search.

Is the Settlement itself one of these legal hurdles?

It’s All About the Orphans

February 23rd, 2009 by jeff kaplan

The Internet Archive first used the term “orphan” to describe books that are no longer commercially viable (“out of print”), that are still in copyright, and whose ownership is either impossible or extremely difficult to determine. In 2004 Larry Lessig, Rick Prelinger, and I brought a suit to make it easier for orphans to enter the public domain (Kahle vs. Gonzales). As that case was proceeding, the Copyright Office held hearings and issued a report, which led to proposed orphan works legislation in both the House and Senate.

While that legislation has been wending its way through the Capitol Hill meat grinder, it turns out that Google, the AAP, and the Authors Guild were negotiating their own private solution to the problem of orphan works. On a close reading of the proposed Google Book Settlement, it becomes clear that the dizzyingly complex agreement is, in essence, an elaborate scheme for the exploitation of orphan works. The class action mechanism allows the Authors Guild (8,500 members) and the AAP (260 members) to extrapolate themselves to include millions of unfindable and unknowable rightsholders of orphan works.  It is to this end–the certification of a class that includes the orphans–that the parties need the blessing of the court.

The upshot, if the Settlement is approved, would be legal protection for Google, and only for Google, to scan and provide digital access to the orphan works. Presto! Like magic, Google proceeds without any need for legislation: their own private orphan works legislation.

So, should the Settlement be approved, Google will be handed exclusive access to the orphans, and the public loses out. With orphan works legislation, orphan works could have been opened up to digitization by anyone: not just Google but competitors to Google, libraries, Open Content Alliance partners, and others. Now, however, no one but Google will have access to the orphan class created by the Settlement, without enduring a similar class action lawsuit from the authors and publishers.

I, personally, am amazed at this creative use of class action law. The three parties have managed to skirt copyright law, bypass legislative efforts, and feather their own nests–all through the clever use of law intended to remedy harms.

This Settlement, if approved by the judge, will accomplish things appropriate to a legislative body, not to private corporate board rooms. Let’s live under the rule of law, as arduous as that might be, and free the orphans, legitimately, not for one corporation but for all of us.



Google Books Acquisition Division: gBAD

February 18th, 2009 by brewster

Please forgive a light post on a heavy subject.

I went to a meeting of people concerned about the sweeping nature of the Google-AAP-Authors Guild settlement.  My favorite interaction was when the group was trying to figure out what the “Books Rights Registry” really is (BRR sounds so benign).

Since the settlement immunizes only Google’s scanning, lives off of Google’s money for the foreseeable future, and helps to find more things for Google to scan, this name was proposed:

“Google Books Acquisition Division.”   Pretty great.   Another added that the acronym is funny as well:  gBAD.


Over Half of All Yiddish Literature Now Online

February 7th, 2009 by brewster

As discussed at the October “Using Digital Collections” meeting in San Francisco, the Yiddish collection is now online, and its launch was announced in the New York Times.

Magic Untapped

January 26th, 2009 by jeff kaplan

Peter Brantley has written an inspired post on what’s really wrong with the Google Settlement: it lacks imagination.

An excerpt (but read the whole thing):

The settlement describes a world of time past, not a world of possibilities. … Let us imagine an alternative world where children routinely carry Alexandria in their hands. Where they experience works of literature as games, pushing at the borders of their knowledge and experience by engaging the library with others as a festschrift…. Let us say: we want our citizens to remake these books. We shall allow unceasing access to all books within our libraries; there shall be no barriers between them. Read the rest of this entry »

A Monopoly dressed in a Class-action Suit?

January 25th, 2009 by brewster

Dan Clancy, head of Google Book Search, presented and took questions at the American Library Association conference on January 24, 2009.
Read the rest of this entry »

Is OCLC Reconsidering its Proposed Records Policy?

January 14th, 2009 by jeff kaplan

In a press release dated January 13, OCLC announced the creation of a Review Board to advise OCLC on the principles and best practices for sharing library data. At the same time, the proposed records policy effective date has been put off until the third quarter of 2009. Read the rest of this entry »

A Raw Deal for Libraries

December 6th, 2008 by jeff kaplan

One of the most surprising, even shocking, features of the Google-AAP-Authors Guild Settlement is how hard it is on libraries. Given that Google Book Search could not have gotten off the ground without the cooperation of various university libraries, it is particularly disheartening that the proposed settlement treats them with such an iron fist at the same time as it expects them to foot much of the bill through subscriptions. It will be interesting to see how many libraries continue as partners, given Google’s bait-and-switch. Read the rest of this entry »

Libraries: We Need Them More than Ever

December 1st, 2008 by mary murrell

Marjorie Kehe at the Christian Science Monitor reminds us of the importance of libraries, especially in tough times.

Recommended Changes to Google Book Search Settlement

November 24th, 2008 by mary murrell

New York Law School professor James Grimmelmann has written an impressive, even-handed blog post about the Google-AAP-Authors Guild settlement in which he lays out five principles to guide the court and the public. He closes with fourteen “recommendations” to the court:

Read the rest of this entry »

A Useful Guide to Google Settlement

November 20th, 2008 by mary murrell

The Association of Research Libraries (ARL) and the American Library Association (ALA) have released a useful 22-page summary of the key points of the Settlement entitled “A Guide for the Perplexed: Libraries and the Google Library Project Settlement” and written by Jonathan Band, JD. From the ARL website:

The guide is designed to help the library community better understand the terms and conditions of the recent settlement agreement between Google, the Authors Guild, and the Association of American Publishers concerning Google’s scanning of copyrighted works. … The guide outlines and simplifies the settlement’s provisions, with special emphasis on the provisions that apply directly to libraries.

The guide doesn’t evaluate, criticize, or take a stand toward the Settlement, but it is a thoughtful and careful guide.


New Contributors to Open Content Alliance collections

November 9th, 2008 by linda frueh

Through all the changes in 2008, libraries and other cultural institutions continue to contribute their works to universally accessible open collections.  Fifty-one new institutions have joined the ranks of our community! Read the rest of this entry »

Philadelphia Mayor Closing 11 of 54 Branches

November 7th, 2008 by linda frueh

The Library Journal reports that 11 branches of the Philadelphia public library system will be closed because of financial hardship.  This is the latest painful signal of change in the library community, and bodes ill for continued broad public access to library materials.  We see a potential role for the Open Content Alliance to step into this gap by sharing its diverse collections through lending of digital books and free, unrestricted access to materials that are out of copyright. Read the rest of this entry »

Let’s Not Settle for this Settlement

November 5th, 2008 by jeff kaplan

Rather than accept the Google settlement with publishers and authors as a fait accompli, or as an obligatory blueprint for the future, the appropriate response is to consider its implications for the future and take all steps to build the world we want to live in. Although the settlement may solve some immediate problems for the parties to the lawsuit, and perhaps some of the contributing libraries who have enabled it, we should not assume that Google Book Search is the only way, or even the best way, to organize and make available our cultural heritage.

This post will outline some of the issues.  The next step is to build an appropriate response, on which we welcome input.  Losing access to and control of our cultural heritage as part of a digitization wave is not acceptable.

At its heart, the settlement agreement grants Google an effective monopoly on an entirely new commercial model for accessing books. It re-conceives reading as a billable event.  This reading event is therefore controllable and trackable.  It also forces libraries into financing a vending service that requires they perpetually buy back what they have already paid for over many years of careful collection.

Read the rest of this entry »

New OCLC Records Policy Generates Debate

November 5th, 2008 by linda frueh

OCLC has announced a proposed new records policy to take the place of its guidelines, effective mid-February 2009.  We understand that there was a version of the policy published on November 2 that was hastily withdrawn in the face of great member pressure.  Terry’s Worklog has an excellent ongoing discussion and analysis of the proposal’s implications. Read the rest of this entry »

More Musings from the Community on the Google Settlement

November 5th, 2008 by linda frueh

Many OCA denizens are weighing in on the proposed settlement as they have more time for analysis.

Read the rest of this entry »

Harvard University Libraries Opt Out of Google Books Settlement

November 1st, 2008 by linda frueh

“As we understand it, the settlement contains too many potential limitations on access to and use of the books by members of the higher-education community and by patrons of public libraries,” Harvard’s university-library director, Robert C. Darnton, wrote in a letter to the library staff. Read the rest of this entry »

Using Digital Collections Workshop Attracts Advocates for Open Access

November 1st, 2008 by linda frueh

The Internet Archive and Open Content Alliance (OCA) sponsored a two-day workshop in San Francisco on October 27-28 to share progress and plans for the continued growth of open web access to digital books. The meeting featured scholars discussing the power of complete digital collections in the disciplines of Classics, Yiddish Studies and Biodiversity, as well as updates by OCA contributors on their scanning and access activities.

Two new services to be launched through the website were debuted at the meeting: Scan on Demand and Print on Demand.

Read the rest of this entry »

Open Access Digital Library Summit Held in Boston

October 15th, 2008 by linda frueh

The Boston Library Consortium, in cooperation with the Alfred P. Sloan Foundation, sponsored a Summit Meeting on September 24-25 for the purpose of discussing and promoting an agenda of open access for library resources. The Summit featured university provosts, faculty and non-profit advocacy organizations, as well as an introductory message of support from Senator Christopher Dodd (D-Connecticut).

Read the rest of this entry »