Digitizing books still has some challenges, but I believe the economics of it are clear. Nonetheless, some misunderstandings persist. I’d like to review some of the most basic facts about book digitization that I’ve learned over the past seven or so years.
Most attention is paid to the cost of scanning (photographing the pages and processing them), but I cannot emphasize enough that the greatest costs of building a digital library are those borne by the brick-and-mortar libraries. Libraries spend billions each year building, curating, and maintaining their collections. So, the real value, and costs, are in the books and the libraries. This aspect is too often overlooked and undervalued.
As for the cost of scanning books, let’s look at some numbers.
- The Million Books Project in China cost around $6/book.
- I estimate that Google’s library project costs well below $10/book, maybe as low as $5/book.
- The Internet Archive scans books at a cost of 10 cents/page, or about $30/book. It is more expensive, but you get superior quality (I may be biased, but check it out), and that cost also covers perpetual storage and periodically reprocessing the books as new techniques and technologies appear.
- All of these projects produce page images for reading, optical character recognition for searching, and access formats like PDF and on-screen viewing.
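As a rough check on how these per-book figures relate, here is a minimal sketch in Python. The dollar amounts come from the list above; the implied page count and the price multiples are derived here and are not figures stated in this post, and the variable and function names are purely illustrative.

```python
# Per-book scanning costs in US cents, taken from the list above.
# Google's figure is the author's own estimate, so it is a (low, high) range.
COST_PER_BOOK_CENTS = {
    "Million Books Project": (600, 600),
    "Google (estimated)": (500, 1000),
    "Internet Archive": (3000, 3000),
}
IA_COST_PER_PAGE_CENTS = 10  # "10 cents/page"

# Average page count implied by the Internet Archive's two prices.
# This is derived, not a figure the post states.
ia_cents, _ = COST_PER_BOOK_CENTS["Internet Archive"]
implied_pages_per_book = ia_cents // IA_COST_PER_PAGE_CENTS


def price_multiple(project):
    """Return how many times the Internet Archive price exceeds `project`'s,
    as a (smallest, largest) pair across that project's cost range."""
    low, high = COST_PER_BOOK_CENTS[project]
    return ia_cents / high, ia_cents / low


if __name__ == "__main__":
    print(f"Implied pages per book: {implied_pages_per_book}")
    for name in ("Million Books Project", "Google (estimated)"):
        smallest, largest = price_multiple(name)
        print(f"Internet Archive vs {name}: {smallest:.0f}x to {largest:.0f}x")
```

Working in whole cents keeps the arithmetic exact. The multiples only compare sticker prices; they say nothing about what each price includes, such as reprocessing and perpetual storage.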
As for the number of books that have been scanned:
- Google is now presenting 7 million scanned books, which I would estimate to represent a $35-70 million project. (They have likely scanned many more books than those they are presenting.)
- China’s government has scanned 1.4 million books for $9 million. They have told me they are going to scan another 3 million books starting this summer.
- India’s government has scanned 600,000 – 1 million books, but I don’t have any indication of their costs.
- The U.S. government has scanned probably fewer than 100,000 books. Clearly, the U.S. government has a “scanning gap” relative to other governments.
- U.S. foundations and companies such as the Sloan Foundation, Microsoft, and Yahoo together helped the Internet Archive and Kirtas scan 600,000 books, for about $14 million.
- There are now nearly 1.3 million public domain books from various projects on archive.org, which are full-text searchable on openlibrary.org.
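The totals in this list can be turned into average per-book costs with a few lines of arithmetic. The sketch below assumes nothing beyond the figures quoted above; the helper names are illustrative, and projects with no stated cost (India, the U.S. government) are left out rather than guessed.

```python
# Totals quoted in the list above: (books scanned, dollars spent).
# Only projects with both figures are included.
PROJECTS = {
    "China (government)": (1_400_000, 9_000_000),
    "Internet Archive and Kirtas": (600_000, 14_000_000),
}


def cost_per_book(books, dollars):
    """Average dollars spent per scanned book."""
    return dollars / books


def project_cost_range(books, low_per_book, high_per_book):
    """Total cost range for `books` at a per-book cost range."""
    return books * low_per_book, books * high_per_book


unit_costs = {name: cost_per_book(*totals) for name, totals in PROJECTS.items()}

# Google: 7 million books at the author's estimated $5-10 per book.
google_range = project_cost_range(7_000_000, 5, 10)

if __name__ == "__main__":
    for name, usd in unit_costs.items():
        print(f"{name}: ${usd:.2f} per book")
    low, high = google_range
    print(f"Google (estimated): ${low:,} to ${high:,}")
```

These are simple averages over each project's stated totals, so they smooth over any differences in book length, quality, or what the spending covered.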
So, putting these two sets of figures together, the #1 takeaway from my adventures in book digitization is that building a great library of digital books on the scale of Harvard’s library or the Library of Congress would require a one-time cost of about $300 million for the highest quality scans (roughly 10 million books at $30/book). $300 million is a small price tag in the scheme of things. As federal spending goes, it’s a drop in the bucket (remember the $231 million Bridge to Nowhere?).
The US library system costs $12 billion a year (with $3-4 billion of that going to publishers’ products). To give just one example, Cornell’s library has an annual budget of $55 million.
I believe that if just 100 top libraries in the US were to put 5% of their acquisition budgets into digitizing, we could have a 10 million-book digital library done in about 5 years.
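To see what this proposal implies per library, here is a back-of-the-envelope sketch. It assumes the $30/book highest-quality figure quoted earlier and an even split across libraries and years. The per-library numbers it produces are derived here, not figures stated in this post, and a cheaper per-book cost would lower them proportionally.

```python
TARGET_BOOKS = 10_000_000  # "10 million-book digital library"
COST_PER_BOOK_USD = 30     # highest-quality figure quoted earlier (assumption)
LIBRARIES = 100
YEARS = 5
SHARE_PERCENT = 5          # share of each library's acquisition budget

# One-time cost of the whole collection.
total_cost = TARGET_BOOKS * COST_PER_BOOK_USD

# Even split across libraries and years (a simplifying assumption).
per_library_per_year = total_cost // (LIBRARIES * YEARS)

# Average acquisition budget a library would need for 5% of it to cover
# its share. Derived here; the post does not state this number.
required_acquisition_budget = per_library_per_year * 100 // SHARE_PERCENT

# Scanning throughput needed to finish on schedule.
books_per_year = TARGET_BOOKS // YEARS

if __name__ == "__main__":
    print(f"Total cost: ${total_cost:,}")
    print(f"Per library per year: ${per_library_per_year:,}")
    print(f"Implied average acquisition budget: ${required_acquisition_budget:,}")
    print(f"Books scanned per year: {books_per_year:,}")
```

Changing `COST_PER_BOOK_USD` or `YEARS` shows how sensitive the plan is to scan cost and schedule.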
We now have over 3 million books in the growing public digital libraries. This is an alternative to the private single-access digital library Google is building.
We can build something great if we keep focused on the dream: a library and publishing system that enables communities to thrive through the meaningful sharing of works.