Economics of Book Digitization
Digitizing books still has some challenges, but I believe the economics of it are clear. Nonetheless, some misunderstandings persist. I’d like to review some of the most basic facts about book digitization that I’ve learned over the past seven or so years.
Most attention is paid to the cost of scanning (photographing the pages and processing them), but I cannot emphasize enough that the greatest costs of building a digital library are those borne by the brick-and-mortar libraries. Libraries spend billions each year building, curating, and maintaining their collections. So, the real value, and costs, are in the books and the libraries. This aspect is too often overlooked and undervalued.
As for the cost of scanning books, let’s look at some numbers.
- The Million Books Project in China cost around $6/book.
- I estimate that Google’s library project costs well below $10/book, maybe as low as $5/book.
- The Internet Archive scans books at a cost of 10 cents/page or $30/book. It is more expensive, but you get superior quality (I may be biased, but check it out), and that cost also covers perpetual storage and periodically reprocessing the books as new techniques and technologies emerge.
- All of these projects produce page images for reading, optical character recognition for searching, and access formats like pdf and on-screen viewing.
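As a quick sanity check on the Internet Archive numbers, the per-page and per-book prices can be reconciled with one division. This is only a sketch: the 300-page average it produces is implied by the two quoted prices, not a figure stated in the post, and the function name is made up for illustration.

```python
# Internet Archive prices quoted above, in cents to keep the math exact.
COST_PER_PAGE_CENTS = 10      # 10 cents/page
COST_PER_BOOK_CENTS = 3000    # $30/book


def implied_pages_per_book(cost_per_book_cents: int, cost_per_page_cents: int) -> int:
    """Average page count at which the per-page and per-book prices agree."""
    return cost_per_book_cents // cost_per_page_cents


pages = implied_pages_per_book(COST_PER_BOOK_CENTS, COST_PER_PAGE_CENTS)
print(f"Implied average book length: {pages} pages")
```

In other words, the $30/book figure assumes a book of roughly 300 pages; shorter or longer books would cost proportionally less or more at the per-page rate.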
As for the number of books that have been scanned:
- Google is now presenting 7 million scanned books, which I would estimate to represent a $35-70 million project. (They have likely scanned many more books than those they are presenting.)
- China’s government has scanned 1.4 million books for $9 million. They have told me they are going to scan another 3 million books starting this summer.
- India’s government has scanned 600,000 - 1 million books, but I don’t have any indication of their costs.
- The U.S. government has scanned probably fewer than 100,000 books. Clearly, the US government has a “scanning gap” relative to other governments.
- U.S. funders such as the Sloan Foundation, Microsoft, and Yahoo together helped the Internet Archive and Kirtas to scan 600,000 books, for about $14 million.
- There are now nearly 1.3 million public domain books from various projects on archive.org, which are full-text searchable on openlibrary.org.
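The totals in this list can be turned back into per-book costs and compared with the rates quoted earlier. A minimal sketch follows, using only the dollar amounts and book counts given above; the dictionary labels and the helper name are illustrative, and the results are rough averages rather than audited costs.

```python
def cost_per_book(total_dollars: int, books: int) -> float:
    """Average cost per book for a project, in dollars."""
    return total_dollars / books


# (total spent in dollars, books scanned), as quoted in the list above.
projects = {
    "China (government)": (9_000_000, 1_400_000),
    "Internet Archive and Kirtas (funder-supported)": (14_000_000, 600_000),
}

for name, (dollars, books) in projects.items():
    print(f"{name}: ${cost_per_book(dollars, books):.2f}/book")

# Google: 7 million books at the estimated $5-10/book range.
GOOGLE_BOOKS = 7_000_000
google_low = 5 * GOOGLE_BOOKS
google_high = 10 * GOOGLE_BOOKS
print(f"Google: ${google_low / 1e6:.0f}-{google_high / 1e6:.0f} million")
```

The China figure works out to a little over $6/book, in line with the Million Books Project rate, and the Google range reproduces the $35-70 million estimate.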
So, putting these two sets of figures together, the #1 takeaway from my adventures in book digitization is that building a great library of digital books the size of Harvard or the Library of Congress would require a one-time cost of $300M, for the highest quality scans. $300 million is a small price tag in the scheme of things. As federal spending goes, it’s a drop in the bucket (remember the $231 million Bridge to Nowhere?).
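To see where a figure like $300M can come from, here is a rough sketch that prices a collection at each of the per-book rates quoted earlier. The 10 million-book collection size is an assumption on my part for illustration (it is the size that makes $30/book come out to $300M), and the labels and function name are made up for the example.

```python
COLLECTION_SIZE = 10_000_000  # assumed size of a "Harvard-scale" collection

# Per-book rates in dollars, taken from the cost list earlier in the post.
rates = {
    "Internet Archive ($30/book)": 30,
    "Google, high estimate ($10/book)": 10,
    "Google, low estimate ($5/book)": 5,
}


def one_time_cost(books: int, dollars_per_book: int) -> int:
    """Total one-time scanning cost in dollars."""
    return books * dollars_per_book


for label, rate in rates.items():
    total = one_time_cost(COLLECTION_SIZE, rate)
    print(f"{label}: ${total / 1e6:.0f}M")
```

Under that assumption, $300M is the ceiling for the highest quality scans, and the same collection at the lower rates would cost between $50M and $100M.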
The US library system costs $12 billion a year (with $3-4 billion of that going to publishers’ products). To give just one example, Cornell’s library has an annual budget of $55 million.
I believe that if just 100 top libraries in the US were to put 5% of their acquisition budgets into digitizing, we could have a 10 million-book digital library done in about 5 years.
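The proposal can be worked backwards to see what it asks of each library. The sketch below takes the numbers in the sentence above (100 libraries, 5% of acquisition budgets, 10 million books, 5 years) and solves for the average acquisition budget each library would need. The per-book rates are the ones quoted earlier; the resulting budgets are implied by the arithmetic, not figures I have for any particular library, and the function name is hypothetical.

```python
LIBRARIES = 100
SHARE_PERCENT = 5          # share of each acquisition budget spent on digitizing
TARGET_BOOKS = 10_000_000
YEARS = 5


def required_acquisition_budget(dollars_per_book: int) -> int:
    """Average yearly acquisition budget per library needed to hit the target."""
    total_cost = TARGET_BOOKS * dollars_per_book
    # total_cost = LIBRARIES * YEARS * budget * SHARE_PERCENT / 100, solved for budget
    return total_cost * 100 // (LIBRARIES * YEARS * SHARE_PERCENT)


for rate in (30, 10, 5):
    budget = required_acquisition_budget(rate)
    print(f"At ${rate}/book: ${budget / 1e6:.0f}M average acquisition budget per library")
```

At $30/book the plan needs an average acquisition budget of about $12M per library per year, which means each library would put about $600,000 a year into scanning; at the lower rates the required budget drops to $2-4M.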
We now have over 3 million books in the growing public digital libraries. This is an alternative to the private single-access digital library Google is building.
We can build something great if we keep focused on the dream: a library and publishing system that enables communities to thrive through the meaningful sharing of works.
Tags: digital libraries, publishing
March 23rd, 2009 at 3:09 pm
[...] Brewster begins the post: Digitizing books still has some challenges, but I believe the economics of it are clear. Nonetheless, some misunderstandings persist. I’d like to review some of the most basic facts about book digitization that I’ve learned over the past seven or so years. [...]
March 24th, 2009 at 10:17 am
[...] Kahle, over at the Open Content Alliance, has an interesting post about the cost of digitizing books. His overall take: it’s very cheap, especially relative to the cost of maintaining brick [...]
March 29th, 2009 at 8:06 pm
yes. our librarians have thus far demonstrated
a lack of will which is astounding and pathetic…
-bowerbird
April 5th, 2009 at 1:10 am
[...] Economics of Book Digitization from the Open Content Alliance blog. [...]
April 12th, 2009 at 5:03 am
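[...] The Open Content Alliance (OCA) blog has compiled some figures on book digitization, such as scanning costs and the number of volumes scanned. The author argues, however, that none of these costs compare to what a library spends on its buildings and on acquiring and maintaining its collection, yet the latter are often overlooked or underestimated. Here are the figures (those from outside the OCA are estimates): [...]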
April 15th, 2009 at 6:45 am
Do you have any details on the Chinese and Indian government scanning efforts? Where are these books available?
April 15th, 2009 at 6:48 am
Some of your captchas are UNREADABLE! … Others are crystal-clear