Understanding & Assessing The Million Book Project Denise Troll Covey Associate Dean, University Libraries, Carnegie Mellon Pennsylvania Library Association Conference Pittsburgh, PA – October 5, 2003 Million Book Project Vision “Attempt to understand & solve the technical, economic, & social policy issues of providing online access to all creative works of the human race.” – Dr. Raj Reddy What is the Million Book Project? Effort to digitize & provide full-text searching & free-to-read access to a million books by 2007 Collection development Copyright permissions Acquisitions & shipping Scanning operations Proposal writing Why is the Million Book Project? Democratize knowledge & empower citizenry Address disparity in library size & accessibility Facilitate new knowledge Combining old & new, east & west, technical & humanistic Enhance student learning & success of faculty research Address copyright absurdities Why is the Million Book Project? Support digital library research Information distribution, management, & sustainability Security, copyright, & digital rights management Accuracy of optical character recognition (OCR) OCR of non-Romanic languages & scripts Automatic creation of structural metadata Automatic summarization Intelligent indexing Machine translation Storage formats Search engines Who is involved? Carnegie Mellon University Libraries & School of Computer Science Other U.S. libraries Internet Archive India & China OCLC, DLF, & CRL Archival Resource Company Inc. Funding Collection development NSF – $35,000 for initial planning meeting 2001 Funded by partners Copyright permission UC Merced – $35,000 Carnegie Mellon – ? Project administration Carnegie Mellon – ? Equipment & travel NSF – $3.6 million (discounts from Minolta) Labor for scanning India – $1.5 million China – ? Acquisitions & shipping Internet Archive – ? NSF – pilot shipment Collection Development Strategies Librarians as selectors Best books – cited in bibliographies Technical reports & government documents Priorities of participants & funding agencies Topics to get full collections What can we acquire Bulk, cheap, fast Libraries weeding, closing, renovating Initial Collection of Collections In Copyright 100,000 Books for College Libraries & other selected bibliographies November 2001 planning meeting funded by NSF Indigenous Indian & Chinese materials 200,000 Multi-lingual & multi-script language processing Public Domain 700,000 Current Collection of Collections In Copyright 500,000 Indigenous Indian & Chinese materials Books for College Libraries & other selected bibliographies 200,000 Multi-lingual & multi-script language processing New copyright permission strategy Public Domain 300,000 Scanning Underway in India Multiple centers – each with terabyte storage Indigenous materials Shipments from U.S. 100,000 books by 2004 Above average wages 6000 Book Pilot Shipment to India 20 ft ocean container – 25 days NY to Chennai 243 boxes – 9 palettes – 11,298 lbs – duty free Mostly public domain – government documents, social science, biography, history, & literature Approximate cost $2 per book round trip 4000 books did not have to be returned 2000 books were returned in good condition August 2002 to August 2003 Lessons Learned from Pilot Reduce shipping cost per book Change packing method = cost $1 per book or 50 cents per book shipped one way Reduce turn-around time Learned procedures for clearing customs Establish 5 international centers for U.S. shipments • Funded by Indian government • To be inaugurated January 2004 • Receive books, scan, check quality, return Distribution in India Bangalore Allahabad Calcutta Central Distribution Site Deemed University, Thanjavur Pilot Hyderabad Current Delhi Scanning Underway in China Customs & content issues prohibit shipping books to China Scanning indigenous materials & U.S. copyrighted works already in their libraries (with permission granted) Above average wages Standards & Workflow National standards for digital preservation Developed by IMLS 2001 & endorsed by DLF 2002 http://www.imls.gov/pubs/forumframework.htm National standards for cataloging Carnegie Mellon University Libraries Developed & documented workflow Provided training Digitization Workflow Operators scan, post-process, & OCR 600 DPI TIFF (v5) images ScanFix post-processing Abby Fine Reader OCR • 98% accuracy with English • Some foreign languages OCR being developed for other languages & scripts Optimum Scanner Throughput 4000 books per year per Minolta scanner One scanner, two shifts daily = 16 books per day 250 work days per year = 4000 books per year 72,000 books per year with current 18 scanners 400,000 books per year with 100 scanners Allowing 50% deterioration in throughput, 100 scanners can complete the project in 5 years Metadata Workflow Librarians capture Bibliographic metadata - for delivery system • MARC from OCLC or create Dublin Core • Guest IDs provided by OCLC Administrative metadata - for reporting system • • • • • Bibliographic metadata Collaborative development Source library by India & Return requested Carnegie Mellon Copyright status – check renewal records Permission status – used by delivery system Copyright Workflow #1 Identify & contact publishers Negotiate permission No Update publisher database Carnegie Mellon India Update publisher database Yes Locate, acquire, & ship to India Scan & OCR Capture metadata Return books as needed Send copies to Carnegie Mellon Copyright Workflow #2 Locate, acquire, & ship to India Internet Archive & Archival Resource Company, Inc. Scan & OCR Capture metadata Return books as needed Send copies to Carnegie Mellon Carnegie Mellon Copyright Yes & Permission Unknown Identify & contact publishers Update publisher database as needed India Negotiate permissions as needed No Delivery system won't display Yes Update publisher database Update administrative metadata Contextual Searching Legal to scan & create index without permission When no permission granted to display copyrighted book, search returns query terms in OCR context Acquiring the Collection Archival Resources Company Inc. Packing, shipping, & tracking Help locate & acquire books • Weeded collections • Closing libraries Acquisitions web site • Materials wanted • Loaning & donating • Insurance Integrating the Collection Transporting files to Carnegie Mellon Inadequate Internet bandwidth Expense of copying to & from gold CD or DVD Physically transport files on disks Sustaining the Collection Goal is 10 organizations host the Collection India – Digital Library of India multiple locations China – site(s) not yet known U.S. – Carnegie Mellon, Internet Archive, & University of California Merced Discussions with OCLC, Library of Congress, & Digital Library of Alexandria Estimated cost one million $$ per host site Estimated size is 20 terabytes http://www.ulib.org/html/index.html http://www.ulib.org/html/index.html www.dli.gov.in www.dli.gov.in Next Steps Ship 12,500 books one-way to Hyderabad, India from University of Washington @ $6000 Negotiations UMI/Proquest – print-on-demand service OCLC – digital registry & identifying source libraries CRL – supply or help acquire books November 2003 – collection meeting NSF proposal to create database of copyright renewal records Print on Demand Service UMI/ProQuest Handle financial transactions Print, bind, & send books to customers Collaboration with Carnegie Mellon Negotiate royalties with publishers Develop suitable business model Digital Rights Management – Lite Free-to-read by any Internet user Difficult to save or print books One page at a time using browser Secure servers restrict access Discourage hacking by offering affordable printed, bound books Global Business Model Hardback book = $30.00 Paperback book = $15.00 Digitalback book = cost of cup of coffee Internet Archive POD = $1.00 paperback India POD = $0.80 paperback; $2.00 hardback Open Access Feasibility Study Couldn’t locate publisher for 11% books If located publisher, half didn’t respond Even to second letter If got response Fewer than half gave permission Often permission was restricted 22% permission granted 1999-2000 – statistically valid random sample Permission by Publisher Type Success 22% Rate overall Scholarly 45% associations University 37% presses Museums 31% & galleries Commercial 12% publishers 75% 50% 25% 0% NoResponse Rate Success Rate Library Content Open Access Fine & Rare Books 367 titles in copyright (34% of collection) Couldn’t locate copyright holder for 13% of titles 127 letters & 44 follow-up calls to date 56% titles permission granted 6% with restrictions 3% titles permission denied Assumed if 3 contacts get no response 2002-2003 Transaction Costs $37.00 per title $ 6,550 FTE labor $ 225 Phone calls $ 65 $ 6,840 Paper & postage TOTAL May 2003 through August 2003 Does not include legal fees, cost of Internet connectivity or administrator time Million Book Project Copyright Negotiations Educate Find online, but use print Online access increases use Open access doesn’t decrease, & can increase sales Copyright absurdity Ask Non-exclusive permission to scan & provide open access Minimal system functionality Give Preservation-quality copies Metadata & OCR Motivate $$ Use in added-value, fee-based services $$ Print on demand for out-of-print titles $$ Buy button for in-print titles Initial Copyright Approach Do not pay permission cost Focus on out-of-print, in-copyright titles Books for College Libraries has 50,000 titles Begin with scholarly associations & university presses Transaction cost per title is prohibitive Identifying & inserting titles in letters Negotiating & tracking permission per title Epiphany & New Approach Focus on publishers of quality books Treat bibliographies as approval plan of publishers Books for College Libraries has 5600 publishers Ask for permission to digitize All out-of-print, in-copyright titles All titles published prior to a date of their choosing All titles published # or more years ago List of titles they provide Follow-up phone call or visit Current Statistics 5600 publishers in Books for College Libraries Using intermittent labor Couldn’t locate 30 publishers (so far) 184 letters & 24 follow-up calls to date 4% permission granted 5% permission denied Full-time staff October 2003 Results of New Approach Estimate transaction costs remain the same But acquire more books for $$ spent National Academy Press – 99% increase • 26 titles in Books for College Libraries • Permission for 3,046 titles Brookings Institution – 96% increase Rand McNally – 60% increase “More Bang for the Buck” Initial In Copyright Public Domain Indigenous Materials Current Projections Success rate # of books (# BCL publishers) per publisher Million Book Collection 4% (224) 1500 336,000 6% (336) 1500 504,000 22% (1,232) 1500 1,848,000 We could need to negotiate with India for more labor Usage Assessments Transaction log analysis – beginning 2004 Number of searches, browses, pages displayed Outcomes assessment – 2006-2007 Use title online & print-on-demand Userper demographics – age, gender, location Use different demographics Howby users foundgeographic the Collection Use per time of itday, day of week, Why they used & what they did month with it of the year What difference the Collection made Their assessment of the quality of the Collection & the usability & functionality of the system Their view of the significance of the project Copyright Assessments – 2006 Number of copyrighted books in the collection Success rate of permission requests Survey of participating publishers Overall satisfaction Quality of the copies What they did or plan to do with the copies Impact on revenue & view of open access Dissemination Million Book Collection Books accessible via Google search Libraries can link to collection from web site Libraries can link books to catalog records Publisher database Successful negotiation strategies Research test bed Thank you!