Massively Digitizing UC Library Collections
Google, Microsoft, and More
Learning in Retirement
Libraries – The Intersection of Tradition and Innovation
April 10, 2008
Ivy Anderson & Heather Christenson

California Digital Library
  “11th University Library” founded 1997
  Part of UC Office of the President
  Two Complementary Roles
    Facilitate library collaboration across the ten campuses of the UC system (e.g. shared collection development)
    Distinctive services emphasizing digital stewardship, innovation in scholarly publishing, and open-access digital collections
  Three Audiences
    UC libraries
    Broader UC community
    External constituencies and the general public
  Five Programs
    Collection Development and Management (Licensed Content, Shared Print Collections, Mass Digitization)
    Bibliographic Services (Melvyl Catalog, SFX)
    Preservation (Digital Preservation Repository, Web Archiving)
    Digital Special Collections (Calisphere, Online Archive of California)
    Publishing Services (eScholarship Repository, eScholarship Editions, collaboration with UC Press)

Digitization of Library Collections: Special Collections
  Manuscripts, archival collections, photographs, etc.
  CDL / UC Libraries
  Image: Berkeley, University of California, Bancroft Library, UCB 150, f. 252v
  Online Archive of California
  Calisphere

Digitization of Library Collections: Specialized Texts and Corpora
  Making of America: 10,000 texts in 10 years
  CDL eScholarship Editions

Digitization of Library Collections: Commercial Partnerships
  Image: Satans stratagems, 1648, copy from UCLA Library
  EEBO: 100,000 important early English texts
  Licensed access via ProQuest

…and Along Came Google
  Google Library Project
    2005: The ‘Google Five:’ Harvard, Oxford, New York Public Library, Stanford, University of Michigan
    2008: 20 library partners in 5 countries
  Google Publisher Partner Program

…and the Open Content Alliance
  October 2005
  Founders: Internet Archive, University of California, U of Toronto…
  Large-scale digitization of out-of-copyright works only
  A project of the Internet Archive

…and Microsoft
  Out-of-Copyright Works Only

UC Involvement
  Founding Member of Open Content Alliance: October 2005
  UC Joins Google Library Project: August 2006
  Microsoft Digitization Agreement: March 2007

So: Three Projects, One Goal
  Goal: Mass digitization of library book collections
  Google
    In-copyright and out-of-copyright works
    Available via Google search engine and Google Book Search
  Microsoft
    Out-of-copyright works only
    Available via Microsoft Live Search
  Open Content Alliance
    Out-of-copyright works only
    Available (via the Internet Archive website) to any and all search engines
    Library and grant-funded

Why Are They Doing It?
  Google’s vision: To put all the world’s information online
  Google and Microsoft: To gain market share and competitive advantage for their search (and online advertising) services
    It’s all about Search
  OCA: To put the world’s information online, for free, forever
    It’s all about the public good

Why Are We Doing It?
  To enhance student and faculty research
  To fulfill our public service mission
  To put our collections where our users are – in Google!
    Mass digitization of these materials enhances access. It can make people aware of books they may not have discovered otherwise and lead them, through an internet search, back to our libraries.
  To support deeper textual analysis and research.
    Scholars can trace the evolution of ideas and perform other sophisticated textual analysis when the full text is indexed and searchable by computer, opening scholarship in new ways.
    Many books of enduring general interest – including classic works of literature and more unique items such as early histories of the settlement of California and the West – can now be read by anyone, anywhere, anytime.
  To preserve and protect our collections
    In earthquake and fire-prone California, digitizing books in our collections may also help protect the university from catastrophic loss should disaster someday strike our libraries.

Microsoft/OCA Project Contributors
  Northern Regional Library Facility (NRLF)
  Southern Regional Library Facility (SRLF)
  UC Berkeley, Bancroft Library
  UCLA

Google Project Contributors
  Northern Regional Library Facility (NRLF) + UC Berkeley Systems
  UC Santa Cruz
  UC San Diego

CDL’s role, on behalf of UC
  Liaison with partners
  Planning & coordination
  Funding
  Stewardship of digital content
  New services

Campuses Provide the Books

The Book Digitization Process
  A world of barcodes, logistics, loading docks, packing materials, and scanning machines!
  Reasons books might get rejected (images)

Costs to the UC Libraries
  Staffing (2-5 FTE at each of 5 locations)
  Physical space & facilities
    Scanning centers (where scanning machines are housed), book processing, queue storage (book trucks)
  Costs to run campus systems
  CDL servers for inventory database, digital preservation

Digital files
  Images
  OCR - Text
  OCR - Page coordinates
  Metadata

What sort of books are being digitized?
  American history
  Humanities
  Science
  Cookbooks
  Children’s books
  East Asian & Pacific Rim collections

Where can you access the books?
  Google Book Search: http://books.google.com/
  Microsoft Live Search Books: http://search.live.com/results.aspx?q=&scope=books
  Internet Archive: http://www.archive.org/details/university_of_california_libraries
  Test version of UC Union catalog: http://melvyl-test.cdlib.org:8164/F

Copyright status is a factor
  Out of copyright, pre-1923
  “orphan works,” 1923-1964
  1965 - present

At the frontier… What’s ahead
  Digital preservation – storage, storage, storage
  Copyright determination
  Print on demand
  New modes of access & critical mass of digital books will transform scholarship
    Full text search - new form of book discovery
    Beyond search – text mining, computationally assisted research
    Machines can interact with massive amounts of texts, and provide new structures

Questions?
  Heather Christenson, CDL Mass Digitization Project Manager
  heather.christenson@ucop.edu
  Ivy Anderson, CDL Director of Collections
  ivy.anderson@ucop.edu
  For more information: http://www.cdlib.org/inside/projects/massdig/