UC Libraries and the Implications of Mass Digitization
Robin L. Chandler
User’s Council
May 11, 2007

Seek to achieve in this talk:
• Status report on UC Libraries’ mass digitization projects
• Impact of mass digitization on our collections and our users

UC Libraries’ Mass Digitization Projects
• Overview of two projects
  – Microsoft/Internet Archive
  – Google Books
• Look at Operations
• April 2007 status report on scanning

Understanding Participant Roles
• UC Libraries
  – supply & curate books
  – preserve digital files created
  – supply onsite scanning facilities when appropriate
• Third-parties (Google, Microsoft, Yahoo)
  – provide funding for book scanning
  – manage digitization vendor

Microsoft / Internet Archive
• Production scanning began April 2006
• Internet Archive: Digitization Agent
• Projected scope 100 K books (public domain) per year
  – Scanning books from all campus libraries
• Scanning Centers (20 scanning machines)
  – Location: UC at NRLF and SRLF

Google
• Production scanning began October 2006
  – Scanning books from NRLF currently
• Projected Scope
  – 2.5 million books during 6 year period
  – Public domain / in-copyright
• Scanning Center
  – Books transported to offsite scanning facility
  – Over 3K books per day

Workflow Steps (1)
• #1 Project management
• #2 Select, retrieve, inspect, mass charge / physical charge, physical transfer
• #3 Sharing bibliographic records (over 3 K daily)
• #4 Digitization: creating content files & metadata
  – JPEG 2000, PDF, OCR
  – Metadata created during scanning including image coordinates

Workflow Steps (2)
• #5 Mass discharge / manual charge; books returned to shelves
• #6 Quality control on digital files prior to ingest
• #7 Ingest of metadata and content files for preservation storage
• #8 Enhance union and local catalog records with link to hosted content

Motives: UC Libraries exploring models
• Collection Management: Digital reformatting can help support our efforts to build shared print collections
• Curating through Collaboration: Digitization of local materials creates access (for our patrons) to third-party materials not currently available
• Funding Reallocation: Funds invested in licensing online collections of out-of-copyright materials could be reallocated to digital reformatting of our unique content

Mass Digitization Collection Advisory Group (MDCAG)
• Approved by University Librarians
• First meeting March 2007
• Charge:
  – Develop process for selection of book collections for scanning from across UC Libraries
• Collection Development Committee (CDC) will approve collection selections

April 2007 Status Report
• Google
  – 249,485 books transferred
  – 235,633 books scanned
  – 11,320 books rejected
  – 55,264 books live
• Microsoft/Internet Archive
  – 84,315 books transferred
  – 58,543 books scanned / books live
  – 25,772 books rejected

Success due to our Systemwide collaboration!
• UCB & UCLA Libraries / Northern and Southern Regional Library Facility teams
• UC Library systemwide groups: ULs, SOPAG, CDC, PAG, HOPS, Bibliographer Groups
• Mass Digitization Collection Advisory Group (MDCAG)
• CDL Programs: Bibliographic Services, Collections, Data Acquisitions, Digital Preservation Repository

Microsoft: Sample Book

Internet Archive: Sample Book

Google: Sample Book

Impacts of Mass Dig
• Will we re-define our collections?
• How should we make collections available to our users?

Mass Dig: Collections & Users
• All Libraries can be bigger than before
  – Leveraging the collections of other libraries to bring content to our users
• Leveraging our collections à la the Long Tail
  – Libraries can learn from Netflix
• Digitize local content – we all have special stuff!
  – Unique holdings support specialized disciplines
• Prepare: demand for the physical item may increase
  – Digital access may increase relevance of analog
• Book discovery increasingly happens outside the library
  – Information discovery (Google, MSN, Yahoo!)
  – Bibliographic discovery (Amazon)

Our Users Today
• Faculty, Graduates and Undergrads
• Working in range of disciplines
• Seeking efficiencies
• Define their tool space
• Resource needs are diverse
  – Can vary day by day
• They judge a resource’s worth

Dawn of the Embedded Library (1)
• Web services embed library content into the browsing experience of users
  – Enable discovery, locate, request, and delivery
  – Library content must be exposed to aggregators
• Examples: Library Thing, NCSU’s CatalogWS, LibX Firefox, Google Book Search
  – Integrating web services for users and customizing software
  – Leveraging Catalog, OpenURLs, COinS, APIs, etc.

Dawn of the Embedded Library (2)
• Providing user services
  – Find in a library, POD, download to mobile devices, ILL, order from Amazon, etc.
• Expose our content to aggregators and consume the data of others
  – OAI-PMH, SRU, Google Sitemap, Open Search, RSS feeds, mobile device searching

Library Thing: Catalog Your Books Online – social bookmarking

NCSU’s CatalogWS

LibX: Providing direct access to your library’s resources

Mass Dig & New Library Services
• What systems are required to extract meaning from massive text collections?
  – Machine translation, data mining, etc.
• What new modes of reading, representation and understanding are needed to interact with texts?
  – Linguistic, visual, and statistical processing
• What collaborations between librarians, computer scientists and scholars are needed to do this exploration?
  – Standards, search queries, visualization, social networks

Epilogue: Mr. Peabody’s WABAC (wayback) Machine
• The 1992 Conference on “Technology, Scholarship and the Humanities: The Implications of Electronic Information” asked certain questions:
• Will scholarship be better if it takes advantage of technology?
• How will technology affect
  – The book?
  – The lecture?
  – The library?
  – The classroom?

1992: Historical Context
• Cold War formally ended & US lifted trade sanctions against China
• Bill Clinton was elected U.S. President
• Four police officers were acquitted in Rodney King Trial
• Johnny Carson left the Tonight Show
• Earth Summit held in Rio de Janeiro
• CD sales surpassed cassette tapes
• OPACs and Gopher were in the library and a text-based web browser was first made available to the public…

Technology, Scholarship, & Humanities Conference: Viewpoints (1)
• Richard A. Lanham, Professor of English, UCLA
  “As traditionally taught, each class exists in a temporal, conceptual and social vacuum…but if an electronic library were employed…students could read papers submitted in earlier classes, read scholarly articles on the same topics, read before-and-after examples of revised work, do searches of Shakespearean texts for imagery or rhetorical figuration, and make excerpts of videotaped performances to illustrate their papers – all without going to the campus library. Most importantly, a course like this would have a history and could be accessed by people in other courses; it would constitute a continuing society, its students becoming citizens of a commonwealth”

Technology, Scholarship, & Humanities Conference: Viewpoints (2)
• William Y. Arms, VP Computing Services, Carnegie Mellon
  “The scientific community has long-funded its capital-intensive projects with support from government and industry. In contrast, only 2 percent of humanities research funding comes from the U.S. government. As a result, the humanities can undertake few large, interdisciplinary projects unless the government and other funding agencies perceive the outcome to benefit the entire academic community…”

Thank you
• Please feel free to contact me at robin.chandler@ucop.edu
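
Supplement: April 2007 status figures, worked through

The April 2007 counts reported in this talk can be cross-checked with a little arithmetic. The sketch below is illustrative only: the class name `ProjectStatus`, the "unaccounted" measure, and the choice of transferred books as the denominator for the rejection share are assumptions made for this example, not definitions from the projects. Only the four input counts per project come from the status report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectStatus:
    """April 2007 counts for one scanning project (hypothetical helper type)."""
    name: str
    transferred: int
    scanned: int
    rejected: int

    @property
    def unaccounted(self) -> int:
        # Books transferred but reported as neither scanned nor rejected.
        # Reading these as "still in process" is an assumption.
        return self.transferred - self.scanned - self.rejected

    @property
    def rejection_share(self) -> float:
        # Rejected books as a fraction of transferred books (assumed denominator).
        return self.rejected / self.transferred


# Counts copied from the April 2007 status report.
google = ProjectStatus("Google", 249_485, 235_633, 11_320)
ms_ia = ProjectStatus("Microsoft/Internet Archive", 84_315, 58_543, 25_772)

for project in (google, ms_ia):
    print(
        f"{project.name}: {project.rejection_share:.1%} of transferred books rejected, "
        f"{project.unaccounted:,} neither scanned nor rejected"
    )
```

On these figures the Microsoft/Internet Archive counts reconcile exactly (scanned plus rejected equals transferred), while the Google counts leave 2,532 transferred books that are neither scanned nor rejected. The share of transferred books rejected comes to roughly 4.5% for Google and 30.6% for Microsoft/Internet Archive; the talk does not state the reasons for the difference.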