Robin's PowerPoint presentation

UC Libraries and the Implications
of Mass Digitization
Robin L. Chandler
User’s Council
May 11, 2007
Goals for this talk:
• Status report on UC Libraries’ mass
digitization projects
• Impact of mass digitization on our
collections and our users
UC Libraries’
Mass Digitization Projects
• Overview of two projects
– Microsoft/Internet Archive
– Google Books
• Look at Operations
• April 2007 status report on scanning
Understanding Participant Roles
• UC Libraries
– supply & curate books
– preserve digital files created
– supply onsite scanning facilities when
• Third parties (Google, Microsoft, Yahoo)
– provide funding for book scanning
– manage digitization vendor
Microsoft / Internet Archive
• Production scanning began April 2006
• Internet Archive: Digitization Agent
• Projected scope 100K books (public
domain) per year
– Scanning books from all campus libraries
• Scanning Centers (20 scanning machines)
– Locations: UC's NRLF and SRLF
Google Books
• Production scanning began October 2006
– Scanning books from NRLF currently
• Projected Scope
– 2.5 million books during 6 year period
– Public domain / in-copyright
• Scanning Center
– Books transported to offsite scanning facility
– Over 3K books per day
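A rough check of the projected scope against the stated daily throughput can be sketched as follows. The 2.5 million books, 6 years, and 3K books per day come from the slide; the figure of 250 scanning days per year is purely an assumption for illustration, as are the function names.

```python
# Figures taken from the slide
PROJECTED_BOOKS = 2_500_000
PROJECT_YEARS = 6
STATED_BOOKS_PER_DAY = 3_000

# Assumption (not from the slides): roughly 250 scanning days per year
SCANNING_DAYS_PER_YEAR = 250


def required_daily_rate(total_books, years, days_per_year):
    """Average books per scanning day needed to finish on schedule."""
    return total_books / (years * days_per_year)


def years_to_finish(total_books, books_per_day, days_per_year):
    """Years needed to scan everything at a constant daily rate."""
    return total_books / (books_per_day * days_per_year)


needed = required_daily_rate(PROJECTED_BOOKS, PROJECT_YEARS, SCANNING_DAYS_PER_YEAR)
duration = years_to_finish(PROJECTED_BOOKS, STATED_BOOKS_PER_DAY, SCANNING_DAYS_PER_YEAR)

print(f"Needed to finish in {PROJECT_YEARS} years: about {needed:.0f} books per day")
print(f"At {STATED_BOOKS_PER_DAY} books per day: about {duration:.1f} years")
```

Under that assumption, the stated throughput would leave headroom for ramp-up, rejected volumes, and downtime. A different number of scanning days changes the result proportionally.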
Workflow Steps (1)
• #1 Project management
• #2 Select, retrieve, inspect, mass charge / physical
charge, physical transfer
• #3 Sharing bibliographic records (over 3K daily)
• #4 Digitization: creating content files & metadata
– JPEG 2000, PDF, OCR
– Metadata created during scanning including
image coordinates
Workflow Steps (2)
• #5 Mass discharge / manual charge; books
returned to shelves
• #6 Quality control on digital files prior to
• #7 Ingest of metadata and content files for
preservation storage
• #8 Enhance union and local catalog records
with link to hosted content
Motives: UC Libraries exploring models
• Collection Management: Digital reformatting can
help support our efforts to build shared print
• Curating through Collaboration: Digitization of
local materials creates access (for our patrons) to
third-party materials not currently available
• Funding Reallocation: Funds invested in licensing
online collections of out-of-copyright materials
could be reallocated to digitally reformatting our
unique content
Mass Digitization Collection
Advisory Group (MDCAG)
• Approved by University Librarians
• First meeting March 2007
• Charge:
– Develop process for selection of book
collections for scanning from across UC
• Collection Development Committee (CDC)
will approve collection selections
April 2007 Status Report
• Google
– 249,485 books transferred
– 235,633 books scanned
– 11,320 books rejected
– 55,264 books live
• Microsoft/Internet Archive
– 84,315 books transferred
– 58,543 books scanned / books live
– 25,772 books rejected
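The counts above are easier to compare as percentages. The sketch below computes them from the slide's figures; the dictionary layout and the `share` helper are illustrative names, not part of any project reporting tool.

```python
# April 2007 counts as reported on the slide
STATUS = {
    "Google": {
        "transferred": 249_485,
        "scanned": 235_633,
        "rejected": 11_320,
        "live": 55_264,
    },
    "Microsoft/Internet Archive": {
        "transferred": 84_315,
        "scanned": 58_543,  # the slide reports scanned and live as one figure
        "rejected": 25_772,
        "live": 58_543,
    },
}


def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


for project, counts in STATUS.items():
    transferred = counts["transferred"]
    print(project)
    print(f"  scanned:  {share(counts['scanned'], transferred)}% of transferred")
    print(f"  rejected: {share(counts['rejected'], transferred)}% of transferred")
    print(f"  live:     {share(counts['live'], counts['scanned'])}% of scanned")
```

The Microsoft/Internet Archive scanned and rejected counts sum exactly to the transferred count. The Google counts do not, which would be consistent with some books still being in process at the time of the report, though the slide does not say so.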
Success due to our Systemwide
• UCB & UCLA Libraries / Northern and Southern
Regional Library Facility teams
• UC Library systemwide groups: ULs, SOPAG,
CDC, PAG, HOPS, Bibliographer Groups
• Mass Digitization Collection Advisory Group
• CDL Programs: Bibliographic Services,
Collections, Data Acquisitions, Digital
Preservation Repository
Microsoft: Sample Book
Internet Archive: Sample Book
Google: Sample Book
Impacts of Mass Dig
• Will we re-define our collections?
• How should we make collections available
to our users?
Mass Dig: Collections & Users:
• All Libraries can be bigger than before
– Leveraging the collections of other libraries to bring content to our
• Leveraging our collections à la the Long Tail
– Libraries can learn from Netflix
• Digitize local content – we all have special stuff!
– Unique holdings support specialized disciplines
• Prepare: demand for the physical item may increase
– Digital access may increase relevance of analog
• Book discovery increasingly happens outside the library
– Information discovery (Google, MSN, Yahoo!)
– Bibliographic discovery (Amazon)
Our Users Today
• Faculty, Graduates and
• Working in range of
• Seeking efficiencies
• Define their tool space
• Resource needs are diverse
– Can vary day by day
• They judge resource’s worth
Dawn of the Embedded Library (1)
• Web services embed library content into the browsing
experience of users
– Enable discovery, locate, request, and delivery
– Library content must be exposed to aggregators
• Examples: LibraryThing, NCSU’s Catalog WS, LibX
Firefox, Google Book Search
– Integrating web services for users and customizing software
– Leveraging Catalog, OpenURL, COinS, APIs, etc.
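As one illustration of exposing library content to browser tools, a COinS entry is an empty HTML span with class "Z3988" whose title attribute carries an OpenURL ContextObject in key/encoded-value form. The sketch below builds such a span for a book; the function name and the sample record are illustrative only and are not drawn from any UC system.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title, author, isbn=None, date=None):
    """Build a COinS span describing a book (illustrative helper)."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),
        ("rft.btitle", title),
        ("rft.au", author),
    ]
    if isbn:
        fields.append(("rft.isbn", isbn))
    if date:
        fields.append(("rft.date", date))
    # URL-encode the key/value pairs, then HTML-escape the result so the
    # "&" separators are safe inside an attribute value.
    context_object = urlencode(fields)
    return f'<span class="Z3988" title="{escape(context_object)}"></span>'


# Sample record chosen only for illustration
print(coins_span("Moby-Dick", "Herman Melville", date="1851"))
```

A browser extension that understands COinS can read the span from a page and turn it into a link to the user's own library resolver.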
Dawn of the Embedded Library (2)
• Providing user services
– Find in a library, POD, download to mobile
devices, ILL, order from Amazon, etc.
• Expose our content to aggregators and
consume the data of others
– OAI-PMH, SRU, Google Sitemap, OpenSearch,
RSS feeds, mobile device searching
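To make the "expose and consume" idea concrete, an OAI-PMH harvest is an ordinary HTTP request whose query string names a verb and a metadata format. The sketch below only constructs such a request URL and performs no network access. The base URL and set name are hypothetical placeholders; the `verb`, `metadataPrefix`, `set`, and `from` arguments are those defined by OAI-PMH 2.0.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc",
                         set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # optional: restrict to one set
    if from_date:
        params["from"] = from_date  # optional: selective harvest by date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint and set name, for illustration only
url = oai_list_records_url(
    "https://repository.example.edu/oai",
    set_spec="books",
    from_date="2007-04-01",
)
print(url)
```

An aggregator would fetch this URL, parse the XML response, and repeat the request with the returned resumption token until the record list is complete.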
LibraryThing: Catalog Your Books
Online – social bookmarking
NCSU’s CatalogWS
LibX: Providing direct access to
your library’s resources
Mass Dig & New Library Services
• What systems are required to extract meaning
from massive text collections?
– Machine translation, data mining, etc.
• What new modes of reading, representation and
understanding are needed to interact with texts?
– Linguistic, visual, and statistical processing
• What collaborations between librarians, computer
scientists and scholars are needed to do this
– Standards, search queries, visualization, social
Epilogue: Mr. Peabody’s
WABAC (wayback) Machine
• 1992 Conference on “Technology, Scholarship and the
Humanities: The Implications of Electronic Information”
asked certain questions:
• Will scholarship be better if it takes advantage of
• How will technology affect
– The book?
– The lecture?
– The library?
– The classroom?
1992: Historical Context
• Cold War formally ended & US lifted trade sanctions
against China
• Bill Clinton was elected U.S. President
• Four police officers were acquitted in Rodney King Trial
• Johnny Carson left the Tonight Show
• Earth Summit held in Rio de Janeiro
• CD sales surpassed cassette tapes
• OPACs and Gopher were in the library and a text-based
web browser was first made available to the public...
Technology, Scholarship, & Humanities
Conference: Viewpoints (1)
• Richard A. Lanham, Professor of English, UCLA
“As traditionally taught, each class exists in a temporal, conceptual
and social vacuum…but if an electronic library were
employed…students could read papers submitted in earlier classes,
read scholarly articles on the same topics, read before-and-after
examples of revised work, do searches of Shakespearean texts for
imagery or rhetorical figuration, and make excerpts of videotaped
performances to illustrate their papers – all without going to the
campus library. Most importantly, a course like this would have a
history and could be accessed by people in other courses; it would
constitute a continuing society, its students becoming citizens of a
Technology, Scholarship, & Humanities
Conference: Viewpoints (2)
• William Y. Arms, VP Computing Services
Carnegie Mellon
• “The scientific community has long-funded its capital-intensive
projects with support from government and industry. In contrast, only
2 percent of humanities research funding comes from the U.S.
government. As a result, the humanities can undertake few large,
interdisciplinary projects unless the government and other funding
agencies perceive the outcome to benefit the entire academic
Thank you
• Please feel free to contact me at
[email protected]