HATHI TRUST
A Shared Digital Repository

HathiTrust 101
John Wilkin and Jeremy York
August 27, 2010

Outline
• About HathiTrust
  – Mission & Goals
• Governance
• Content
• What we do (services)
• Partnership & Resources
• Technology
• Future Directions

Current Partners
– Columbia University
– New York Public Library
– University of California system
– CIC (Committee on Institutional Cooperation)
  • University of Chicago
  • University of Illinois
  • Indiana University
  • University of Iowa
  • University of Michigan
  • Michigan State University
  • University of Minnesota
  • Northwestern University
  • Ohio State University
  • Pennsylvania State University
  • Purdue University
  • University of Wisconsin-Madison
– Triangle Research Libraries Network
– University of Virginia
– Yale University

HathiTrust
• Universal Digital Library
• Common Goal
• Single Entity, Many Partners

Governance (diagram)
• Budget/Finances
• Decision-making
• Strategic Advisory Board
• Executive Committee
• HathiTrust
• Guidance on Policy, Planning

Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Deputy Director of Libraries, UW-Madison (ex officio)
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM

Strategic Advisory Board
• Ed Van Gemert (Chair), Deputy Director of Libraries, UW-Madison
• John Butler, Associate University Librarian for Information Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R.
Bruce Miller, University Librarian, UC Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)
• Robert Wolven, Columbia University

Content Distribution
• Total: 6,331,718
• Public Domain: 1,215,210
* As of July 25, 2010

Language Distribution (1) [chart; as of July 25, 2010]

Language Distribution (2) [chart; as of July 25, 2010]
• The next 40 languages make up ~13% of total

Dates [chart; as of July 25, 2010]

Originating Institution [chart; as of July 25, 2010]

Content over time [chart; as of July 25, 2010]

Content Growth [chart]

Services
• Long-term preservation
  – Bit-level preservation and migration
• Content Access
  – Viewing
  – Redistribution
  – Print disabilities
  – Section 108
• Rights management
  – Rights database
  – Copyright review
• Bibliographic search
  – Temporary catalog
  – Version 1 permanent catalog Summer 2010
• Full-text search
  – November 2009
• Print on Demand
  – UM public domain
  – UM Press
• Publish virtual collections
  – Collection Builder
• Availability of data
  – Metadata files
  – Bib API
  – Data API
• Google and IA ingest
  – Inbound validation
  – Fixity checks
• Shibboleth
  – Full-PDF download
  – Collection Builder
• Development Environment
  – Supporting partner development
• Computational Research
  – Datasets
  – Protocol
  – Research Center
• Beyond Books and Journals
  – Born digital
  – Images/maps
  – Audio

Focus on users
• Preservation…with Access
• Brings concerns of research libraries to bear on the way the scholarly record is cared for and made available
  – Scholarly Resource
  – Bibliographic Search
  – Full-text search
  – Collections
  – Full-PDF download of public domain

Cost Model 1
Reasonable costs of sustaining the archive, including the cost of replacement and a capital fund

Cost Model 1
• Economies of scale keep costs low
  – $0.145/volume/year for Google-digitized
  – about $0.45/volume/year for IA-digitized
• Advantages not fully known until you jump in

Cost Model 2
For public domain volumes: (PD*X*C)/N
For a given in-copyright volume:
IC = (C*X)/H
• Share in costs of curation
• Share in uses of relevant materials
• Voice in future directions
• Free riders?

Cost Model 2
• Sustaining common resource
• Costs go down
• Quality of services increases
  – Realize in the aggregated collection something you don’t get through distributed search or federation

Cost Model 2: Timeline & Requirements
• Timeline:
  – Implement in 2013
  – Accept new partners now with costs based on overlap calculations
• Requirements:
  – Print holdings database
  – Update mechanisms
  – Manual remediation

Print Holdings Database
• Print holdings database will also benefit
  – De-duplication
    • Compromises user experience, obscures collection development needs
  – Management of print volumes
    • Information to withdraw volumes (journals)
  – Legal uses of copyright materials
    • Section 108, 121, ADA uses will depend on knowledge of which institutions own(ed) which materials

Staff
• Staff/Expertise – highly integrated
  – Project managers, IT and communications staff, copyright experts, administrators
  – Working groups
• Shared development space

Governance
Budget, Finances
Decision-making
Policy
Enterprise Management
Repository Administration
Communication and Coordination with partner institutions
Hardware configuration and maintenance
Data management (content storage, backup, integrity checks, deletion)
Project management
Planning
Web and application server configuration and maintenance
Security
Hardware selection and replacement
Content and Metadata specifications
Permissions
Rights Management
Bibliographic Data Management
Copyright determination
Entity description (record-level)
Copyright review
Object identification (item-level)
Copyright information management (database)
Data availability
Collection Development
Digital
• Expansion beyond books and journals (born-digital, images and maps, audio)
• Selection of content (for non-Google volume ingest and pilot projects)
Print
• Cloud Library (effect of digital on print)
Rightsholder
permissions
Disaster Recovery
Logging
Processes for ensuring content integrity
e-Commerce
Print on Demand
Content Ingest
Content Access
Quality Assurance
User Services
Transformation
PageTurner
Quality Review
Usability
Validation
Collection Builder
Content Certification
User support (helpdesk)
Large-scale Search
Financial contributions of partners
Research Center
Bibliographic Catalog
APIs
Outreach
Project website
Monthly newsletter
Papers and presentations
HathiTrust Functional Framework
Communication with potential partners
Surveys, general inquiries
Repository evaluation and audit (e.g., DRAMBORA, TRAC)
Legal
Risk management (use of materials)
Partner agreements
Advocacy

Collection Development
• Digitization/Collaboration with other initiatives
• Public domain determinations
• Duplicate volumes
• Citation
• Building Collections
• Quality

A global change in the library environment
Academic print book collection already substantially duplicated in mass digitized book corpus
[Chart: % of Titles in Local Collection (0% to 60%) against Rank in 2008 ARL Investment Index (0 to 120)]
• June 2009 median duplication: 19%
• June 2010 median duplication: 31%

Digitized Books in Shared Repositories
~75% of mass digitized corpus is ‘backed up’ in one or more shared print repositories
[Chart: monthly, Sep-09 to Jun-10; series: Mass digitized books in Hathi digital repository, Mass digitized books in shared print repositories; annotations: ~3.5M titles, ~2.5M Unique Titles]

Technology - OAIS
[Diagram labels: MARC record extensions (Aleph); Rights DB; GROOVE (JHOVE); Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; [Solr]; Google; [OCA]; In-house Conversion; GRIN; Internal Data Loading; METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums; Isilon Site Replication; TSM; MD5 checksum validation; METS object; PNG; OCR; PDF]

Future Directions
• Locally-digitized partner content
• Usage reporting
• Coordinate digital and
print resources (holdings database)
• Computational Research
• Quality
• Strategies for openness
• Collaborative Development
• Extending Services through Shibboleth
• Non-book, non-journal content
• Born-digital content
• New Bibliographic Management
• Compliance with TRAC
• Grant projects
• OCLC Catalog
• 3-year review
• Improvements to Large-scale Search
• Improvements to PageTurner
• Ingest Reporting

How can HathiTrust make a difference?
• Digital Curation
  – Drive costs down
  – Reduce bibliographic indeterminacy
  – Make meaningful decisions about formats and quality
  – Increase discoverability
  – Consolidate development talent
  – Improve strength of archiving
• Print Curation
  – Means to associate our print holdings
  – Coordinated record-keeping
• Subsidiary benefits
  – Improve description
  – Quantify problems
  – Collective attention to solving shared problems
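
Worked example: Cost Model 2 arithmetic

The Cost Model 2 slide gives two formulas, (PD*X*C)/N for public domain volumes and IC=(C*X)/H for a given in-copyright volume, without defining the symbols. The sketch below is one possible reading, not the authors' implementation. It assumes that C is an annual cost per volume, X a multiplier, N the number of partners, PD the number of public domain volumes, and H the number of partners holding a given volume. All function and field names are hypothetical, and the figures in the usage section are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CostInputs:
    """Inputs to the slide's formulas; the meaning of each symbol is assumed."""
    cost_per_volume: float  # C (assumed: annual cost of keeping one volume)
    multiplier: float       # X (assumed: a scaling factor; not defined in the deck)
    partners: int           # N (assumed: number of partner institutions)


def public_domain_share(pd_volumes: int, inputs: CostInputs) -> float:
    """(PD*X*C)/N: public domain cost split evenly across all partners."""
    if inputs.partners < 1:
        raise ValueError("need at least one partner")
    return pd_volumes * inputs.multiplier * inputs.cost_per_volume / inputs.partners


def in_copyright_volume_share(holders: int, inputs: CostInputs) -> float:
    """IC=(C*X)/H: cost of one in-copyright volume split across its H holders."""
    if holders < 1:
        raise ValueError("a volume needs at least one holding partner")
    return inputs.cost_per_volume * inputs.multiplier / holders


def partner_fee(pd_volumes: int, holders_per_held_volume: list[int],
                inputs: CostInputs) -> float:
    """Public domain share plus the IC share of every volume the partner holds.

    holders_per_held_volume lists H for each in-copyright volume that
    overlaps with the partner's print holdings (hypothetical input shape).
    """
    ic_total = math.fsum(
        in_copyright_volume_share(h, inputs) for h in holders_per_held_volume
    )
    return public_domain_share(pd_volumes, inputs) + ic_total


if __name__ == "__main__":
    # Illustrative values only: PD count is from the Content Distribution slide;
    # C, X, N and the overlap list are made up for the example.
    example = CostInputs(cost_per_volume=0.15, multiplier=1.0, partners=25)
    fee = partner_fee(1_215_210, [3, 3, 10, 1], example)
    print(f"Estimated annual fee: ${fee:,.2f}")
```

Under this reading, a partner's in-copyright cost falls as more institutions hold the same volume, which matches the slide's point that costs go down as the partnership grows, and it shows why the model depends on a print holdings database to supply H.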