HATHI TRUST
A Shared Digital Repository

HathiTrust Digital Library
Cooperation for Preservation

Outline
• About HathiTrust
  – Mission & Goals
• Background
• What we do
  – Services
• How we do it
  – Governance
  – Partnership & Resources
  – Technology
• Future Directions

About

What is HathiTrust
• Shared Digital Repository
  – Launched 2008 by 25 institutions (now 26)
  – Initial focus on digitized book and journal content
  – Expanding to non-book/non-journal, born digital
  – “Light” archive
• Collaboration
  – Preservation and access
  – Print collections
  – Local services
  – Public Good

Background

History
• Michigan Digitization Project 2004
• “…U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered in cooperation with partner research libraries such as the institutions in the Digital Library Federation…”

History
• Collective Agreement with CIC Announced in June 2007
• CIC agreed to establish a shared digital repository

History
[Diagram labels: CIC; Shared Digital Repository; HathiTrust]

The Partners
• When announced in October 2008, partners included:
  – University of California system
  – CIC (Committee on Institutional Cooperation): University of Chicago, University of Illinois, Indiana University, University of Iowa, University of Michigan, Michigan State University, University of Minnesota, Northwestern University, Ohio State University, Pennsylvania State University, Purdue University, University of Wisconsin-Madison
  – University of Virginia
Columbia University

The Name
• The meaning behind the name
  – Hathi (hah-tee): Hindi for elephant
  – Big, strong
  – Never forgets, wise
  – Secure
  – Trustworthy

Content Distribution
As of February 1:
• 5,323,716 - Total
• 764,481 - Public Domain

Content Growth

What we do

Services
• Long-term preservation
  – Bit-level preservation and migration
• Access
  – Viewing
  – Redistribution
  – Print disabilities
  – Section 108
• Google ingest
  – Inbound validation
  – Fixity checks
• Rights management
  – Rights database
  – Copyright review
• Publish virtual collections
  – Collection Builder
• Availability of data
  – Metadata files
  – Bib API
  – Data API
• Bibliographic search
  – Temporary catalog
  – Version 1 permanent catalog April 2010
• Full-text search
  – November 2009
• Print on Demand
  – UM public domain
  – UM Press

How we do it

Governance
• Budget/Finances, Decision-making, Policy, Planning
• Strategic Advisory Board, Executive Committee

HathiTrust Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Director of Libraries, UW - Madison
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM

Strategic Advisory Board
• Ed Van Gemert (Chair), Director of Libraries, UW - Madison
• John Butler, Associate University Librarian for Information Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R. Bruce Miller, University Librarian, UC - Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)

Partnership & Resources (1)
• Funded for an initial 5 years with base funding from partners
• Budget: separately held within UMich budget system, managed by the Executive Committee
• Cost Model: per-GB cost of storage per year, with a one-time fee on new content to build a capital fund
• Review in 3rd yr of each 5 yr period

Partnership & Resources (2)
• Staff/Expertise: highly integrated
  – Project managers, IT and communications staff, copyright experts, administrators (UM, Indiana and UC taking the lead)
• Working groups
• UM recently hired a Digital Preservation Librarian
• Shared development space

HathiTrust Functional Framework
[Diagram labels: Governance; Budget, Finances; Decision-making; Policy; Enterprise Management; Repository Administration; Communication and Coordination with partner institutions; Hardware configuration and maintenance; Data management (content storage, backup, integrity checks, deletion); Project management; Planning; Web and application server configuration and maintenance; Security; Hardware selection and replacement; Content and Metadata specifications; Permissions; Rights Management; Bibliographic Data Management; Copyright determination; Entity description (record-level); Copyright review; Object identification (item-level); Copyright information management (database); Data availability; Collection Development; Rightsholder permissions; Disaster Recovery; Logging; Processes for ensuring content integrity; e-Commerce; Print on Demand; Content Ingest; Content Access; Quality Assurance; User Services; Transformation; PageTurner; Quality Review; Usability; Validation; Collection Builder; Content Certification; User support (helpdesk); Large-scale Search; Financial contributions of partners; Research Center; Bibliographic Catalog; APIs; Outreach; Project website; Monthly newsletter; Papers and presentations; Communication with potential partners; Surveys, general inquiries; Repository evaluation and audit (e.g., DRAMBORA, TRAC); Legal; Risk management (use of materials); Partner agreements; Advocacy]
Collection Development
• Digital
  – Expansion beyond books and journals (born-digital, images and maps, audio)
  – Selection of content (for non-Google volume ingest and pilot projects)
• Print
  – Cloud Library (effect of digital on print)

Partnership & Resources (3)
• Toward a Cloud Library
  – CLIR, Mellon Foundation
  – OCLC Research, NYU, HathiTrust, Recap Libraries
• Objective: Characterize the near-term opportunity for externalizing management of academic research collections leveraging capacity of large-scale shared print and digital repositories*
• Outcomes: opportunity and risk assessment based on aggregate collection analysis; draft service agreement enabling generic consumer library to selectively outsource preservation and access of low-use research collections to large-scale print and digital repositories
*From the RLG Partner Update January 7, 2010

Partnership & Resources (4)
• CRL TRAC Audit
  – Portico and HathiTrust assessments timely
  – “Certification will augment CRL’s strategic archiving of print, and support a responsible transition to electronic-only formats where appropriate.”
  – Work with UC to design shared print journal archiving effort
  – “With this hybrid strategy CRL hopes to enable its community to accelerate the shift to electronic-only resources in a careful and responsible manner.” *
* http://www.crl.edu/archiving-preservation/digitalarchives/certification-and-assessment-digital-repositories

Partnership & Resources (5)
• New cost model
• Based on benefits to institutions
  – Public Domain
  – In-copyright
    • Volumes “held”

Partnership & Resources (6)
• Timeline:
  – Implement in 2013
  – Accept new partners now with costs based on overlap calculations
• Requirements:
  – Print holdings database
  – Update mechanisms
  – Manual remediation

Technology - OAIS
[Diagram labels: MARC record extensions (Aleph); Rights DB; GROOVE (JHOVE); Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; [Solr]; Google; [OCA]; In-house Conversion; GRIN; Internal Data Loading; METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums; Isilon Site Replication; TSM; MD5 checksum validation; METS object; PNG; OCR; PDF]

Technology - Architecture
• Inbound validation, standards-based object storage and related metadata
• Storage in Ann Arbor and Indianapolis
• Encrypted backup to 3rd location
• Rights database for rights metadata
• Online catalog as source and storage for descriptive metadata

Technology - Ingest
• Automatic validation in GROOVE
  – Check barcode check digit using Luhn algorithm
  – Fixity check on JPEG2000, TIFF, UTF8 using MD5
  – Well-formedness and embedded metadata check on JPEG2000, TIFF, UTF8 using JHOVE
• Creation of METS and PREMIS

Technology - Repository
• Isilon storage
• Simple filesystem layout
  – One directory per volume, zip file and METS file
  – Use of a namespace allows otherwise conflicting identifiers to coexist
  – Namespaces for institutions and, if needed, types of identifiers within the institution

Technology - METS Object
• Why METS?
  – Can serve as Archival Information Package and a Dissemination Information Package
  – Designed to record the relationship between pieces of complex digital objects
  – Can be created automatically as texts are loaded or reloaded
  – Preservation actions (PREMIS)

Technology - METS Object
• What’s there?
  – metsHdr with an ID and CREATEDATE
  – 2 dmdSecs: MARCXML and mdRef
  – amdSec containing one techMD with PREMIS metadata
  – fileSec with 4 fileGrps (zip, images, OCR, hOCR)
  – Physical structMap tying together files with metadata (page numbers and features)

Future Directions

Future Directions (1)
• 3-year review: SAB
• OCLC catalog: SAB
• Quality: SAB
• Deduplication: SAB
• TRAC compliance: Current and ongoing areas
• Shibboleth
  – Full-PDF
  – Collection Builder
  – Section 108
  – Users with print disabilities
• Non-Google print content
  – IA-digitized
  – locally-digitized
• Nonbook/nonjournal
  – Audio pilot
  – Images (maps)
• Born-digital
  – Beginning to investigate ePub as a delivery format
• Openness
  – Data API

Future Directions (2)
• Collaborative Development
  – PageTurner
  – Advanced search
  – Search facets
  – Collection Builder
• Fixity checking
  – Isilon software
  – June 2010
• Large-scale Search
  – CB Integration
  – Advanced search
  – Index optimizing
  – New hardware
• Ingest reporting
  – Wisconsin
• Bibliographic management
  – University of California
• Content validation
  – University of California
• Grant projects
  – NSF EAGER
  – Mellon Quality
• Usage reporting
  – Partner Institutions
• Holdings database
  – Partner Institutions
• Data mining tools
  – Research Center
  – Data distribution
  – Tools such as SEASR

Links
• Catalog, Full-text search, and Collection Builder
  – http://catalog.hathitrust.org
• METS and PREMIS implementation
  – http://www.hathitrust.org/preservation
• Technical profile:
  – http://www.hathitrust.org/technology
• Technical flow diagram
  – http://www.hathitrust.org/documents/HathiTrust-PASIG-200910.pdf
  – http://www.hathitrust.org/documents/HathiTrust-PASIG-notes200910.pdf
• Rights management
  – http://www.hathitrust.org/rights_management
• TRAC
  – http://www.hathitrust.org/accountability

Thank You!
hathitrust-info@umich.edu
jjyork@umich.edu
http://www.hathitrust.org
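
Example: barcode check digit (Luhn)
The Technology - Ingest slide lists a Luhn check on the barcode check digit. The slides do not show how GROOVE implements it, so the following is only a minimal sketch of the standard Luhn algorithm. The function name luhn_is_valid is hypothetical, and the sample number is a common Luhn test value, not a real barcode.

```python
def luhn_is_valid(barcode: str) -> bool:
    """Return True if an all-digit string passes the standard Luhn check.

    Hypothetical helper for illustration; not GROOVE's actual code.
    """
    if not (barcode.isascii() and barcode.isdigit()):
        return False
    total = 0
    # Walk right to left. The rightmost digit is the check digit; every
    # second digit to its left is doubled, and 9 is subtracted when the
    # doubled value exceeds 9 (equivalent to summing its two digits).
    for position, char in enumerate(reversed(barcode)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_is_valid("79927398713"))  # well-known valid Luhn test number
    print(luhn_is_valid("79927398710"))  # same number with a wrong check digit
```

A barcode that fails this check was most likely mis-scanned or mis-keyed, which is why it makes sense to run it before any heavier validation.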
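
Example: MD5 fixity check
The Ingest slide also lists an MD5 fixity check on image and text files. A rough sketch of what such a check involves, assuming the expected checksum arrives as a hex string alongside the file; the function names md5_of_file and fixity_ok are hypothetical and do not come from the slides.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Chunked reads keep memory use flat for large page images.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, expected_md5: str) -> bool:
    """Return True if the file's MD5 matches the recorded checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

The same comparison can be repeated later against the stored checksum to detect silent corruption in storage, which is the role the slides assign to ongoing fixity checking.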
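
Example: namespaced volume directories
The Technology - Repository slide describes one directory per volume, holding a zip file and a METS file, with namespaces so that identifiers from different institutions cannot collide. The slides do not give the actual path scheme, so this sketch assumes a simple root/namespace/identifier layout. The root "/repo", the namespaces "inst1" and "inst2", the file naming, and the function volume_paths are all hypothetical.

```python
from pathlib import PurePosixPath


def volume_paths(root: str, namespace: str, local_id: str) -> dict:
    """Build the directory, zip, and METS paths for one volume.

    Assumed layout (not taken from the slides):
        <root>/<namespace>/<local_id>/<local_id>.zip
        <root>/<namespace>/<local_id>/<local_id>.mets.xml
    """
    for part in (namespace, local_id):
        # Reject empty parts and anything that would change the path depth.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    volume_dir = PurePosixPath(root) / namespace / local_id
    return {
        "directory": volume_dir,
        "zip": volume_dir / f"{local_id}.zip",
        "mets": volume_dir / f"{local_id}.mets.xml",
    }


if __name__ == "__main__":
    # Two institutions using the same local identifier get separate directories.
    print(volume_paths("/repo", "inst1", "12345")["directory"])
    print(volume_paths("/repo", "inst2", "12345")["directory"])
```

Putting the namespace in the path means no institution has to renumber its items on ingest; a second namespace per institution can separate identifier types in the same way.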