HATHI TRUST
A Shared Digital Repository

Columbia University and HathiTrust
Collaboration at a new level

Outline
• About HathiTrust
  – Mission & Goals
• Background
• What we do (services)
  – Objectives
• Governance
• Partnership & Resources
• Technology
• Future Directions

About

What is HathiTrust
Universal Digital Library
Common Goal
Single Entity but Partnership of Many Libraries

Goals
• Reliable and comprehensive archive of materials converted from print … co-owned
• Ensure the long-term preservation of content
• Improve access … to meet the needs of the co-owning institutions
• Coordinate shared storage strategies
• “public good” … sustaining the historical record
• Simultaneously … centralized … open

Background

History
• Michigan Digitization Project 2004
• “…U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered in cooperation with partner research libraries such as the institutions in the Digital Library Federation…”

History
• Collective Agreement with CIC Announced in June 2007
  – U of Michigan and U of Wisconsin Projects already underway

History
• In 2007, CIC agreed to establish a shared digital repository
• University of Michigan and Indiana University initial leaders of this effort

History
CIC Shared Digital Repository
HathiTrust

The Partners
• When announced in October 2008, partners included:
  – University of California system
  – CIC (Committee on Institutional Cooperation)
      University of Chicago
      University of Illinois
      Indiana University
      University of Iowa
      University of Michigan
      Michigan State University
      University of Minnesota
      Northwestern University
      Ohio State University
      Pennsylvania State University
      Purdue University
      University of Wisconsin-Madison
  – University of Virginia
Columbia University

The Name
• The meaning behind the name
  – Hathi (hah-tee): Hindi for elephant
  – Big, strong
  – Never forgets, wise
  – Secure
  – Trustworthy

Content Distribution
5,317,545 - Total
764,103 - Public Domain

Content Growth

What we do

Services
• Bit-level preservation and migration
• Rights database
• Copyright review
Long-term preservation
Rights management
• Inbound validation
• Fixity checks
Google ingest
• Viewing
• Redistribution
• Print disabilities
• Section 108
Access (within bounds of law and settlement)
• Temporary catalog
• Version 1 permanent catalog April 2010
Bibliographic search
• Collection Builder
• Metadata files
• Bib API
• Data API
Publish virtual collections
Availability of data
• November 2009
• UM public domain
• UM Press
Full-text search
Print on Demand

Functional Objectives
• Improved performance
• Right now at UM only
• Plans to extend
PageTurner
Access for users with print disabilities
• Metadata files
• Bib API
• Data API
• Collection Builder
Publish virtual collections
• Identification at partner institutions
Branding
• IA-digitized
• locally-digitized
• Full-PDF download
• Collection Builder
• Section 108 (later on)
• Users with print disabilities (later on)
• Index optimization
• Ongoing hardware acquisition
Non-Google digitized print content
Extending services through Shibboleth
Improvements to large-scale search
• Beginning to investigate ePub as a delivery format
• Isilon software
• June 2010
• Including outstanding areas like disaster recovery
• Ongoing basis
• PageTurner
• Advanced search
• Search facets
• Collection Builder
Born-digital
Fixity checking
Compliance with TRAC
Collaborative Development Environment
Strategies for Openness
• Temporary catalog
• Version 1 permanent catalog April 2010
Public discovery interface
• Audio pilot
• Images (maps)
Non-book/non-journal content
• Research Center
• Data distribution
• Tools such as SEASR
Data mining tools

Governance

Governance
Budget/Finances
Decision-making
Policy
Planning
Strategic Advisory Board
Executive Committee

HathiTrust Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Director of Libraries, UW - Madison
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM

Strategic Advisory Board
• Ed Van Gemert (Chair), Director of Libraries, UW - Madison
• John Butler, Associate University Librarian for Information Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R. Bruce Miller, University Librarian, UC - Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)

Partnership & Resources

Partnership & Resources (1)
• Funded for an initial 5 years with base funding from partners
• Budget
  – separately held within UMich budget system, managed by the Executive Committee
• Cost Model
  – Per-GB cost of storage per year, with a one-time fee on new content to build a capital fund
• Review in 3rd year of each 5-year period

Partnership & Resources (2)
• Staff/Expertise
  – highly integrated
  – Project managers, IT and communications staff, copyright experts, administrators (UM, Indiana and UC taking the lead)
• Working groups
• UM recently hired a Digital Preservation Librarian
• Shared development space

Governance
Budget, Finances
Decision-making
Policy
Enterprise Management
Repository Administration
Communication and Coordination with partner institutions
Hardware configuration and maintenance
Data management (content storage, backup, integrity checks, deletion)
Project management
Planning
Web and application server configuration and maintenance
Security
Hardware selection and replacement
Content and Metadata specifications
Permissions
Rights Management
Bibliographic Data Management
Copyright determination
Entity description (record-level)
Copyright review
Object identification (item-level)
Copyright information management (database)
Data availability
Collection Development
Digital
• Expansion beyond books and journals (born-digital, images and maps, audio)
• Selection of content (for non-Google volume ingest and pilot projects)
Print
• Cloud Library (effect of digital on print)
Rightsholder permissions
Disaster Recovery
Logging
Processes for ensuring content integrity
e-Commerce
Print on Demand
Content Ingest
Content Access
Quality Assurance
User Services
Transformation
PageTurner
Quality Review
Usability
Validation
Collection Builder
Content Certification
User support (helpdesk)
Large-scale Search
Financial contributions of partners
Research Center
Bibliographic Catalog
APIs
Outreach
Project website
Monthly newsletter
Papers and presentations
HathiTrust Functional Framework
Communication with potential partners
Surveys, general inquiries
Repository evaluation and audit (e.g., DRAMBORA, TRAC)
Legal
Risk management (use of materials)
Partner agreements
Advocacy

Partnership & Resources (3)
• Toward a Cloud Library
  – CLIR, Mellon Foundation
  – OCLC Research, NYU, HathiTrust, ReCAP Libraries
• Objective: Characterize the near-term opportunity for externalizing management of academic research collections leveraging capacity of large-scale shared print and digital repositories*
• Outcomes: opportunity and risk assessment based on aggregate collection analysis; draft service agreement enabling generic consumer library to selectively outsource preservation and access of low-use research collections to large-scale print and digital repositories
*From the RLG Partner Update January 7, 2010

Partnership & Resources (4)
• CRL TRAC Audit
  – Portico and HathiTrust assessments timely
  – “Certification will augment CRL’s strategic archiving of print, and support a responsible transition to electronic-only formats where appropriate.”
  – Work with UC to design shared print journal archiving effort
  – “With this hybrid strategy CRL hopes to enable its community to accelerate the shift to electronic-only resources in a careful and responsible manner.” *
* http://www.crl.edu/archiving-preservation/digitalarchives/certification-and-assessment-digital-repositories

Partnership & Resources (5)
• New cost model
• Based on benefits to institutions
  – Public Domain
  – In-copyright
    • Volumes “held”
    • Covered by Settlement
      – Print replacement, users with print disabilities; research corpus
    • Not
      – Section 108; expand via authentication

Partnership & Resources (6)
• Timeline:
  – Implement in 2013
  – Accept new partners now with costs based on overlap calculations
• Requirements:
  – Print holdings database
  – Update mechanisms
  – Manual remediation

Partnership & Resources (7)
• Print holdings database will also benefit
  – De-duplication
    • Compromises user experience, obscures collection development needs
  – Management of print volumes
    • Information to withdraw volumes (journals)
  – Legal uses of copyright materials
    • Section 108, 121, ADA uses will depend on knowledge of which institutions own(ed) which materials

Technology

Technology - OAIS
MARC record extensions (Aleph)
Rights DB
GROOVE (JHOVE)
Page Turner
HathiTrust API
OAI
GeoIP DB
CNRI Handles
[Solr]
Google
[OCA]
In-house Conversion; GRIN
Internal Data Loading
METS/PREMIS object
TIFF G4/JPEG2000
OCR
MD5 checksums
Isilon Site Replication
TSM
MD5 checksum validation
METS object
PNG
OCR
PDF

Technology – Architecture
• Inbound validation, standards-based object storage and related metadata
• Storage in Ann Arbor and Indianapolis
• Encrypted backup to 3rd location
• Rights database for rights metadata
• Online catalog as source and storage for descriptive metadata

Technology - Ingest
• Automatic validation in GROOVE
  – Check barcode check digit using Luhn algorithm
  – Fixity check on JPEG2000, TIFF, UTF-8 using MD5
  – Well-formedness and embedded metadata check on JPEG2000, TIFF, UTF-8 using JHOVE

Technology - Repository
• Simple filesystem layout
  – One directory per volume, zip file and METS file
  – Use of a namespace allows for conflicting identifiers
  – Namespaces for institutions and, if needed, types of identifiers within the institution

Technology – METS Object
• Why METS?
  – Can serve as Archival Information Package and a Dissemination Information Package
  – Designed to record the relationship between pieces of complex digital objects
  – Can be created automatically as texts are loaded or reloaded
  – Preservation actions (PREMIS)

Technology – METS Object
• What’s there?
  – metsHdr with an ID and CREATEDATE
  – 2 dmdSecs: MARCXML and mdRef
  – amdSec containing one techMD with PREMIS metadata
  – fileSec with 4 fileGrps (zip, images, OCR, hOCR)
  – Physical structMap tying together files with metadata (pg. numbers and features)

Future Directions

Future Directions
• Partner Institutions
• Partner Institutions
• SAB working group
• SAB working group
• SAB working group
• SAB
Usage reporting
Holdings database
Quality
De-duplication
OCLC catalog
3-year review
• Research Center
• Data distribution
• Tools such as SEASR
• Wisconsin
• University of California
• University of California
• Full-PDF download
• Collection Builder
• Section 108 (later on)
• Users with print disabilities (later on)
Data mining tools
Ingest reporting
New bibliographic management
Content validation
Extending services through Shibboleth
• Data API
• IA-digitized
• locally-digitized
• Isilon software
• June 2010
• Including outstanding areas like disaster recovery
• ongoing areas
• PageTurner
• Advanced search
• Search facets
• Collection Builder
Non-Google digitized print content
Fixity checking
Compliance with TRAC
Collaborative Development Environment
• UC and GnuBook
• Partner Institutions
Improvements to PageTurner
• Beginning to investigate ePub as a delivery format
Born-digital
Strategies for Openness
• CB Integration
• Advanced search/facets
• Index optimization
• Ongoing hardware acquisition
Improvements to Large-scale Search
• Audio pilot
• Images (maps)
Non-book/non-journal content
• NSF EAGER
• Mellon Quality Grant projects

Thank You!
jjyork@umich.edu
http://www.hathitrust.org
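
Supplementary sketch: ingest validation checks
Two of the automatic checks named on the "Technology - Ingest" slide, a Luhn check-digit test on a barcode and an MD5 fixity comparison on a file, could look roughly like the following. This is an illustration only, not GROOVE's actual code. The function names are hypothetical, the barcode is assumed to be a plain ASCII digit string whose last digit is a standard Luhn check digit, and the expected MD5 value is assumed to arrive as a hex string alongside the file.

```python
import hashlib
from pathlib import Path


def luhn_is_valid(digits: str) -> bool:
    """Return True if the digit string passes the standard Luhn mod-10 check.

    Assumption: the whole string, including the trailing check digit,
    is covered by the check. Real barcode schemes may differ.
    """
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit (the check digit) and double every
    # second digit; a doubled value above 9 has its digits summed (value - 9).
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the checksum delivered with it."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

In use, a volume would be rejected at ingest if luhn_is_valid(barcode) is False or if fixity_ok(page_file, delivered_checksum) is False for any page image or OCR file. Reading in chunks keeps memory use flat for large TIFF or JPEG2000 files.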