HATHI TRUST
A Shared Digital Repository

HathiTrust Open Webinar
Jeremy York
Project Librarian, HathiTrust
May 3 and 5, 2011

Outline
• Overview
• Mission and Goals
• Content
• Services
• Governance, how the partnership operates
• Partnership
• Changing Library Landscape
About

Current Partners
Arizona State University
Baylor University
California Digital Library
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Harvard University Library
Indiana University
Johns Hopkins University
Library of Congress
Massachusetts Institute of Technology
Michigan State University
New York University
New York Public Library
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Stanford University
Texas A&M University
Universidad Complutense de Madrid
University of California
  Berkeley
  Davis
  Irvine
  Los Angeles
  Merced
  Riverside
  San Diego
  San Francisco
  Santa Barbara
  Santa Cruz
The University of Chicago
University of Illinois
University of Illinois at Chicago
The University of Iowa
University of Maryland
University of Michigan
University of Minnesota
The University of North Carolina at Chapel Hill
University of Pennsylvania
University of Pittsburgh
University of Utah
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Yale University Library
HathiTrust Community

Mission
• To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge
Mission and Goals

HathiTrust
Universal Library
Common Goal
Single Entity, Many Partners

Goals
• Comprehensive collection
• Preservation…with Access
• Shared strategies
  – Collection management, development
  – Preservation
  – Copyright
  – Efficient user services
• Openness

Content

What is in HathiTrust?
• 8,625,158 Total volumes
• 2,297,041 Public Domain
• 4,722,664 Book titles
• 209,930 Serial titles
* As of May 1, 2011

Content Sources
* As of May 1, 2011

Content Distribution
* As of May 1, 2011

Dates
* As of May 1, 2011
Statistics and Visualizations

Breakdown of HathiTrust book corpus by publication date
Bibliographic Indeterminacy and the Scale of Problems and Opportunities of "Rights" in Digital Collection Building – 2/2011

Language Distribution (1)
The top 10 languages make up ~86% of all content
* As of May 1, 2011
Statistics and Visualizations

Language Distribution (2)
The next 40 languages make up ~13% of total
* As of May 1, 2011
Statistics and Visualizations

Content over time
[Chart: vertical axis 0% to 100%. Legend: Chicago, Madrid, Columbia, LoC, Harvard, Minnesota, Indiana, Princeton, NYPL, Cornell, Wisconsin, California, Michigan]
* As of May 1, 2011
Content Growth

A global change in the library environment
[Chart: Academic print book collection already substantially duplicated in mass digitized book corpus. Vertical axis: % of Titles in Local Collection (0% to 60%). Horizontal axis: Rank in 2008 ARL Investment Index (0 to 120). June 2009 median duplication: 19%. June 2010 median duplication: 31%.]

Digitized Books in Shared Repositories
[Chart: vertical axis 0 to 3,500,000; horizontal axis Sep-09 to Jun-10. Series: Mass digitized books in Hathi digital repository; Mass digitized books in shared print repositories. Annotations: ~3.5M titles; ~2.5M Unique Titles; ~75% of mass digitized corpus is ‘backed up’ in one or more shared print repositories]

Services

Services (1)
• Ingest
  – Book and Journal content
    • Google
    • Internet Archive
    • In-house, other vendor digitization
  – Images, Audio, Born digital (coming soon…)
• Two parts
  – Bibliographic Data
  – Content
Getting Content Into HathiTrust | Building a Future by Preserving our Past

Services (2)
• Long-term preservation
  – Bit-level,
    migration
  – Standard and open formats (ITU G4 TIFF, JPEG2000, JPG, Unicode)
  – Validation, integrity, redundancy
  – OAIS
• How reliable is it?
  – DRAMBORA, TRAC
Preservation | Technology | TRAC

Technology - OAIS
[Diagram labels: MARC record extensions (Aleph); Rights DB; GROOVE (JHOVE); Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; [Solr]; Google; Internet Archive; In-house Conversion; GRIN; Internal Data Loading; METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums; Isilon Site Replication; TSM; MD5 checksum validation; METS object; PNG; OCR; PDF]
Technology

Quality
• Partner Digitization
• Google Digitization
• Quality work / Volume certification
• feedback@issues.hathitrust.org
Quality

Services (3)
• Preservation…with Access
  – As part of preservation, service to partners, and as public good
  – Discovery
    • Bibliographic (temporary catalog, OCLC/HathiTrust catalog)
    • Full-text
  – Reading
    • Interface optimized for users with print disabilities
  – Collections
Searching, Reading, and Building Collections

Access Matrix
Columns: Type of work | Search – Bib and Full text | View | Full-PDF download | Print on Demand | Print disabilities | Section 108 (preservation uses)
Rows: Public domain worldwide; Public domain in the US; Open Access (+Creative Commons); In copyright (and undetermined)
Cell values, in extraction order: World World World US World if no restrictions, Partners if restrictions US if no restrictions, US partners if restrictions World if no restrictions World World World US Partners N/A worldwide US Partners World with Partners permission worldwide if no restrictions Not Not available Not Partners available available US and worldwide, where applicable N/A N/A Partners US and worldwide, where applicable

Services (4)
• Rights Management
  – Rights Database
  – Copyright review
    • IMLS Grant awarded to University of Michigan 2008 to determine copyright status of books published in US between 1923 and 1963
    • 18 staff members, 4 institutions
      – Indiana University
      – University of Michigan
      – University of Minnesota
      – University of Wisconsin
    • 125k
      reviewed through CRMS
    • 67,000 (54%) in public domain
Copyright

Copyright status of books published pre-1923 and US works published 1923-1963

Services (5)
• Data Availability
  – Tab-delimited inventory files
  – Bibliographic API
  – Data API
  – OAI feed of public domain
  – SFX target
  – Summon
Hathifiles | Data Distribution and APIs

Services (6)
• Collaborative Development Environment
  – Active repository development
• Support for Computational Research
  – Datasets
    • 120,000-volume set
    • Google-digitized public domain
  – Protocol-based access
  – Research Center
Datasets

How Different from Google?
• Preservation
• Content
• Collective work
• Uses of materials
• Own trajectory
• Partnership
  – Not just about digital content or repository
  – Address challenges
  – Fulfill mission
  – Provide services for our communities

Governance and Work

Governance
[Diagram labels: Budget/Finances; Decision-making; Strategic Advisory Board; Guidance on Policy, Planning; Executive Committee]
HathiTrust Governance

Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Deputy Director of Libraries, UW – Madison (ex officio)
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM

Strategic Advisory Board
• Ed Van Gemert (Chair), Deputy Director of Libraries, UW Madison
• John Butler, Associate University Librarian for Information Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R.
  Bruce Miller, University Librarian, UC - Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)
• Robert Wolven, Columbia University
Strategic Advisory Board

Constitutional Convention
• October 2011
• Delegates from each institution and consortium
  – Carry certain number of votes determined according to formula approved by Executive Committee
• 3-year review
• Proposals
  – Print management
  – Ballot proposals

How does work get done?
• Collective work
  – e.g., working groups
  – Perform the work of the partnership
  – Now 40+ people across partner institutions
• Distributed work
  – Driven by needs of institutions – able to leverage across the partnership
  – Projects, e.g. grant work, ingest specifications, page-turner, bibliographic data management
• Leverage expertise across institutions
Working Groups and Committees | Projects

Working Groups (1)
• Operational focus
  – Appointed by Executive Director in coordination with Executive Committee
  – Current
    • Usability
    • User Support
    • Communications
  – Previous
    • Development Environment
    • Storage
    • Research Center

Working Groups (2)
• Planning or Exploratory focus
  – Appointed by Strategic Advisory Board
  – Recommendations reviewed by SAB and XCom; may call for subsequent implementation
    • Collections Committee
    • Surrogates
    • Quality, Ingest, and Error rate
    • Discovery

How is work prioritized?
• Initial functional objectives
• Collective processes
  – Working groups and committees
Functional Objectives | Working Groups and Committees

HathiTrust Functional Framework
[Diagram; box labels in extraction order]
Governance; Budget, Finances; Decision-making; Policy; Enterprise Management; Repository Administration; Communication and Coordination with partner institutions; Hardware configuration and maintenance; Data management (content storage, backup, integrity checks, deletion); Project management; Planning; Web and application server configuration and maintenance; Security; Hardware selection and replacement; Content and Metadata specifications; Permissions; Rights Management; Bibliographic Data Management; Copyright determination; Entity description (record-level); Copyright review; Object identification (item-level); Copyright information management (database); Data availability
Collection Development
  Digital
  • Expansion beyond books and journals (born-digital, images and maps, audio)
  • Selection of content (for non-Google volume ingest and pilot projects)
  Print
  • Cloud Library (effect of digital on print)
Rightsholder permissions; Disaster Recovery; Logging; Processes for ensuring content integrity; e-Commerce; Print on Demand; Content Ingest; Content Access; Quality Assurance; User Services; Transformation; PageTurner; Quality Review; Usability; Validation; Collection Builder; Content Certification; User support (helpdesk); Large-scale Search; Financial contributions of partners; Research Center; Bibliographic Catalog; Outreach; Project website; Monthly newsletter; Papers and presentations; Communication with potential partners; Surveys, general inquiries; APIs; Repository evaluation and audit (e.g., DRAMBORA, TRAC); Legal; Risk management (use of materials); Partner agreements; Advocacy
Functional Framework

Partnership

Partnership
• Who can become a partner?
  – Institutions worldwide
  – Libraries with print holdings
Eligibility and Agreements

What are the benefits?
(1)
• Cost-effective long-term preservation and access services for digitized content
  – Commitments on digital content facilitate decisions about digitization efforts and print collection management
• For those with content, immediately offering long-term preservation, bibliographic and full-text search, collection-building
• With content or not, full viewing and downloading capabilities for public domain materials and materials for which we have received permissions
Features and Benefits | New Cost Model FAQ

What are the benefits? (2)
• Specialized access to public domain and in-copyright materials for users with print disabilities
• Other lawful uses of in-copyright materials such as Section 108 uses (print replacement copies, digital access to applicable works)
• HathiTrust encourages participation in initiatives and resources geared toward
  – Shared collection development and management (e.g., copyright review work, print holdings database, de-duplication, collaboration with other organizations and initiatives)
  – Participation in governance and collaborative initiatives
  – Defining future directions of the shared library

What’s involved?
• Contract
  – Sustaining
  – Content-Contributing
• Yearly fees
• Commitment
  – 5-year periods
• Shibboleth
• Print Holdings

How much does it cost? (1)
Cost

How much does it cost? (2)
• $0.149/volume/year for Google-digitized
• $0.489/volume/year for IA-digitized
• $0.154/volume/year for all content
• $3.40 per GB

How does it work? (1)
• Sustaining membership is base
  – Pricing model for all partners beginning 2013
  – Based on overlap of HathiTrust volumes with institutions’ print holdings
  – Share in infrastructure costs for public domain volumes:
    • (PD*X*C)/N
  – Share in infrastructure costs for in-copyright volumes based on holdings
    • For a given in-copyright volume:
    • IC=(C*X)/H

How does it work?
(2)
• Main factors in costs are
  – Amount of content
  – Number of partners
  – Also a flexible multiplier designed to pay for programmatic activities
• Tend to result in lower costs and more benefits over time

How does it work? (3)
• In order to support these calculations
  – Need print holdings database (2013)
  – Update mechanisms
  – Manual remediation
• Using estimates currently
  – Based on infrastructure costs of anticipated content
  – Estimated partnership growth
  – Institution total volume counts
Cost

How does it work? (4)
• Does not exclude contribution of content
• If contribute content, costs covered up to amount that would be paid as Sustaining partner
  – Barring additional costs that might be needed to accommodate content (e.g., specialized load routines, generation of OCR)
• Above that, pay per-GB cost ($3.40)

How does it work? (5)
• Partners share in costs of sustaining common resource
• Share in uses of relevant materials
• Voice in future directions
• Costs to institutions go down
• Quality of services increases
  – Realize in an aggregated collection something you don’t get through distributed search or federation
• Free riders?
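
Worked sketch of the cost formulas
The two formulas on the "How does it work? (1)" slide, (PD*X*C)/N and IC=(C*X)/H, can be tried out with a few lines of Python. The slides do not define the symbols, so the reading used here is an assumption: PD is the number of public domain volumes, C the infrastructure cost per volume per year, X the flexible multiplier, N the number of partners, and H the number of partners holding a given in-copyright volume in print. Adding the two shares into one yearly fee is also an assumption, and every function name and sample figure below is hypothetical.

```python
from typing import Iterable


def public_domain_share(pd_volumes: int, multiplier: float,
                        cost_per_volume: float, partners: int) -> float:
    """One partner's share of public domain costs: (PD*X*C)/N."""
    if partners <= 0:
        raise ValueError("partners must be positive")
    return (pd_volumes * multiplier * cost_per_volume) / partners


def in_copyright_volume_share(cost_per_volume: float, multiplier: float,
                              holders: int) -> float:
    """One holder's share of a single in-copyright volume: IC = (C*X)/H."""
    if holders <= 0:
        raise ValueError("holders must be positive")
    return (cost_per_volume * multiplier) / holders


def sustaining_fee(pd_volumes: int, multiplier: float, cost_per_volume: float,
                   partners: int, holder_counts: Iterable[int]) -> float:
    """Assumed yearly fee: public domain share plus the sum of IC shares.

    holder_counts has one entry per in-copyright volume that overlaps the
    institution's print holdings; each entry is H for that volume.
    """
    pd_part = public_domain_share(pd_volumes, multiplier,
                                  cost_per_volume, partners)
    ic_part = sum(in_copyright_volume_share(cost_per_volume, multiplier, h)
                  for h in holder_counts)
    return pd_part + ic_part


if __name__ == "__main__":
    # Hypothetical inputs: the volume count and per-volume rate are borrowed
    # from the slides for scale only; partner and holder counts are invented.
    fee = sustaining_fee(pd_volumes=2_297_041, multiplier=1.0,
                         cost_per_volume=0.154, partners=50,
                         holder_counts=[3, 10, 25])
    print(f"Estimated yearly fee: ${fee:,.2f}")
```

Under this reading, raising N or H lowers each partner's share, which matches the statement that the model tends to result in lower costs over time as the partnership grows.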
Changing Library Landscape
• Rapidly changing landscape
• Libraries are making these decisions but they are more and more collective decisions
• We can no longer afford to do work separately that could be done collaboratively

HathiTrust overall benefits to libraries
• Digital Curation
  – Drive costs down
  – Reduce “bibliographic indeterminacy”
  – Make meaningful decisions about formats and quality
  – Increase discoverability, use
  – Consolidate development talent
  – Improve strength of archiving
• Print Curation
  – Means to associate our print holdings
  – Coordinated record-keeping
• Subsidiary benefits
  – Quantify problems
  – Collective attention to solving shared problems

How to find out more
• Web site “About” section: http://www.hathitrust.org/about
• Twitter: http://twitter.com/hathitrust
• Monthly newsletter: http://www.hathitrust.org/updates
• RSS: http://www.hathitrust.org/updates_rss
• Contact us: feedback@info.hathitrust.org
• Soon: Facebook, blog

Thank you very much