HATHI TRUST
A Shared Digital Repository

HathiTrust Overview
Julie Bobay, Heather Christenson, and John Wilkin
April 12, 2011

HathiTrust Overview
• Our organization and how it functions
• Our HathiTrust collection
• Perspectives on HathiTrust and public services
• Leveraging HathiTrust data
• How HathiTrust can make a difference
• How to find out more

HathiTrust
Universal Library
Common Goal
Single Entity, Many Partners

Current Partners
• Arizona State University
• Baylor University
• California Digital Library
• Columbia University
• Cornell University
• Dartmouth College
• Duke University
• Emory University
• Harvard University Library
• Indiana University
• Johns Hopkins University
• Library of Congress
• Massachusetts Institute of Technology
• Michigan State University
• New York University
• New York Public Library
• North Carolina Central University
• North Carolina State University
• Northwestern University
• The Ohio State University
• The Pennsylvania State University
• Princeton University
• Purdue University
• Stanford University
• Texas A&M University
• Universidad Complutense de Madrid
• University of California Berkeley
• University of California Davis
• University of California Irvine
• University of California Los Angeles
• University of California Merced
• University of California Riverside
• University of California San Diego
• University of California San Francisco
• University of California Santa Barbara
• University of California Santa Cruz
• The University of Chicago
• University of Illinois
• University of Illinois at Chicago
• The University of Iowa
• University of Maryland
• University of Michigan
• University of Minnesota
• The University of North Carolina at Chapel Hill
• University of Pennsylvania
• University of Pittsburgh
• University of Utah
• University of Virginia
• University of Washington
• University of Wisconsin-Madison
• Utah State University
• Yale University Library

Governance
Budget/Finances Decision-making Strategic Advisory
Board Executive Committee HathiTrust Guidance on Policy, Planning

Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Deputy Director of Libraries, UW-Madison (ex officio)
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM

Strategic Advisory Board
• Ed Van Gemert (Chair), Deputy Director of Libraries, UW-Madison
• John Butler, Associate University Librarian for Information Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R. Bruce Miller, University Librarian, UC Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)
• Robert Wolven, Columbia University

Working Groups
• Appointed by Strategic Advisory Board and Executive Committee
• Both operational and strategically focused groups
• Collections, Communications, Discovery Interface, Full-text Search, Usability, User Support
• Now 40+ people across the country
• Expertise from across the partnership

Staff
• Staff/Expertise – highly integrated
  – Project managers, IT and communications staff, copyright experts, administrators
  – Working groups
• Shared development space

HathiTrust Functional Framework
[Diagram. Labels as extracted:]
Governance; Budget, Finances; Decision-making; Policy; Enterprise Management; Repository Administration; Communication and Coordination with partner institutions; Hardware configuration and maintenance; Data management (content storage, backup, integrity checks, deletion); Project management; Planning; Web and application server configuration and maintenance; Security; Hardware selection and replacement; Content and Metadata specifications; Permissions; Rights Management; Bibliographic Data Management; Copyright determination; Entity description (record-level); Copyright review; Object identification (item-level); Copyright information management (database); Data availability; Collection Development (Digital: expansion beyond books and journals (born-digital, images and maps, audio); selection of content (for non-Google volume ingest and pilot projects). Print: Cloud Library (effect of digital on print)); Rightsholder permissions; Disaster Recovery; Logging; Processes for ensuring content integrity; e-Commerce; Print on Demand; Content Ingest; Content Access; Quality Assurance; User Services; Transformation; PageTurner; Quality Review; Usability; Validation; Collection Builder; Content Certification; User support (helpdesk); Large-scale Search; Financial contributions of partners; Research Center; Bibliographic Catalog; APIs; Outreach; Project website; Monthly newsletter; Papers and presentations; Communication with potential partners; Surveys, general inquiries; Repository evaluation and audit (e.g., DRAMBORA, TRAC); Legal; Risk management (use of materials); Partner agreements; Advocacy

What work is there?
• Usage Reporting
• Quality
• Copyright Review
• Specifications
• Metadata
• Development Environment
• Other?
Basic Infrastructure Costs

Cost Model 1
• Economies of scale keep costs low
  – $0.149/volume/year for Google-digitized
  – $0.489/volume/year for IA-digitized
  – $0.154/volume/year for all content
• Advantages not fully known until you jump in

A global change in the library environment
Academic print book collection already substantially duplicated in mass digitized book corpus
[Chart: % of titles in local collection (y-axis, 0% to 60%) against rank in 2008 ARL Investment Index (x-axis, 0 to 120). June 2009 median duplication: 19%. June 2010 median duplication: 31%.]

Digitized Books in Shared Repositories
~75% of mass digitized corpus is ‘backed up’ in one or more shared print repositories
[Chart: unique titles (y-axis, 0 to 3,500,000) by month, Sep-09 to Jun-10. Series: mass digitized books in Hathi digital repository; mass digitized books in shared print repositories. Annotations: ~3.5M titles; ~2.5M.]

Cost Model 2
For public domain volumes: (PD*X*C)/N
For a given in-copyright volume: IC=(C*X)/H
• Share in costs of curation
• Share in uses of relevant materials
• Voice in future directions
• Free riders?

Cost Model 2
• Sustaining common resource
• Costs go down
• Quality of services increases
  – An aggregated collection delivers something that distributed search or federation does not

Cost Model 2: Timeline & Requirements
• Timeline:
  – Implement in 2013
  – Accept new partners now with costs based on overlap calculations
• Requirements:
  – Print holdings database
  – Update mechanisms
  – Manual remediation

Print Holdings Database
• Print holdings database will also benefit
  – De-duplication
    • Duplication compromises user experience, obscures collection development needs
  – Management of print volumes
    • Information to withdraw volumes (journals)
  – Legal uses of copyright materials
    • Section 108, 121, ADA uses will depend on knowledge of which institutions own(ed) which materials

Questions?
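The two Cost Model 2 formulas can be worked through numerically. The slides do not define the symbols, so the sketch below assumes the following reading: PD is the number of public domain volumes, C is a per-volume annual cost, X is a unitless overhead multiplier, N is the number of partners, and H is the number of partners holding a given in-copyright volume in print. The function names and the `holder_counts` parameter are hypothetical, and the sample values in the usage section are illustrative inputs, not figures from any actual HathiTrust invoice.

```python
def public_domain_share(pd_volumes, cost_per_volume, multiplier, partners):
    """Per-partner share of public domain costs: (PD * X * C) / N.

    Symbol meanings are assumptions; the slides give only the formula.
    """
    if partners <= 0:
        raise ValueError("partners must be positive")
    return pd_volumes * multiplier * cost_per_volume / partners


def in_copyright_share(cost_per_volume, multiplier, holders):
    """Per-holder share for one in-copyright volume: IC = (C * X) / H."""
    if holders <= 0:
        raise ValueError("holders must be positive")
    return cost_per_volume * multiplier / holders


def partner_total(pd_volumes, cost_per_volume, multiplier, partners, holder_counts):
    """Sum the public domain share and one IC share per overlapping volume.

    holder_counts (hypothetical) lists, for each in-copyright volume the
    partner also holds in print, how many partners hold that volume (H).
    """
    pd_part = public_domain_share(pd_volumes, cost_per_volume, multiplier, partners)
    ic_part = sum(
        in_copyright_share(cost_per_volume, multiplier, h) for h in holder_counts
    )
    return pd_part + ic_part


if __name__ == "__main__":
    # Illustrative inputs only: the public domain count and the all-content
    # per-volume cost come from the slides; X = 1.0 and N = 52 are assumptions.
    share = public_domain_share(2_102_033, 0.154, 1.0, 52)
    print(f"Public domain share per partner: ${share:,.2f} per year")
```

Under this reading, a widely held in-copyright volume costs each holder less (larger H), while the public domain cost is split evenly across all partners regardless of print holdings. That is why the model depends on the print holdings database described in the requirements.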
Our HathiTrust Collection

Content Distribution
8,234,081 – Total volumes
2,102,033 – Public Domain
4,527,381 Book titles
202,649 Serial titles
* As of March 5, 2011

Language Distribution (1)
The top 10 languages make up ~86% of all content
* As of March 5, 2011

Language Distribution (2)
The next 40 languages make up ~13% of total
* As of March 5, 2011

Dates
[Chart]
* As of March 5, 2011

Originating Institution
[Chart]
* As of March 5, 2011

Content over time
[Chart: share of content (0% to 100%) over time by originating institution: Madrid, Illinois, Penn State, Chicago, Cornell, Princeton, Columbia, Minnesota, NYPL, Indiana, Wisconsin, California, Michigan]
* As of March 5, 2011

Content Growth
[Chart]

Collection Development and Management

Collections Committee
• Appropriate principles for duplicate volumes
• Print management proposal
• Prioritization of collection development activities
• Process for decision-making and prioritization for new content types
• Recommendations for tools and services
• Prioritization of copyright review and rights-clearing processes

What about quality?
• Validation upon ingest
• Gating on metrics from Google
• Updated versions from Google
• Proactive work by Google library partners
• IMLS grant to develop framework and methodology for validating content in large-scale digital repositories
• Crowd sourcing in our future?

Questions?

Perspectives on HathiTrust and public services

HathiTrust and Reference
• HathiTrust: like Google and licensed databases
  – very large, rich repositories of content, with services supporting their use
• Reference librarians
  – are intermediaries between all these resources and researchers who use them

HathiTrust as a Reference Source
• HathiTrust is CONSTANTLY changing
• Requirement that’s not new to reference librarians, but greatly increased: Stay engaged. Read updates. Use it.
HathiTrust is DIFFERENT
• We are THE PRODUCERS of this resource
  – HathiTrust is OUR COLLECTION
  – New role - not recipient/grader/purchaser
  – WE build this resource
• Close engagement of a sort we have not experienced before

HathiTrust and Google Books
Fact: content in HathiTrust, by the numbers, is currently largely a subset of Google Books
That’s how we started
BUT
It’s just the start

HathiTrust stands on its own: Content
• HathiTrust content has been curated over time by librarians
  – Mirrors collections of large research libraries
  – Focus on quality
• Expanding non-Google content
  – Public Domain: Copyright Review Management System
  – Content from non-Google sources
    • Internet Archive, image collections, government

Copyright Review Management System
– IMLS Grant awarded to University of Michigan in 2008 to determine copyright status of books published in the US between 1923 and 1963
– Wisconsin, Minnesota and Indiana each devote 1 FTE to this effort for Phase 3, 2010-2011
– As of March 2011, over 125,000 volumes reviewed; 54% opened up in HathiTrust

HathiTrust stands on its own: Functionality
HathiTrust supports scholarship
• Proper metadata
• User interface designed for scholarly work
• Services for people with visual impairments
• Large-scale text mining

HathiTrust stands on its own: Services
• Collection builder
• Member services (via Shibboleth logons)
  – download full PDFs
  – create permanent collections

How do people use HathiTrust?
• Of course, to read public domain books and journals
• But much more

Use stories
“I now go to HathiTrust as my first destination for in-depth reference questions.
Fantastic searchable corpus; good metadata; content and functionality designed for scholarly needs.”
Indiana University librarian

Use stories (2)
• Complete Works of Voltaire (52-volume set published in late 19th century)
  – scholar needed all volumes to do scholarly referencing from home
  – all in HathiTrust presented together under a single MARC record

Use stories (3)
• Open Folklore – a new way to use HathiTrust
  – Portal that provides access to open access published and unpublished folklore literature
  – Indiana University’s Folklore Collection first CIC “Collection of Distinction” in Google
  – HathiTrust – the “corner store” in the shopping mall of digital repositories
  – Anchor for whole set of services and initiatives, including journal liberation projects
http://www.openfolklore.org

Questions?

Leveraging HathiTrust data

A bibliographic metadata moment
• Bib data for each digital volume must be present in HathiTrust in order for volumes to be ingested
• Depositors make bib data available to UM to be loaded into HathiTrust bibliographic management system
• Info in the submitted bib records is used to make an initial rights determination about each volume
• The bib record acts as a manifest for the digital content that is then ingested
• A “snapshot in time” of the bib data associated with an object is also stored in the preservation metadata

HathiTrust makes our data available
Goal is to extend possibilities for development of local services and other uses
• Bibliographic API
• Data API
• OAI feed of public domain
• “Hathifiles”
• 120,000 public domain texts for computational research

Some examples of use
Catalogs
• UM loaded every record
• Chicago links to public domain volumes owned in print
• OCLC loaded records into WorldCat
Link resolvers
• UC created SFX target
Vendors
• H.W. Wilson databases linked to public domain volumes

Needed: A guide with examples of how partners have used the data!
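As one small illustration of the kind of local use listed above (a catalog linking to public domain volumes it owns in print), the sketch below builds a Bibliographic API lookup URL and picks out full-view items from a response. Everything specific here is an assumption and should be checked against the current API documentation: the base URL, the `brief`/`full` path segment, and the `items`, `htid`, and `usRightsString` field names follow the publicly documented API as best recalled, not anything stated in the slides. The sample response and its identifiers are invented for illustration, and no network request is made.

```python
import json
from urllib.parse import quote

# Assumed base URL of the HathiTrust Bibliographic API; verify before use.
BASE_URL = "https://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type, id_value, level="brief"):
    """Build a lookup URL for a single identifier (e.g. an OCLC number)."""
    if level not in ("brief", "full"):
        raise ValueError("level must be 'brief' or 'full'")
    return f"{BASE_URL}/{level}/{quote(id_type)}/{quote(str(id_value))}.json"


def full_view_items(response_text):
    """Return the volume IDs of items whose US rights string is 'Full view'.

    The field names ('items', 'htid', 'usRightsString') are assumptions
    about the response layout.
    """
    data = json.loads(response_text)
    return [
        item["htid"]
        for item in data.get("items", [])
        if item.get("usRightsString") == "Full view"
    ]


# Hypothetical response body; the identifiers are made up for this example.
SAMPLE_RESPONSE = json.dumps(
    {
        "records": {},
        "items": [
            {"htid": "example.0001", "usRightsString": "Full view"},
            {"htid": "example.0002", "usRightsString": "Limited (search-only)"},
        ],
    }
)

if __name__ == "__main__":
    print(bib_api_url("oclc", 424023))
    print(full_view_items(SAMPLE_RESPONSE))
```

A local catalog could run a lookup like this per record and display a link only when at least one full-view item comes back. For bulk matching, the “Hathifiles” are likely a better fit than per-record API calls.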
Future Directions (1)
• Locally-digitized partner content
• Usage reporting
• Coordinate digital and print resources (holdings database)
• Computational Research
• Quality
• Strategies for openness
• Collaborative Development
• Extending Services through Shibboleth
• Non-book, non-journal content

Future Directions (2)
• Born-digital content (Publishing)
• New Bibliographic Management
• Compliance with TRAC
• Grant projects
• OCLC Catalog
• 3-year review
• Improvements to Large-scale Search
• Improvements to PageTurner
• Ingest Reporting

How can HathiTrust make a difference?
• Digital Curation
  – Drive costs down
  – Reduce “bibliographic indeterminacy”
  – Make meaningful decisions about formats and quality
  – Increase discoverability
  – Consolidate development talent
  – Improve strength of archiving
• Print Curation
  – Means to associate our print holdings
  – Coordinated record-keeping
• Subsidiary benefits
  – Quantify problems
  – Collective attention to solving shared problems

How to find out more
• Web site “About” section: http://www.hathitrust.org/about
• Twitter: http://twitter.com/hathitrust
• RSS: http://www.hathitrust.org/updates_rss
• Monthly newsletter: http://www.hathitrust.org/updates
• Contact us: hathitrust-info@umich.edu
• Soon: Facebook, blog