HathiTrust Open Webinar (slides)

HATHI TRUST
A Shared Digital Repository
HathiTrust Open Webinar
Jeremy York
Project Librarian, HathiTrust
May 3 and 5, 2011
Outline
• Overview
• Mission and Goals
• Content
• Services
• Governance, how the partnership operates
• Partnership
• Changing Library Landscape
About
Current Partners
Arizona State University
Baylor University
California Digital Library
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Harvard University Library
Indiana University
Johns Hopkins University
Library of Congress
Massachusetts Institute of Technology
Michigan State University
New York University
New York Public Library
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Stanford University
Texas A&M University
Universidad Complutense de Madrid
University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz)
The University of Chicago
University of Illinois
University of Illinois at Chicago
The University of Iowa
University of Maryland
University of Michigan
University of Minnesota
The University of North Carolina at Chapel Hill
University of Pennsylvania
University of Pittsburgh
University of Utah
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Yale University Library
HathiTrust Community
Mission
• To contribute to the common good by collecting,
organizing, preserving, communicating, and
sharing the record of human knowledge
Mission and Goals
HathiTrust
Universal Library
Common Goal
Single Entity, Many Partners
Goals
• Comprehensive collection
• Preservation…with Access
• Shared strategies
– Collection management, development
– Preservation
– Copyright
– Efficient user services
• Openness
Mission and Goals
Content
What is in HathiTrust?
• 8,625,158 Total volumes
• 2,297,041 Public Domain
• 4,722,664 Book titles
• 209,930 Serial titles
* As of May 1, 2011
Content Sources
* As of May 1, 2011
Content Distribution
* As of May 1, 2011
Dates
* As of May 1, 2011
Statistics and Visualizations
Breakdown of HathiTrust book corpus by publication date
Bibliographic Indeterminacy and the Scale of Problems and Opportunities of "Rights" in Digital Collection Building – 2/2011
Language Distribution (1)
The top 10 languages make up
~86% of all content
* As of May 1, 2011
Statistics and Visualizations
Language Distribution (2)
The next 40
languages make
up ~13% of total
* As of May 1, 2011
Statistics and Visualizations
Content over time
[Chart: content over time by institution, vertical axis 0% to 100%. Legend: Chicago, Madrid, Columbia, LoC, Harvard, Minnesota, Indiana, Princeton, NYPL, Cornell, Wisconsin, California, Michigan]
* As of May 1, 2011
Content Growth
A global change in the library environment
Academic print book collection already substantially duplicated in mass digitized book corpus
[Chart: % of Titles in Local Collection (0% to 60%) against Rank in 2008 ARL Investment Index (0 to 120)]
June 2009: Median duplication: 19%
June 2010: Median duplication: 31%
Digitized Books in Shared Repositories
~75% of mass digitized corpus is ‘backed up’ in one or more shared print repositories
[Chart: Unique Titles (0 to 3,500,000) by month, Sep-09 to Jun-10. Series: Mass digitized books in Hathi digital repository; Mass digitized books in shared print repositories. Annotations: ~3.5M titles, ~2.5M]
Services
Services (1)
• Ingest
– Book and Journal content
• Google
• Internet Archive
• In-house, other vendor digitization
– Images, Audio, Born digital (coming soon…)
• Two parts
– Bibliographic Data
– Content
Getting Content Into HathiTrust | Building a Future by Preserving our Past
Services (2)
• Long-term preservation
– Bit-level, migration
– Standard and open formats (ITU G4 TIFF,
JPEG2000, JPG, Unicode)
– Validation, integrity, redundancy
– OAIS
• How reliable is it?
– DRAMBORA, TRAC
Preservation | Technology | TRAC
Technology - OAIS
[Diagram: repository technology mapped to OAIS. Labels:
MARC record extensions (Aleph); Rights DB; GROOVE (JHOVE); Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; [Solr];
Google; Internet Archive; In-house Conversion; GRIN; Internal Data Loading;
METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums;
Isilon; Site Replication; TSM; MD5 checksum validation;
METS object; PNG; OCR; PDF]
Technology
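The diagram labels mention MD5 checksums stored with each object and a separate MD5 checksum validation step. A minimal sketch of that kind of fixity check follows. The function names, the manifest layout (path mapped to recorded checksum), and the sample file are assumptions for illustration, not HathiTrust's actual ingest code.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_manifest(manifest):
    """Return the paths whose current MD5 differs from the recorded one.

    `manifest` is a hypothetical dict of {file path: recorded MD5 hex digest}.
    """
    return [path for path, recorded in manifest.items()
            if md5_of_file(path) != recorded]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        page = Path(tmp) / "page.txt"          # stand-in for one stored file
        page.write_bytes(b"sample OCR text")
        manifest = {str(page): md5_of_file(page)}  # checksum recorded at ingest
        print(validate_manifest(manifest))     # empty list: nothing changed
        page.write_bytes(b"sample OCR text, altered")
        print(validate_manifest(manifest))     # now lists the altered file
```

Running the same check on a schedule against every stored copy is one common way to support the "validation, integrity, redundancy" goals listed under long-term preservation.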
Quality
• Partner Digitization
• Google Digitization
• Quality work / Volume certification
• feedback@issues.hathitrust.org
Quality
Services (3)
• Preservation…with Access
– As part of preservation, service to partners, and as
public good
– Discovery
• Bibliographic (temporary catalog, OCLC/HathiTrust
catalog)
• Full-text
– Reading
• Interface optimized for users with print disabilities
– Collections
Searching, Reading, and Building Collections
Access Matrix
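Type of work: Public domain worldwide; Public domain in the US; Open Access (+Creative Commons); In copyright (and undetermined)
• Search – Bib and Full text: World for all types of work
• View: World (public domain worldwide, Open Access); US (public domain in the US); Not available (in copyright)
• Full-PDF download: World if no restrictions, Partners if restrictions (public domain worldwide); US if no restrictions, US partners if restrictions (public domain in the US); World, in some cases only with permission or if no restrictions (Open Access); Not available (in copyright)
• Print on Demand: World, in some cases only with permission or if no restrictions (public domain worldwide, Open Access); US (public domain in the US); Not available (in copyright)
• Print disabilities: Partners worldwide (public domain worldwide, Open Access); US Partners (public domain in the US); Partners US and worldwide, where applicable (in copyright)
• Section 108 (preservation uses): N/A (public domain, Open Access); Partners US and worldwide, where applicable (in copyright)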
Services (4)
• Rights Management
– Rights Database
– Copyright review
• IMLS Grant awarded to University of Michigan 2008 to
determine copyright status of books published in US
between 1923 and 1963
• 18 staff members, 4 institutions
– Indiana University
– University of Michigan
– University of Minnesota
– University of Wisconsin
• 125k reviewed through CRMS
• 67,000 (54%) in public domain
Copyright
Copyright status of books published pre-1923 and US works
published 1923-1963
Services (5)
• Data Availability
– Tab-delimited inventory files
– Bibliographic API
– Data API
– OAI feed of public domain
– SFX target
– Summon
Hathifiles | Data Distribution and APIs
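The tab-delimited inventory files listed under Data Availability can be processed with ordinary text tools. The sketch below tallies one column of such a file. The three-column layout (identifier, rights code, publication year) and the sample rows are hypothetical and made up for illustration; the real files define their own columns.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows: identifier, rights code, publication year.
SAMPLE = (
    "vol.0001\tpd\t1895\n"
    "vol.0002\tic\t1950\n"
    "vol.0003\tpd\t1910\n"
)


def count_by_column(tsv_file, column_index):
    """Count how often each value appears in one column of a tab-delimited file."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[column_index] for row in reader
                   if len(row) > column_index)


if __name__ == "__main__":
    counts = count_by_column(io.StringIO(SAMPLE), 1)
    for value, total in sorted(counts.items()):
        print(value, total)
```

For a real inventory file, pass an open file handle instead of the in-memory sample and choose the column index from the file's documentation.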
Services (6)
• Collaborative Development Environment
– Active repository development
• Support for Computational Research
– Datasets
• 120,000-volume set
• Google-digitized public domain
– Protocol-based access
– Research Center
Datasets
How Different from Google?
• Preservation
• Content
• Collective work
• Uses of materials
• Own trajectory
• Partnership
– Not just about digital content or repository
– Address challenges
– Fulfill mission
– Provide services for our communities
Governance and Work
Governance
• Executive Committee: Budget/Finances, Decision-making
• Strategic Advisory Board: Guidance on Policy, Planning
HathiTrust Governance
Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Deputy Director of Libraries, UW – Madison (ex officio)
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM
Executive Committee
Strategic Advisory Board
• Ed Van Gemert (Chair), Deputy Director of Libraries, UW Madison
• John Butler, Associate University Librarian for Information
Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R. Bruce Miller, University Librarian, UC - Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)
• Robert Wolven, Columbia University
Strategic Advisory Board
Constitutional Convention
• October 2011
• Delegates from each institution and
consortium
– Carry certain number of votes determined
according to formula approved by Executive
Committee
• 3-year review
• Proposals
– Print management
– Ballot proposals
How does work get done?
• Collective work
– e.g., working groups
– Perform the work of the partnership
– Now 40+ people across partner institutions
• Distributed work
– Driven by needs of institutions – able to leverage
across the partnership
– Projects, e.g. grant work, ingest specifications,
page-turner, bibliographic data management
• Leverage expertise across institutions
Working Groups and Committees | Projects
Working Groups (1)
• Operational focus
– Appointed by Executive Director in coordination
with Executive Committee
– Current
• Usability
• User Support
• Communications
– Previous
• Development Environment
• Storage
• Research Center
Working Groups (2)
• Planning or Exploratory focus
– Appointed by Strategic Advisory Board
– Recommendations reviewed by SAB and XCom;
may call for subsequent implementation
• Collections Committee
• Surrogates
• Quality, Ingest, and Error rate
• Discovery
How is work prioritized?
• Initial functional objectives
• Collective processes
– Working groups and committees
Functional Objectives | Working Groups and Committees
HathiTrust Functional Framework
[Diagram labels:
Governance; Budget, Finances; Decision-making; Policy;
Enterprise Management; Repository Administration; Communication and Coordination with partner institutions; Hardware configuration and maintenance; Data management (content storage, backup, integrity checks, deletion); Project management; Planning; Web and application server configuration and maintenance; Security; Hardware selection and replacement; Content and Metadata specifications; Permissions;
Rights Management; Bibliographic Data Management; Copyright determination; Entity description (record-level); Copyright review; Object identification (item-level); Copyright information management (database); Data availability;
Collection Development: Digital (expansion beyond books and journals: born-digital, images and maps, audio; selection of content for non-Google volume ingest and pilot projects); Print (Cloud Library: effect of digital on print);
Rightsholder permissions; Disaster Recovery; Logging; Processes for ensuring content integrity; e-Commerce; Print on Demand;
Content Ingest; Content Access; Quality Assurance; User Services; Transformation; PageTurner; Quality Review; Usability; Validation; Collection Builder; Content Certification; User support (helpdesk); Large-scale Search; Financial contributions of partners; Research Center; Bibliographic Catalog;
Outreach; Project website; Monthly newsletter; Papers and presentations; Communication with potential partners; Surveys, general inquiries; APIs;
Repository evaluation and audit (e.g., DRAMBORA, TRAC); Legal; Risk management (use of materials); Partner agreements; Advocacy]
Functional Framework
Partnership
Partnership
• Who can become a partner?
– Institutions worldwide
– Libraries with print holdings
Eligibility and Agreements
What are the benefits? (1)
• Cost-effective long-term preservation and access services
for digitized content
– Commitments on digital content facilitate decisions about
digitization efforts and print collection management
• For those with content, immediately offering long-term
preservation, bibliographic and full-text search,
collection-building
• With content or not, full viewing and downloading
capabilities for public domain materials and materials for
which we have received permissions
Features and Benefits | New Cost Model FAQ
What are the benefits? (2)
• Specialized access to public domain and in-copyright materials
for users with print disabilities
• Other lawful uses of in-copyright materials, such as Section
108 uses (print replacement copies, digital access to
applicable works)
• HathiTrust encourages participation in initiatives and
resources geared toward
– Shared collection development and management (e.g., copyright
review work, print holdings database, de-duplication, collaboration
with other organizations and initiatives)
– Participation in governance and collaborative initiatives
– Defining future directions of the shared library.
What’s involved?
• Contract
– Sustaining
– Content-Contributing
• Yearly fees
• Commitment
– 5-year periods
• Shibboleth
• Print Holdings
How much does it cost? (1)
Cost
How much does it cost? (2)
• $0.149/volume/year for Google-digitized
• $0.489/volume/year for IA-digitized
• $0.154/volume/year for all content
• $3.40 per GB
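As a rough illustration of the per-volume rates above, the sketch below estimates a yearly cost from volume counts. It assumes the rates simply multiply the number of volumes from each digitization source; the function name and the example volume counts are hypothetical.

```python
# Rates from the slide, in US dollars per volume per year.
RATE_GOOGLE = 0.149   # Google-digitized
RATE_IA = 0.489       # Internet Archive-digitized


def annual_cost(google_volumes, ia_volumes):
    """Estimated yearly cost, assuming a flat rate per volume by source."""
    return google_volumes * RATE_GOOGLE + ia_volumes * RATE_IA


if __name__ == "__main__":
    # Hypothetical collection: 100,000 Google volumes and 10,000 IA volumes.
    print(f"${annual_cost(100_000, 10_000):,.2f} per year")
```

With these hypothetical counts the Google-digitized volumes account for $14,900 and the IA-digitized volumes for $4,890.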
How does it work? (1)
• Sustaining membership is base
– Pricing model for all partners beginning 2013
– Based on overlap of HathiTrust volumes with
institutions’ print holdings
– Share in infrastructure costs for public domain
volumes:
• (PD*X*C)/N
– Share in infrastructure costs for in copyright
volumes based on holdings
• For a given in-copyright volume:
• IC=(C*X)/H
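The slides do not define the symbols in these two formulas. The sketch below assumes PD is the number of public domain volumes, C the infrastructure cost per volume, X the flexible multiplier for programmatic activities, N the number of partners, and H the number of partners holding a given in-copyright volume in print. Those readings, the function names, and the example values for X and N are assumptions, not definitions from the source.

```python
def public_domain_share(pd_volumes, multiplier, cost_per_volume, partners):
    """Each partner's share of public domain costs: (PD * X * C) / N."""
    return (pd_volumes * multiplier * cost_per_volume) / partners


def in_copyright_volume_share(cost_per_volume, multiplier, holders):
    """One holder's share for a single in-copyright volume: IC = (C * X) / H."""
    return (cost_per_volume * multiplier) / holders


if __name__ == "__main__":
    # PD count and per-volume rate appear elsewhere in the slides;
    # X = 1.0 and N = 50 are placeholder values.
    pd_share = public_domain_share(2_297_041, 1.0, 0.154, 50)
    print(f"Public domain share per partner: ${pd_share:,.2f}")

    # A volume held in print by 4 partners costs each of them a quarter share.
    print(in_copyright_volume_share(0.154, 1.0, 4))
```

Under this reading, adding partners (larger N) or holders (larger H) lowers each institution's share, which matches the later point that costs tend to fall over time.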
How does it work? (2)
• Main factors in costs are
– Amount of content
– Number of partners
– Also a flexible multiplier designed to pay for
programmatic activities
• Tend to result in lower costs and more
benefits over time
How does it work? (3)
• In order to support these calculations
– Need print holdings database (2013)
– Update mechanisms
– Manual remediation
• Using estimates currently
– Based on infrastructure costs of anticipated
content
– Estimated partnership growth
– Institution total volume counts
Cost
How does it work? (4)
• Does not exclude contribution of content
• If contribute content, costs covered up to
amount that would be paid as Sustaining
partner
– Barring additional costs that might be needed to
accommodate content (e.g., specialized load
routines, generation of OCR)
• Above that, pay per-GB cost ($3.40)
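One way to read the rule above: a content-contributing partner pays the Sustaining amount, which covers the cost of its contributed content up to that amount, and pays $3.40 per GB only for the part that exceeds it. The sketch below encodes that reading. The function name, the treatment of content cost as GB times the per-GB rate, and the example figures are assumptions, and the additional costs mentioned in the slide (specialized load routines, OCR generation) are left out.

```python
PER_GB = 3.40  # US dollars per GB, from the slide


def contributing_partner_fee(sustaining_fee, contributed_gb):
    """Total fee under the assumed reading of the contribution rule."""
    content_cost = contributed_gb * PER_GB
    # Only the portion of content cost above the Sustaining amount is extra.
    overage = max(0.0, content_cost - sustaining_fee)
    return sustaining_fee + overage


if __name__ == "__main__":
    # Hypothetical Sustaining fee of $10,000.
    print(contributing_partner_fee(10_000, 2_000))   # content cost is covered
    print(contributing_partner_fee(10_000, 5_000))   # content cost exceeds the fee
```

In the first case the 2,000 GB cost $6,800, below the Sustaining amount, so nothing extra is due. In the second the 5,000 GB cost $17,000, so the partner pays $7,000 on top of the base fee.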
How does it work? (5)
• Partners share in costs of sustaining common
resource
• Share in uses of relevant materials
• Voice in future directions
• Costs to institutions go down
• Quality of services increases
– Realize something in an aggregated collection that you don’t
get through distributed search or federation
• Free riders?
Changing Library Landscape
• Rapidly changing landscape
• Libraries are making these decisions but they
are more and more collective decisions
• We can no longer afford to do work
separately that could be done collaboratively
HathiTrust overall benefits to libraries
• Digital Curation
– Drive costs down
– Reduce “bibliographic indeterminacy”
– Make meaningful decisions about formats and quality
– Increase discoverability, use
– Consolidate development talent
– Improve strength of archiving
• Print Curation
– Means to associate our print holdings
– Coordinated record-keeping
• Subsidiary benefits
– Quantify problems
– Collective attention to solving shared problems
How to find out more
• Web site “About” section:
http://www.hathitrust.org/about
• Twitter: http://twitter.com/hathitrust
• Monthly newsletter:
http://www.hathitrust.org/updates
• RSS: http://www.hathitrust.org/updates_rss
• Contact us: feedback@info.hathitrust.org
• Soon: Facebook, blog
Thank you very much