New Partners Webinar - March-April 2011

HATHI TRUST
A Shared Digital Repository
HathiTrust Overview
Julie Bobay, Heather Christenson, and John Wilkin
April 12, 2011
HathiTrust Overview
• Our organization and how it functions
• Our HathiTrust collection
• Perspectives on HathiTrust and public services
• Leveraging HathiTrust data
• How HathiTrust can make a difference
• How to find out more
HathiTrust
Universal Library
Common Goal
Single Entity, Many Partners
Current Partners
• Arizona State University
• Baylor University
• California Digital Library
• Columbia University
• Cornell University
• Dartmouth College
• Duke University
• Emory University
• Harvard University Library
• Indiana University
• Johns Hopkins University
• Library of Congress
• Massachusetts Institute of Technology
• Michigan State University
• New York University
• New York Public Library
• North Carolina Central University
• North Carolina State University
• Northwestern University
• The Ohio State University
• The Pennsylvania State University
• Princeton University
• Purdue University
• Stanford University
• Texas A&M University
• Universidad Complutense de Madrid
• University of California Berkeley
• University of California Davis
• University of California Irvine
• University of California Los Angeles
• University of California Merced
• University of California Riverside
• University of California San Diego
• University of California San Francisco
• University of California Santa Barbara
• University of California Santa Cruz
• The University of Chicago
• University of Illinois
• University of Illinois at Chicago
• The University of Iowa
• University of Maryland
• University of Michigan
• University of Minnesota
• The University of North Carolina at Chapel Hill
• University of Pennsylvania
• University of Pittsburgh
• University of Utah
• University of Virginia
• University of Washington
• University of Wisconsin-Madison
• Utah State University
• Yale University Library
Governance
[Diagram: the Strategic Advisory Board and the Executive Committee in relation to HathiTrust. Labels: Budget/Finances, Decision-making; Guidance on Policy, Planning]
Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Deputy Director of Libraries, UW – Madison
(ex officio)
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and
Associate University Librarian, LIT, UM
Strategic Advisory Board
• Ed Van Gemert (Chair), Deputy Director of Libraries, UW Madison
• John Butler, Associate University Librarian for Information
Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R. Bruce Miller, University Librarian, UC - Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)
• Robert Wolven, Columbia University
Working Groups
• Appointed by Strategic Advisory Board and
Executive Committee
• Both operational and strategically-focused
groups
• Collections, Communications, Discovery
Interface, Full-text Search, Usability, User
Support
• Now 40+ people across the country
• Expertise from across the partnership
Staff
• Staff/Expertise – highly integrated
– Project managers, IT and communications
staff, copyright experts, administrators
– Working groups
• Shared development space
HathiTrust Functional Framework
[Diagram; labels listed in the order extracted]
• Governance: Budget, Finances; Decision-making; Policy
• Enterprise Management
• Repository Administration
• Communication and Coordination with partner institutions
• Hardware configuration and maintenance
• Data management (content storage, backup, integrity checks, deletion)
• Project management
• Planning
• Web and application server configuration and maintenance
• Security
• Hardware selection and replacement
• Content and Metadata specifications
• Permissions
• Rights Management
• Bibliographic Data Management
• Copyright determination
• Entity description (record-level)
• Copyright review
• Object identification (item-level)
• Copyright information management (database)
• Data availability
• Collection Development
– Digital: expansion beyond books and journals (born-digital, images and maps, audio); selection of content (for non-Google volume ingest and pilot projects)
– Print: Cloud Library (effect of digital on print)
• Rightsholder permissions
• Disaster Recovery
• Logging
• Processes for ensuring content integrity
• e-Commerce
• Print on Demand
• Content Ingest
• Content Access
• Quality Assurance
• User Services
• Transformation
• PageTurner
• Quality Review
• Usability
• Validation
• Collection Builder
• Content Certification
• User support (helpdesk)
• Large-scale Search
• Financial contributions of partners
• Research Center
• Bibliographic Catalog
• APIs
• Outreach
• Project website
• Monthly newsletter
• Papers and presentations
• Communication with potential partners
• Surveys, general inquiries
• Repository evaluation and audit (e.g., DRAMBORA, TRAC)
• Legal
• Risk management (use of materials)
• Partner agreements
• Advocacy
What work is there?
• Usage Reporting
• Quality
• Copyright Review
• Specifications
• Metadata
• Development Environment
• Other?
Basic Infrastructure Costs
Cost Model 1
• Economies of scale keep costs low
– $0.149/volume/year for Google-digitized
– $0.489/volume/year for IA-digitized
– $0.154/volume/year for all content
• Advantages not fully known until you jump in
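As a rough illustration of the per-volume rates on the Cost Model 1 slide, the sketch below estimates a yearly infrastructure cost for a deposit. The two rates come from the slide; the function name and the volume counts are hypothetical and only there to make the arithmetic concrete.

```python
# Per-volume annual rates quoted on the Cost Model 1 slide (USD).
RATE_GOOGLE = 0.149  # Google-digitized volumes
RATE_IA = 0.489      # IA (Internet Archive)-digitized volumes


def annual_infrastructure_cost(google_volumes: int, ia_volumes: int) -> float:
    """Estimate the yearly cost in USD for a set of deposited volumes."""
    total = google_volumes * RATE_GOOGLE + ia_volumes * RATE_IA
    return round(total, 2)


# Hypothetical partner: these volume counts are made up for illustration.
cost = annual_infrastructure_cost(google_volumes=500_000, ia_volumes=20_000)
print(f"${cost:,.2f} per year")
```

The blended rate on the slide ($0.154/volume/year for all content) sits close to the Google rate, which suggests that most volumes at the time were Google-digitized.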
A global change in the library environment
Academic print book collection already substantially
duplicated in mass digitized book corpus
[Chart: % of titles in local collection (0% to 60%) against rank in 2008 ARL Investment Index (0 to 120)]
• June 2009: median duplication 19%
• June 2010: median duplication 31%
Digitized Books in Shared Repositories
[Chart: unique titles (0 to 3,500,000) by month, Sep-09 to Jun-10. Series: mass digitized books in Hathi digital repository (~3.5M titles); mass digitized books in shared print repositories (~2.5M)]
~75% of mass digitized corpus is ‘backed up’ in
one or more shared print repositories
Cost Model 2
For public domain volumes:
(PD*X*C)/N
For a given in-copyright volume:
IC=(C*X)/H
• Share in costs of curation
• Share in uses of relevant materials
• Voice in future directions
• Free riders?
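The two formulas on the Cost Model 2 slide can be sketched in code. The slides do not define the symbols, so the readings below are assumptions: PD is the number of public domain volumes, C a per-volume cost, X a multiplier applied to that cost, N the number of partners sharing public domain costs, and H the number of partners holding a given in-copyright volume in print. All function names and sample values are hypothetical.

```python
from typing import Iterable


def public_domain_share(pd: int, x: float, c: float, n: int) -> float:
    """(PD*X*C)/N: public domain cost, split evenly (assumed) across N partners."""
    return (pd * x * c) / n


def in_copyright_share(c: float, x: float, h: int) -> float:
    """IC = (C*X)/H: one volume's cost, split (assumed) across its H holders."""
    return (c * x) / h


def partner_fee(pd: int, x: float, c: float, n: int,
                holder_counts: Iterable[int]) -> float:
    """Public domain share plus one in-copyright share per overlapping volume.

    holder_counts holds one H value for each in-copyright volume that the
    partner also owns in print.
    """
    ic_total = sum(in_copyright_share(c, x, h) for h in holder_counts)
    return public_domain_share(pd, x, c, n) + ic_total


# Made-up numbers chosen only to keep the arithmetic easy to follow.
fee = partner_fee(pd=1000, x=2.0, c=0.5, n=10, holder_counts=[1, 4])
print(fee)
```

Under this reading, a partner pays less for an in-copyright volume as more partners hold it (larger H), and less for the public domain corpus as the partnership grows (larger N). That fits the "Costs go down" point on the following slide.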
Cost Model 2
• Sustaining common resource
• Costs go down
• Quality of services increases
– Realize in an aggregated collection something we don’t
get through distributed search or federation
Cost Model 2: Timeline &
Requirements
• Timeline:
– Implement in 2013
– Accept new partners now with costs based on
overlap calculations
• Requirements:
– Print holdings database
– Update mechanisms
– Manual remediation
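An overlap calculation of the kind mentioned in the timeline could look like the sketch below. It is only an illustration: the slides do not say how overlap is computed. The sketch assumes that a partner's print holdings and the repository's volumes can be matched on a shared identifier such as an OCLC number, and that each repository entry carries a simple "pd" or "ic" rights label. Both the labels and the sample identifiers are made up.

```python
from typing import Dict, Set


def overlap_counts(print_holdings: Set[str],
                   repository_rights: Dict[str, str]) -> Dict[str, int]:
    """Count held titles that also appear in the repository, by rights label.

    print_holdings: identifiers for titles the partner owns in print.
    repository_rights: identifier -> "pd" (public domain) or "ic" (in copyright).
    """
    counts = {"pd": 0, "ic": 0}
    # Only identifiers present on both sides count as overlap.
    for identifier in print_holdings.intersection(repository_rights):
        counts[repository_rights[identifier]] += 1
    return counts


# Made-up identifiers for illustration.
holdings = {"1001", "1002", "1003"}
repository = {"1001": "pd", "1002": "ic", "9999": "pd"}
print(overlap_counts(holdings, repository))
```

Real holdings data rarely matches this cleanly, which is presumably why the requirements list update mechanisms and manual remediation alongside the database itself.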
Print Holdings Database
• Print holdings database will also benefit
– De-duplication
• Compromises user experience, obscures collection
development needs
– Management of print volumes
• Information to withdraw volumes (journals)
– Legal uses of copyright materials
• Section 108, 121, ADA uses will depend on knowledge of
which institutions own(ed) which materials
Questions?
Our HathiTrust Collection
Content Distribution
8,234,081 – Total volumes
2,102,033 – Public Domain
4,527,381 Book titles
202,649 Serial titles
* As of March 5, 2011
Language Distribution (1)
The top 10 languages make up
~86% of all content
* As of March 5, 2011
Language Distribution (2)
The next 40 languages make up ~13% of total
* As of March 5, 2011
Dates
* As of March 5, 2011
Originating Institution
* As of March 5, 2011
Content over time
[Chart: share of content (0% to 100%) by originating institution over time. Legend: Madrid, Illinois, Penn State, Chicago, Cornell, Princeton, Columbia, Minnesota, NYPL, Indiana, Wisconsin, California, Michigan]
* As of March 5, 2011
Content Growth
Collection Development and
Management
Collections Committee
• Appropriate principles for duplicate volumes
• Print management proposal
• Prioritization of collection development activities
• Process for decision-making and prioritization for
new content types
• Recommendations for tools and services
• Prioritization of copyright review and rights-clearing processes
What about quality?
• Validation upon ingest
• Gating on metrics from Google
• Updated versions from Google
• Proactive work by Google library partners
• IMLS grant to develop framework and methodology for
validating content in large-scale digital repositories
• Crowd sourcing in our future?
Questions?
Perspectives on HathiTrust and public
services
HathiTrust and Reference
• HathiTrust: like Google and licensed databases
– very large, rich repositories of content, with
services supporting their use
• Reference librarians
– are intermediaries between all these resources
and researchers who use them
HathiTrust as a Reference Source
• HathiTrust is CONSTANTLY changing
• Requirement that’s not new to reference
librarians, but greatly increased:
Stay engaged. Read updates. Use it.
HathiTrust is DIFFERENT
• We are THE PRODUCERS of this resource
– HathiTrust is OUR COLLECTION
– New role - not recipient/grader/purchaser
– WE build this resource
• Close engagement of a sort we have not
experienced before
HathiTrust and Google Books
Fact: content in HathiTrust, by the numbers, is
currently largely a subset of Google Books
That’s how we started
BUT
It’s just the start
HathiTrust stands on its own: Content
• HathiTrust content has been curated over time
by librarians
– Mirrors collections of large research libraries
– Focus on quality
• Expanding Non-Google content
– Public Domain: Copyright Review Management
System
– Content from non-Google sources
• Internet Archive, image collections, government
Copyright Review Management
System
– IMLS grant awarded to University of Michigan in
2008 to determine copyright status of books
published in the US between 1923 and 1963
– Wisconsin, Minnesota and Indiana each devote 1
FTE to this effort for Phase 3, 2010-2011
– As of March 2011, over 125,000 volumes
reviewed; 54% opened up in HathiTrust
HathiTrust stands on its own: Functionality
HathiTrust supports scholarship
• Proper metadata
• User interface designed for scholarly work
• Services for people with visual impairments
• Large-scale text mining
HathiTrust stands on its own: Services
• Collection builder
• Member services (via Shibboleth logons)
– download full PDFs
– create permanent collections
How do people use HathiTrust?
• Of course, to read public domain books and
journals
• But much more
Use stories
“I now go to HathiTrust as my first destination
for in-depth reference questions. Fantastic
searchable corpus; good metadata; content
and functionality designed for scholarly
needs.”
Indiana University librarian
Use stories (2)
• Complete Works of Voltaire (52-volume set
published in late 19th century)
– scholar needed all volumes to do scholarly
referencing from home
– all in HathiTrust presented together under a single
MARC record
Use stories (3)
• Open Folklore – a new way to use HathiTrust
– Portal that provides access to open access
published and unpublished folklore literature
– Indiana University’s Folklore Collection first CIC
“Collection of Distinction” in Google
– HathiTrust – the “corner store” in the shopping
mall of digital repositories
– Anchor for whole set of services and initiatives,
including journal liberation projects
http://www.openfolklore.org
Questions?
Leveraging HathiTrust data
A bibliographic metadata moment
• Bib data for each digital volume must be present in
HathiTrust in order for volumes to be ingested
• Depositors make bib data available to UM to be
loaded into HathiTrust bibliographic management
system
• Info in the submitted bib records is used to make an
initial rights determination about each volume
• The bib record acts as a manifest for the digital
content that is then ingested
• A “snapshot in time” of the bib data associated with
an object is also stored in the preservation metadata
HathiTrust makes our data
available
Goal is to extend possibilities for development of
local services and other uses
• Bibliographic API
• Data API
• OAI feed of public domain
• “Hathifiles”
• 120,000 public domain texts for computational
research
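As one hedged illustration of the Bibliographic API, the sketch below builds a lookup URL for a known identifier and picks out the items a catalog could link to as full view. The slides do not describe the API. The URL pattern and the response fields used here (`items`, `itemURL`, `usRightsString`) are assumptions drawn from HathiTrust's public documentation and should be checked against the current version. The sample response is hand-written, not real data.

```python
import json
from urllib.request import urlopen

# Assumed base URL for the Bibliographic API; verify before relying on it.
BASE = "https://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type: str, value: str, level: str = "brief") -> str:
    """Build a lookup URL, e.g. for id_type 'oclc', 'isbn' or 'lccn'."""
    return f"{BASE}/{level}/{id_type}/{value}.json"


def fetch_record(id_type: str, value: str) -> dict:
    """Fetch and decode one response (needs network access; not called here)."""
    with urlopen(bib_api_url(id_type, value)) as resp:
        return json.load(resp)


def full_view_items(response: dict) -> list:
    """Return item URLs whose US rights string is 'Full view' (assumed value)."""
    return [
        item["itemURL"]
        for item in response.get("items", [])
        if item.get("usRightsString") == "Full view"
    ]


# Hand-written stand-in for a response; URLs and values are placeholders.
sample_response = {
    "records": {},
    "items": [
        {"itemURL": "https://example.org/item/1", "usRightsString": "Full view"},
        {"itemURL": "https://example.org/item/2", "usRightsString": "Limited (search-only)"},
    ],
}
print(full_view_items(sample_response))
```

A local catalog could run a lookup like this for each print record and add a link only when a full-view item comes back. That is roughly the kind of use described for Chicago in the examples that follow.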
Some examples of use
Catalogs
• UM loaded every record
• Chicago links to public domain volumes owned in print
• OCLC loaded records into WorldCat
Link resolvers
• UC created SFX target
Vendors
• H.W. Wilson databases linked to public domain volumes
Needed: A guide with examples of how partners have used
the data!
Future Directions (1)
• Locally-digitized partner content
• Usage reporting
• Coordinate digital and print resources
(holdings database)
• Computational Research
• Quality
• Strategies for openness
• Collaborative Development
• Extending Services through Shibboleth
• Non-book, non-journal content
Future Directions (2)
• Born-digital content (Publishing)
• New Bibliographic Management
• Compliance with TRAC
• Grant projects
• OCLC Catalog
• 3-year review
• Improvements to Large-scale Search
• Improvements to PageTurner
• Ingest Reporting
How can HathiTrust make a
difference?
• Digital Curation
– Drive costs down
– Reduce “bibliographic indeterminacy”
– Make meaningful decisions about formats and quality
– Increase discoverability
– Consolidate development talent
– Improve strength of archiving
• Print Curation
– Means to associate our print holdings
– Coordinated record-keeping
• Subsidiary benefits
– Quantify problems
– Collective attention to solving shared problems
How to find out more
• Web site “About” section:
http://www.hathitrust.org/about
• Twitter: http://twitter.com/hathitrust
• RSS: http://www.hathitrust.org/updates_rss
• Monthly newsletter:
http://www.hathitrust.org/updates
• Contact us: hathitrust-info@umich.edu
• Soon: Facebook, blog