HathiTrust Digital Library
HATHI TRUST
A Shared Digital Repository
Cooperation for Preservation
Outline
• About HathiTrust
– Mission & Goals
• Background
• What we do
– Services
• How we do it
– Governance
– Partnership & Resources
– Technology
• Future Directions
About
What is HathiTrust
• Shared Digital Repository
– Launched 2008 by 25 institutions (now 26)
– Initial focus on digitized book and journal content
– Expanding to non-book/non-journal, born digital
– “Light” archive
• Collaboration
– Preservation and access
– Print collections
– Local services
– Public Good
Background
History
• Michigan Digitization Project 2004
• “…U of M shall have the right to use the U of
M Digital Copy, in whole or in part at U of M's
sole discretion, as part of services offered in
cooperation with partner research libraries
such as the institutions in the Digital Library
Federation…”
History
• Collective Agreement with CIC Announced in
June 2007
• CIC agreed to establish a shared digital
repository
History
CIC Shared Digital Repository
HathiTrust
The Partners
• When announced in October 2008, partners included:
– University of California system
– CIC (Committee on Institutional Cooperation)
   University of Chicago
   University of Illinois
   Indiana University
   University of Iowa
   University of Michigan
   Michigan State University
   University of Minnesota
   Northwestern University
   Ohio State University
   Pennsylvania State University
   Purdue University
   University of Wisconsin-Madison
– University of Virginia
– Columbia University
The Name
• The meaning behind the name
– Hathi (hah-tee)--Hindi for elephant
– Big, strong
– Never forgets, wise
– Secure
– Trustworthy
Content Distribution
As of February 1:
5,323,716 - Total
764,481 - Public Domain
Content Growth
What we do
Services
• Long-term preservation
– Bit-level preservation and migration
• Access
– Viewing
– Redistribution
– Print disabilities
– Section 108
• Google ingest
– Inbound validation
– Fixity checks
• Rights management
– Rights database
– Copyright review
• Publish virtual collections
– Collection Builder
• Availability of data
– Metadata files
– Bib API
– Data API
• Bibliographic search
– Temporary catalog
– Version 1 permanent catalog (April 2010)
• Full-text search
– November 2009
• Print on Demand
– UM public domain
– UM Press
How we do it
Governance
Budget/Finances
Decision-making
Policy
Planning
Strategic Advisory Board
Executive Committee
HathiTrust
Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Director of Libraries, UW - Madison
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and
Associate University Librarian, LIT, UM
Strategic Advisory Board
• Ed Van Gemert (Chair), Director of Libraries, UW - Madison
• John Butler, Associate University Librarian for Information
Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R. Bruce Miller, University Librarian, UC - Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)
Partnership & Resources (1)
• Funded for an initial 5 years with
base funding from partners
• Budget – separately held within
UMich budget system, managed
by the Executive Committee
• Cost Model – Per GB cost of storage per year with a
one-time fee on new content to build a capital fund
• Review in the 3rd year of each 5-year period
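The cost model above comes down to simple arithmetic. The following is a minimal sketch only: it assumes the one-time fee is also assessed per GB (the slide does not say how it is assessed), and both rates are hypothetical placeholders, not HathiTrust's actual figures.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    # Both rates are hypothetical placeholders, not HathiTrust's actual figures.
    storage_rate_per_gb_year: float  # recurring charge per GB stored, per year
    capital_fee_per_new_gb: float    # one-time charge per GB of new content (assumed per GB)

    def annual_charge(self, stored_gb: float, new_gb: float) -> dict:
        """Split one partner's charge for one year into its two components.

        stored_gb is assumed to be the total held in the repository that year
        (including the new content); new_gb is what was added during the year.
        """
        storage = stored_gb * self.storage_rate_per_gb_year
        capital = new_gb * self.capital_fee_per_new_gb
        return {"storage": storage, "capital_fund": capital, "total": storage + capital}
```

For example, with a made-up storage rate of 2.0 and a made-up fee of 0.5, `CostModel(2.0, 0.5).annual_charge(1000, 200)` yields a storage component of 2000.0, a capital fund component of 100.0, and a total of 2100.0.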
Partnership & Resources (2)
• Staff/Expertise – highly integrated
– Project managers, IT and communications
staff, copyright experts, administrators (UM,
Indiana and UC taking the lead)
• Working groups
• UM recently hired a Digital Preservation Librarian
• Shared development space
HathiTrust Functional Framework
Governance
Budget, Finances
Decision-making
Policy
Enterprise Management
Repository Administration
Communication and Coordination with partner institutions
Hardware configuration and maintenance
Data management (content storage, backup, integrity checks, deletion)
Project management
Planning
Web and application server configuration and maintenance
Security
Hardware selection and replacement
Content and Metadata specifications
Permissions
Rights Management
Bibliographic Data Management
Copyright determination
Entity description (record-level)
Copyright review
Object identification (item-level)
Copyright information management (database)
Data availability
Collection Development
Digital
• Expansion beyond books and journals (born-digital, images and maps, audio)
• Selection of content (for non-Google volume ingest and pilot projects)
Print
• Cloud Library (effect of digital on print)
Rightsholder permissions
Disaster Recovery
Logging
Processes for ensuring content integrity
e-Commerce
Print on Demand
Content Ingest
Content Access
Quality Assurance
User Services
Transformation
PageTurner
Quality Review
Usability
Validation
Collection Builder
Content Certification
User support (helpdesk)
Large-scale Search
Financial contributions of partners
Research Center
Bibliographic Catalog
APIs
Outreach
Project website
Monthly newsletter
Papers and presentations
Communication with potential partners
Surveys, general inquiries
Repository evaluation and audit (e.g., DRAMBORA, TRAC)
Legal
Risk management (use of materials)
Partner agreements
Advocacy
Partnership & Resources (3)
• Toward a Cloud Library
– CLIR, Mellon Foundation
– OCLC Research, NYU, HathiTrust, ReCAP libraries
• Objective: Characterize the near-term opportunity for externalizing
management of academic research collections leveraging capacity
of large-scale shared print and digital repositories*
• Outcomes: opportunity and risk assessment based on aggregate
collection analysis; draft service agreement enabling generic
consumer library to selectively outsource preservation and access
of low-use research collections to large-scale print and digital
repositories
*From the RLG Partner Update January 7, 2010
Partnership & Resources (4)
• CRL TRAC Audit
– Portico and HathiTrust assessments timely
– “Certification will augment CRL’s strategic archiving of
print, and support a responsible transition to electronic-only formats where appropriate.”
– Work with UC to design shared print journal archiving
effort
– “With this hybrid strategy CRL hopes to enable its
community to accelerate the shift to electronic-only
resources in a careful and responsible manner.”
* http://www.crl.edu/archiving-preservation/digitalarchives/certification-and-assessment-digital-repositories
Partnership & Resources (5)
• New cost model
• Based on benefits to institutions
– Public Domain
– In-copyright
• Volumes “held”
Partnership & Resources (6)
• Timeline:
– Implement in 2013
– Accept new partners now with costs based on
overlap calculations
• Requirements:
– Print holdings database
– Update mechanisms
– Manual remediation
Technology - OAIS
MARC record extensions (Aleph)
Rights DB
GROOVE (JHOVE)
Page Turner
HathiTrust API
OAI
GeoIP DB
CNRI Handles
[Solr]
Google
[OCA]
In-house Conversion
GRIN
Internal Data Loading
METS/PREMIS object
TIFF G4/JPEG2000
OCR
MD5 checksums
Isilon
Site Replication
TSM
MD5 checksum validation
METS object
PNG
OCR
PDF
Technology – Architecture
• Inbound validation, standards-based
object storage and related metadata
• Storage in Ann Arbor and Indianapolis
• Encrypted backup to 3rd location
• Rights database for rights metadata
• Online catalog as source and storage for descriptive
metadata
Technology - Ingest
• Automatic validation in GROOVE
– Check barcode check digit using Luhn algorithm
– Fixity check on JPEG 2000, TIFF, UTF-8 using MD5
– Well-formedness and embedded metadata check
on JPEG 2000, TIFF, UTF-8 using JHOVE
• Creation of METS and PREMIS
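The first two validation steps above can be sketched in a few lines. This is an illustration, not GROOVE's actual code: it assumes the barcode is a plain string of ASCII digits whose last digit is a standard Luhn check digit (individual libraries may use variants), and the function names are hypothetical.

```python
import hashlib


def luhn_valid(barcode: str) -> bool:
    """Return True if the last digit is a valid Luhn check digit for the rest."""
    if not (barcode.isascii() and barcode.isdigit()) or len(barcode) < 2:
        return False
    total = 0
    # Walk right to left; the check digit sits at position 0 and is not doubled.
    for position, char in enumerate(reversed(barcode)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


def md5_matches(data: bytes, expected_hex: str) -> bool:
    """Fixity check: recompute MD5 over the bytes and compare with the recorded value."""
    return hashlib.md5(data).hexdigest() == expected_hex.lower()
```

A real ingest pipeline would more likely stream each image or text file through `hashlib.md5()` in chunks rather than hold it in memory; the comparison logic stays the same.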
Technology - Repository
• Isilon storage
• Simple filesystem layout
– One directory per volume, zip file and METS file
– Use of a namespace allows for conflicting
identifiers
– Namespaces for institutions and, if needed, types
of identifiers within the institution
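To show how a namespace keeps conflicting identifiers apart in a layout like the one above, here is a minimal sketch. The identifier form `<namespace>.<local id>`, the root directory, and the file names are all assumptions for illustration; the slide only states that each volume has one directory holding a zip file and a METS file.

```python
from pathlib import PurePosixPath


def volume_paths(root: str, volume_id: str) -> dict:
    """Map a namespaced identifier to its directory, zip file, and METS file.

    The '<namespace>.<local id>' form and the file naming are hypothetical.
    """
    # Split on the first dot only, so a local identifier may itself contain dots.
    namespace, sep, local_id = volume_id.partition(".")
    if not sep or not namespace or not local_id:
        raise ValueError(f"expected '<namespace>.<local id>', got {volume_id!r}")
    directory = PurePosixPath(root) / namespace / local_id
    return {
        "dir": directory,
        "zip": directory / f"{local_id}.zip",
        "mets": directory / f"{local_id}.mets.xml",
    }
```

Two institutions that both use the local identifier `12345` end up in different directories, for example `/repo/libA/12345` and `/repo/libB/12345` for the made-up namespaces `libA` and `libB`. Local identifiers containing path separators would need escaping, which this sketch does not attempt.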
Technology – METS Object
• Why METS?
– Can serve as Archival Information
Package and a Dissemination
Information Package
– Designed to record the relationship between
pieces of complex digital objects
– Can be created automatically as texts are loaded
or reloaded
– Preservation actions (PREMIS)
Technology – METS Object
• What’s there?
– metsHdr with an ID and CREATEDATE
– 2 dmdSecs: MARCXML and mdRef
– amdSec containing one techMD with PREMIS
metadata
– fileSec with 4 fileGrps (zip, images, OCR, hOCR)
– Physical structMap tying together files with
metadata (pg. numbers and features)
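The sections listed above can be pictured as an empty METS skeleton. This sketch uses Python's standard `xml.etree.ElementTree`; the section IDs, the `USE` labels, and the function name are placeholders, and the result is a structural outline only, not a schema-valid HathiTrust METS file (it has no file entries, no PREMIS content, and no `xlink:href` on the mdRef).

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("METS", METS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton(header_id: str, created: str) -> ET.Element:
    """Build the bare section structure; all IDs and USE labels are placeholders."""
    mets = ET.Element(q("mets"))
    ET.SubElement(mets, q("metsHdr"), {"ID": header_id, "CREATEDATE": created})

    # Two descriptive sections: embedded MARCXML and a reference to an external record.
    marc = ET.SubElement(mets, q("dmdSec"), {"ID": "DMD1"})
    ET.SubElement(marc, q("mdWrap"), {"MDTYPE": "MARC"})
    ref = ET.SubElement(mets, q("dmdSec"), {"ID": "DMD2"})
    ET.SubElement(ref, q("mdRef"), {"MDTYPE": "MARC", "LOCTYPE": "OTHER"})

    # One administrative section with a single techMD that would wrap PREMIS.
    amd = ET.SubElement(mets, q("amdSec"))
    tech = ET.SubElement(amd, q("techMD"), {"ID": "TMD1"})
    ET.SubElement(tech, q("mdWrap"), {"MDTYPE": "PREMIS"})

    # Four file groups, one per kind of content named on the slide.
    file_sec = ET.SubElement(mets, q("fileSec"))
    for use in ("zip", "images", "OCR", "hOCR"):
        ET.SubElement(file_sec, q("fileGrp"), {"USE": use})

    # The physical structMap would tie page files to page numbers and features.
    ET.SubElement(mets, q("structMap"), {"TYPE": "physical"})
    return mets
```

Calling `ET.tostring(build_mets_skeleton("HDR1", "2010-02-01T00:00:00"), encoding="unicode")` prints the outline, which makes it easy to see how one document can carry descriptive, preservation, file, and structural information together.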
Future Directions
Future Directions (1)
• 3-year review
– SAB
• OCLC catalog
– SAB
• Quality
– SAB
• Deduplication
– SAB
• TRAC compliance
– Current and ongoing areas
• Non-Google print content
– IA-digitized
– locally-digitized
• Nonbook/nonjournal
– Audio pilot
– Images (maps)
• Born-digital
– Beginning to investigate ePub as a delivery format
• Openness
– Data API
• Shibboleth
– Full-PDF
– Collection Builder
– Section 108
– Users with print disabilities
Future Directions (2)
• Collaborative Development
– PageTurner
– Advanced search
– Search facets
– Collection Builder
• Fixity checking
– Isilon software
– June 2010
• Large-scale Search
– CB Integration
– Advanced search
– Index optimizing
– New hardware
• Ingest reporting
– Wisconsin
• Bibliographic management
– University of California
• Content validation
– University of California
• Grant projects
– NSF EAGER
– Mellon Quality
• Usage reporting
– Partner Institutions
• Holdings database
– Partner Institutions
• Data mining tools
– Research Center
– Data distribution
– Tools such as SEASR
Links
• Catalog, Full-text search, and Collection Builder
– http://catalog.hathitrust.org
• METS and PREMIS implementation
– http://www.hathitrust.org/preservation
• Technical profile:
– http://www.hathitrust.org/technology
• Technical flow diagram
– http://www.hathitrust.org/documents/HathiTrust-PASIG-200910.pdf
– http://www.hathitrust.org/documents/HathiTrust-PASIG-notes200910.pdf
• Rights management
– http://www.hathitrust.org/rights_management
• TRAC
– http://www.hathitrust.org/accountability
Thank You!
hathitrust-info@umich.edu
jjyork@umich.edu
http://www.hathitrust.org