PowerPoint - Columbia University Libraries

HATHI TRUST
A Shared Digital Repository
Columbia University and HathiTrust
Collaboration at a new level
Outline
• About HathiTrust
– Mission & Goals
• Background
• What we do (services)
– Objectives
• Governance
• Partnership & Resources
• Technology
• Future Directions
About
What is HathiTrust
Universal Digital Library
Common Goal
Single Entity but Partnership of Many Libraries
Goals
• Reliable and comprehensive archive of materials converted from print…co-owned
• Ensure the long-term preservation of content
• Improve access …to meet the needs of the co-owning institutions
• Coordinate shared storage strategies
• “public good” …sustaining the historical record
• Simultaneously …centralized …open
Background
History
• Michigan Digitization Project 2004
• “…U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered in cooperation with partner research libraries such as the institutions in the Digital Library Federation…”
History
• Collective agreement with CIC announced in June 2007
– U of Michigan and U of Wisconsin projects already underway
History
• In 2007, CIC agreed to establish a shared digital repository
• University of Michigan and Indiana University initial leaders of this effort
History
CIC Shared Digital Repository
HathiTrust
The Partners
• When announced in October 2008, partners included:
– University of California system
– CIC (Committee on Institutional Cooperation)
  University of Chicago
  University of Illinois
  Indiana University
  University of Iowa
  University of Michigan
  Michigan State University
  University of Minnesota
  Northwestern University
  Ohio State University
  Pennsylvania State University
  Purdue University
  University of Wisconsin-Madison
– University of Virginia
Columbia University
The Name
• The meaning behind the name
– Hathi (hah-tee)--Hindi for elephant
– Big, strong
– Never forgets, wise
– Secure
– Trustworthy
Content Distribution
5,317,545 - Total
764,103 - Public Domain
Content Growth
What we do
Services
Long-term preservation
• Bit-level preservation and migration
Rights management
• Rights database
• Copyright review
Google ingest
• Inbound validation
• Fixity checks
Access (within bounds of law and settlement)
• Viewing
• Redistribution
• Print disabilities
• Section 108
Bibliographic search
• Temporary catalog
• Version 1 permanent catalog April 2010
Publish virtual collections
• Collection Builder
Availability of data
• Metadata files
• Bib API
• Data API
Full-text search
• November 2009
Print on Demand
• UM public domain
• UM Press
Functional Objectives
• Improved performance
• Right now at UM only
• Plans to extend
PageTurner
Access for users with print disabilities
• Metadata files
• Bib API
• Data API
• Collection Builder
Publish virtual collections
• Identification at partner institutions
Branding
• IA-digitized
• locally-digitized
• Full-PDF download
• Collection Builder
• Section 108 (later on)
• Users with print disabilities (later on)
• Index optimization
• Ongoing hardware acquisition
Non-Google digitized print content
Extending services through Shibboleth
Improvements to large-scale search
• Beginning to investigate ePub as a delivery format
• Isilon software
• June 2010
• Including outstanding areas like disaster recovery
• Ongoing basis
• PageTurner
• Advanced search
• Search facets
• Collection Builder
Born-digital
Fixity checking
Compliance with TRAC
Collaborative Development Environment
Strategies for Openness
• Temporary catalog
• Version 1 permanent catalog April 2010
Public discovery interface
• Audio pilot
• Images (maps)
Non-book/non-journal content
• Research Center
• Data distribution
• Tools such as SEASR
Data mining tools
Governance
Governance
Budget/Finances
Decision-making
Policy
Planning
Strategic Advisory Board
Executive Committee
HathiTrust
Executive Committee
• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Director of Libraries, UW - Madison
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM
Strategic Advisory Board
• Ed Van Gemert (Chair), Director of Libraries, UW - Madison
• John Butler, Associate University Librarian for Information Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R. Bruce Miller, University Librarian, UC - Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)
Partnership & Resources
Partnership & Resources (1)
• Funded for an initial 5 years with base funding from partners
• Budget – separately held within UMich budget system, managed by the Executive Committee
• Cost Model – Per-GB cost of storage per year with a one-time fee on new content to build a capital fund
• Review in 3rd year of each 5-year period
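The cost model above can be read as simple arithmetic: a recurring charge proportional to stored gigabytes plus a one-time charge on newly deposited gigabytes. The sketch below is only an illustration of that reading. The function name, the rates, and the assumption that newly deposited content also counts toward stored gigabytes in the same year are all hypothetical; the slides give no actual figures.

```python
def annual_partner_cost(stored_gb, new_gb, rate_per_gb_year, one_time_fee_per_gb):
    """Return (recurring, one_time, total) for one partner and one year.

    stored_gb           -- total GB held for the partner (assumed to include new_gb)
    new_gb              -- GB deposited for the first time this year
    rate_per_gb_year    -- hypothetical recurring storage rate
    one_time_fee_per_gb -- hypothetical capital-fund fee on new content
    """
    recurring = stored_gb * rate_per_gb_year
    one_time = new_gb * one_time_fee_per_gb
    return recurring, one_time, recurring + one_time


# Made-up numbers, for illustration only.
recurring, one_time, total = annual_partner_cost(10_000, 2_000, 0.5, 0.25)
print(recurring, one_time, total)
```

In this reading the one-time fee applies only in the year content is first deposited, while the per-GB storage charge recurs every year.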
Partnership & Resources (2)
• Staff/Expertise – highly integrated
– Project managers, IT and communications staff, copyright experts, administrators (UM, Indiana and UC taking the lead)
• Working groups
• UM recently hired a Digital Preservation Librarian
• Shared development space
Governance
Budget, Finances
Decision-making
Policy
Enterprise Management
Repository Administration
Repository Administration
Communication and Coordination with partner institutions
Hardware configuration and maintenance
Data management (content storage, backup, integrity checks, deletion)
Project management
Planning
Web and application server configuration and maintenance
Security
Hardware selection and replacement
Content and Metadata specifications
Permissions
Rights Management
Bibliographic Data Management
Copyright determination
Entity description (record-level)
Copyright review
Object identification (item-level)
Copyright information management (database)
Data availability
Collection Development
Digital
• Expansion beyond books and journals (born-digital, images and maps, audio)
• Selection of content (for non-Google volume ingest and pilot projects)
Print
• Cloud Library (effect of digital on print)
Rightsholder permissions
Disaster Recovery
Logging
Processes for ensuring content integrity
e-Commerce
Print on Demand
Content Ingest
Content Access
Quality Assurance
User Services
Transformation
PageTurner
Quality Review
Usability
Validation
Collection Builder
Content Certification
User support (helpdesk)
Large-scale Search
Financial contributions of partners
Research Center
Bibliographic Catalog
APIs
Outreach
Project website
Monthly newsletter
Papers and presentations
HathiTrust Functional Framework
Communication with potential partners
Surveys, general inquiries
Repository evaluation and audit (e.g., DRAMBORA, TRAC)
Legal
Risk management (use of materials)
Partner agreements
Advocacy
Partnership & Resources (3)
• Toward a Cloud Library
– CLIR, Mellon Foundation
– OCLC Research, NYU, HathiTrust, Recap Libraries
• Objective: Characterize the near-term opportunity for externalizing management of academic research collections leveraging capacity of large-scale shared print and digital repositories*
• Outcomes: opportunity and risk assessment based on aggregate collection analysis; draft service agreement enabling generic consumer library to selectively outsource preservation and access of low-use research collections to large-scale print and digital repositories
*From the RLG Partner Update January 7, 2010
Partnership & Resources (4)
• CRL TRAC Audit
– Portico and HathiTrust assessments timely
– “Certification will augment CRL’s strategic archiving of print, and support a responsible transition to electronic-only formats where appropriate.”
– Work with UC to design shared print journal archiving effort
– “With this hybrid strategy CRL hopes to enable its community to accelerate the shift to electronic-only resources in a careful and responsible manner.”
* http://www.crl.edu/archiving-preservation/digitalarchives/certification-and-assessment-digital-repositories
Partnership & Resources (5)
• New cost model
• Based on benefits to institutions
– Public Domain
– In-copyright
• Volumes “held”
• Covered by Settlement
– Print replacement, users with print disabilities; research corpus
• Not
– Section 108; expand via authentication
Partnership & Resources (6)
• Timeline:
– Implement in 2013
– Accept new partners now with costs based on
overlap calculations
• Requirements:
– Print holdings database
– Update mechanisms
– Manual remediation
Partnership & Resources (7)
• Print holdings database will also benefit
– De-duplication
• Compromises user experience, obscures collection
development needs
– Management of print volumes
• Information to withdraw volumes (journals)
– Legal uses of copyright materials
• Section 108, 121, ADA uses will depend on knowledge of which institutions own(ed) which materials
Technology
Technology - OAIS
MARC record extensions
(Aleph)
Rights DB
GROOVE
(JHOVE)
Page Turner
HathiTrust API
OAI
GeoIP DB
CNRI Handles
[Solr]
Google
[OCA]
In-house Conversion
GRIN
Internal Data Loading
METS/PREMIS object
TIFF G4/JPEG2000
OCR
MD5 checksums
Isilon
Site Replication
TSM
MD5 checksum validation
METS object
PNG
OCR
PDF
Technology – Architecture
• Inbound validation, standards-based object storage and related metadata
• Storage in Ann Arbor and Indianapolis
• Encrypted backup to 3rd location
• Rights database for rights metadata
• Online catalog as source and storage for descriptive metadata
Technology - Ingest
• Automatic validation in GROOVE
– Check barcode check digit using Luhn algorithm
– Fixity check on JPEG2000, TIFF, UTF-8 using MD5
– Well-formedness and embedded metadata check on JPEG2000, TIFF, UTF-8 using JHOVE
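Two of the ingest checks above are standard algorithms and can be sketched in a few lines. The following is a minimal illustration, not GROOVE's actual code: the function names are hypothetical, the barcode is assumed to be an all-digit string whose last digit is the check digit, and the expected MD5 value is assumed to arrive as a hexadecimal string alongside the file.

```python
import hashlib


def luhn_is_valid(barcode: str) -> bool:
    """Return True if an all-digit barcode passes the Luhn check."""
    if not (barcode.isascii() and barcode.isdigit()):
        return False
    total = 0
    # Walk right to left; double every second digit, starting with the
    # digit immediately to the left of the check digit.
    for position, char in enumerate(reversed(barcode)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def md5_matches(data: bytes, expected_hex: str) -> bool:
    """Fixity check: compare the MD5 digest of data with a recorded value."""
    return hashlib.md5(data).hexdigest() == expected_hex.strip().lower()


print(luhn_is_valid("79927398713"))  # a commonly used Luhn test number
print(md5_matches(b"", "d41d8cd98f00b204e9800998ecf8427e"))  # MD5 of empty input
```

The well-formedness and embedded-metadata checks are delegated to JHOVE in the slides and are not reproduced here.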
Technology - Repository
• Simple filesystem layout
– One directory per volume, zip file and METS file
– Use of a namespace allows for conflicting identifiers
– Namespaces for institutions and, if needed, types of identifiers within the institution
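The layout described above can be illustrated with a small path-building sketch. Everything concrete in it is an assumption made for illustration: the root directory, the namespace strings `inst_a` and `inst_b`, the file-name pattern, and the rule that identifiers contain no path separators. The slides state only that each volume gets one directory holding a zip file and a METS file, and that namespaces keep conflicting identifiers apart.

```python
from pathlib import PurePosixPath


def volume_paths(root: str, namespace: str, identifier: str):
    """Return (zip_path, mets_path) for one volume.

    Hypothetical layout: <root>/<namespace>/<identifier>/ holds both files.
    """
    for part in (namespace, identifier):
        if not part or "/" in part:
            raise ValueError(f"not usable as a directory name: {part!r}")
    volume_dir = PurePosixPath(root) / namespace / identifier
    return volume_dir / f"{identifier}.zip", volume_dir / f"{identifier}.mets.xml"


# The same local identifier at two institutions maps to different directories.
zip_a, mets_a = volume_paths("/repo", "inst_a", "39015000000001")
zip_b, mets_b = volume_paths("/repo", "inst_b", "39015000000001")
print(zip_a)
print(zip_b)
```

Because the namespace is part of the path, two partners can reuse the same barcode or local ID without overwriting each other's volumes.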
Technology – METS Object
• Why METS?
– Can serve as an Archival Information Package and a Dissemination Information Package
– Designed to record the relationship between pieces of complex digital objects
– Can be created automatically as texts are loaded or reloaded
– Preservation actions (PREMIS)
Technology – METS Object
• What’s there?
– metsHdr with an ID and CREATEDATE
– 2 dmdSecs: MARCXML and mdRef
– amdSec containing one techMD with PREMIS metadata
– fileSec with 4 fileGrps (zip, images, OCR, hOCR)
– Physical structMap tying together files with metadata (page numbers and features)
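The section list above can be made concrete with a skeleton built using Python's standard `xml.etree.ElementTree`. This is a structural sketch only: the element names come from the METS schema, but the ID values, the `USE` labels, the `MDTYPE` and `LOCTYPE` values, and the function name are placeholders chosen for illustration, and the result is not validated against the METS schema or HathiTrust's own profile.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton(object_id: str, created: str) -> ET.Element:
    root = ET.Element(q("mets"), {"OBJID": object_id})

    # metsHdr with an ID and CREATEDATE
    ET.SubElement(root, q("metsHdr"), {"ID": "HDR1", "CREATEDATE": created})

    # Two dmdSecs: one wrapping MARCXML, one pointing elsewhere via mdRef
    dmd_marc = ET.SubElement(root, q("dmdSec"), {"ID": "DMD1"})
    ET.SubElement(dmd_marc, q("mdWrap"), {"MDTYPE": "MARC"})
    dmd_ref = ET.SubElement(root, q("dmdSec"), {"ID": "DMD2"})
    ET.SubElement(dmd_ref, q("mdRef"), {"MDTYPE": "MARC", "LOCTYPE": "OTHER"})

    # amdSec containing one techMD that would carry PREMIS metadata
    amd = ET.SubElement(root, q("amdSec"), {"ID": "AMD1"})
    tech = ET.SubElement(amd, q("techMD"), {"ID": "TMD1"})
    ET.SubElement(tech, q("mdWrap"), {"MDTYPE": "PREMIS"})

    # fileSec with four fileGrps (USE labels are placeholders)
    file_sec = ET.SubElement(root, q("fileSec"))
    for index, use in enumerate(("zip", "image", "ocr", "hocr"), start=1):
        ET.SubElement(file_sec, q("fileGrp"), {"ID": f"FG{index}", "USE": use})

    # Physical structMap; page-level divs would tie files to page metadata
    struct = ET.SubElement(root, q("structMap"), {"TYPE": "physical"})
    ET.SubElement(struct, q("div"), {"TYPE": "volume"})
    return root


root = build_mets_skeleton("inst_a.39015000000001", "2010-01-01T00:00:00")
print(ET.tostring(root, encoding="unicode"))
```

A real object would add `file` entries inside each `fileGrp` and one `div` per page, each with `fptr` elements pointing at the page's image, OCR, and hOCR files.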
Future Directions
Future Directions
• Partner Institutions
• Partner Institutions
• SAB working group
• SAB working group
• SAB working group
• SAB
Usage reporting
Holdings database
Quality
De-duplication
OCLC catalog
3-year review
• Research Center
• Data distribution
• Tools such as SEASR
• Wisconsin
• University of California
• University of California
• Full-PDF download
• Collection Builder
• Section 108 (later on)
• Users with print disabilities (later on)
Data mining tools
Ingest reporting
New bibliographic management
Content validation
Extending services through Shibboleth
• Data API
• IA-digitized
• locally-digitized
• Isilon software
• June 2010
• Including outstanding areas like disaster recovery
• ongoing areas
• PageTurner
• Advanced search
• Search facets
• Collection Builder
Non-Google digitized print content
Fixity checking
Compliance with TRAC
Collaborative Development Environment
• UC and GnuBook
• Partner Institutions
Improvements to PageTurner
• Beginning to investigate ePub as a delivery format
Born-digital
Strategies for Openness
• CB Integration
• Advanced search/facets
• Index optimization
• Ongoing hardware acquisition
Improvements to Large-scale Search
• Audio pilot
• Images (maps)
Non-book/non-journal content
• NSF EAGER
• Mellon Quality
Grant projects
Thank You!
jjyork@umich.edu
http://www.hathitrust.org