HathiTrust: A Big Idea with Bold Plans


Brenda Johnson, Dean of University Libraries

Gary Charbonneau, Systems Librarian

Julie Bobay, Associate Dean for Collection Development and Scholarly Communication

Statewide IT Conference, Indiana University

Sept. 27, 2010


HathiTrust - Outline

A Big Idea

• Mission and Goals; Partners; Governance

Content and Use

• Relationship to Google Books and Internet Archive

• Size, characteristics of content

• A few words about technology

Bold Plans


Importance of A Name

• Hathi (pronounced hah-tee)

Hindi word for elephant, an animal highly regarded for its memory, wisdom, and strength

• Trust

A core value of research libraries and one of their greatest assets. In combination, the words convey the key benefits researchers can expect from a first-of-its-kind shared digital repository

• There’s an elephant in the library.


What is HathiTrust?

• Started in 2008 as a partnership among research libraries, HathiTrust is an open web resource that aggregates, preserves and provides access to the collections of member libraries.

• Initial purpose was to provide a trusted shared repository for books and journals digitized by and available through Google Books and Internet Archive


Google Books/Internet Archive

• In 2004, Google began digitizing the books and journals of many major research libraries in the U.S., including, starting in 2008, IU’s

• Some libraries, including the University of California, had similar digitization projects with the Internet Archive

• Books and journals digitized from these projects were deposited in HathiTrust


Current HathiTrust Partners: 29 and Counting

Columbia University

Dartmouth College

University of California system (11 libraries)

CIC (Committee on Institutional Cooperation) (12 libraries)

University of Chicago

University of Minnesota

University of Illinois

Indiana University

University of Iowa

University of Michigan

Michigan State University

Northwestern University

Ohio State University

Pennsylvania State University

Purdue University

University of Wisconsin, Madison

New York Public Library

Princeton University

University of Virginia

Yale University


If Google and Internet Archive have these books, why do we need HathiTrust?

HathiTrust’s mission is much broader than simply to replicate Google Books:

Contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.


Why do we need HathiTrust? (1)

Preservation…For The Long Term

• Better entrusted to research libraries than to a private corporation, even a benevolent one

• Not just preserving bits

• Full preservation program, including active curation, metadata, migration, management plans, etc.

• Seeking TRAC certification (Trustworthy Repositories Audit and Certification)


Why do we need HathiTrust? (2)

Expanded access and discoverability

• Full-text access to pre-1923 books and journals, plus those which have had rights cleared

• Beyond full-text keyword search: enhanced discoverability options


Why do we need HathiTrust? (3)

Focus on scholarly values and needs

• Develop content, access and functionality that meet the needs of researchers

• Share the expertise and cost of preserving and providing access to the scholarly record among institutions that share this fundamental mission


HathiTrust: Getting Started

• Initial development responsibility: University of Michigan, with mirror site at IUPUI, administered by UITS Enterprise Infrastructure

• Much future development will be distributed among partner institutions under direction of the HathiTrust Executive Committee


A Unique Partnership

• HathiTrust is library work at scale; an early example of an “above-campus” service

• A new experiment in collaboration

Not a separate entity; not a 501(c)(3) like Sakai, Kuali, DuraSpace or many open source software projects

Instead, a jointly funded, jointly governed, jointly developed partnership.

• Together, we are HathiTrust.


Sustainability: HathiTrust Governance 2008-2012

• Executive Committee

Budget, finances, decision making

• Strategic Advisory Board

Guidance on policy and planning

• HathiTrust staff

• Working groups and committees


Current Working Groups

• Discovery Interface

• Collections

• Quality

• Communication

• Usability

• Storage

• Development Environment

• Research Center

HathiTrust Functional Framework

Governance

• Budget, finances

• Decision-making

• Policy

• Planning

Enterprise Management

• Communication and coordination with partner institutions

• Project management

• e-Commerce

• Print on Demand

• Financial contributions of partners

Content Ingest

• Transformation

• Validation

Repository Administration

• Hardware configuration and maintenance

• Web and application server configuration and maintenance

• Security

• Permissions

• Logging

• Data management (content storage, backup, integrity checks, deletion)

• Hardware selection and replacement

• Content and metadata specifications

• Disaster recovery

• Processes for ensuring content integrity

• Quality assurance

Rights Management

• Copyright determination

• Copyright review

• Copyright information management (database)

• Rightsholder permissions

Bibliographic Data Management

• Entity description (record-level)

• Object identification (item-level)

• Data availability

Content Access, User Services, Outreach

• PageTurner

• Quality Review

• Usability

• Project website

• Collection Builder

• Content Certification

• User support (helpdesk)

• Monthly newsletter

• Large-scale Search

• Research Center

• Bibliographic Catalog

• APIs

• Papers and presentations

• Communication with potential partners

• Surveys, general inquiries

• Repository evaluation and audit (e.g., DRAMBORA, TRAC)

Collection Development

• Digital

– Expansion beyond books and journals (born-digital, images and maps, audio)

– Selection of content (for non-Google volume ingest and pilot projects)

• Print

– Cloud Library (effect of digital on print)

Legal

• Risk management (use of materials)

• Partner agreements

• Advocacy

Next steps in governance

• 5-year agreements, reviewed in the third year of every term

• First Constitutional Convention will be in 2012

• Partners will determine governance structures and partnership models, effective 2013


Focus On Users

• Preservation…with access

• Benefits to IU researchers and their colleagues around the world:

– Ensure long-term preservation and access

– Increase discoverability

– Create scholarly tools

– Expand content beyond Google and Internet Archive


HathiTrust – constantly changing

• Rapid growth and development; fluid environment

• The next few slides describe HathiTrust as it currently stands

• A discussion of future plans will follow


HathiTrust - Content

• The vast majority of what is currently in HathiTrust consists of files received from Google from volumes digitized by Google for Google Book Search

• Almost all of the remainder consists of files received from Internet Archive. Much of the content from University of California comes by way of Internet Archive


HathiTrust Content (2)

• Since not all of Google’s “library partners” are members of HathiTrust, and none of Google’s publisher partners are, HathiTrust is still (mostly) a subset of what is in Google Book Search.

However…


HathiTrust Content (3)

• Because of HathiTrust’s copyright clearance project, there are some things available in full text in HathiTrust that are only available in “snippet view” in Google.

• Because of Internet Archive, there are probably some things in HathiTrust that are not available in Google at all.


HathiTrust - focus on collections

• HathiTrust is about collections, not simply Google digitization

• For example:

– access for persons with print disabilities

– opening access for public domain volumes

– collection building tool

– high-quality bibliographic data necessary for scholarly work

Content Growth [slide graphic not reproduced]

Content Distribution [slide graphic not reproduced]

Language Distribution (1) [slide graphic not reproduced]

Language Distribution (2) [slide graphic not reproduced]

Dates [slide graphic not reproduced]

Originating Institution [slide graphic not reproduced]

Content Over Time [slide graphic not reproduced]

HathiTrust DataGrid

• Using Isilon Clustered Storage System

• Similar principles to a datagrid using WAFS (OneFS)

– Wide Area File System (2.3 PB per file system)

– Automated data replication among nodes

– Currently Two Nodes

• Ann Arbor - University of Michigan

• Indianapolis – Indiana University NOC

• Connected via I-Light and Michigan LambdaRail

HathiTrust Grid

[Diagram: storage nodes at Ann Arbor and Indianapolis]

Isilon OneFS currently supports up to 2.3 PB between two nodes

More on HathiTrust Technology

http://www.hathitrust.org/technology


A Use Case

• IUB scholar needed quick access to a definitive 52-volume set of Voltaire’s works published in the late 1800s; deadline approaching

• The set had been transferred to the Auxiliary Library Facility

• Available in HathiTrust and Google Books

• Google Books not usable for this scholarly purpose

• Able to do work much more efficiently and quickly in HathiTrust


HathiTrust’s Bold Plans

• We believe the HathiTrust of tomorrow will look very different from the HathiTrust of today

• Volumes digitized by Google and Internet Archive are just the beginning

• The sky’s the limit (or, more accurately, the combined will and resources of the partnership are the limit)


Vision for the future: More Content

• Current and backlist scholarly monographs

• Born-digital materials

• Some locally-digitized collections

• Some non-book/non-journal resources

…anything that is appropriate for a research library collection AND IS A SHARED PRIORITY FOR PARTNERS


Vision for the future: More Content (2)

• More full-text:

Google Book Settlement - if approved:

– could receive all Google-digitized files to preserve

– could make much more full-text available

• Rights-clearing project - open access to public domain materials


Vision for the Future: More Functionality

• Research tools

– Computational research

– Advanced collection builders

– Advanced discovery

• Expanded quality processes

• Rigorous preservation guarantees

• Defining paths for fair uses

• Tools for shared print collection management


Vision for the Future: Enhanced Discoverability

• Not just keyword searching of full-text

• Highly functional bibliographic access

– HathiTrust catalog

– Integration into other discovery tools: IUCAT, WorldCat, Discovery Services


HathiTrust and local digital library initiatives

• HathiTrust is a solution for large-scale, shared high-priority needs of partners; currently optimized for digitized monographs and journals

• Partners will identify priorities for content and functionality development

• HathiTrust will not supplant all institutionally based digital library initiatives

• Local digital library collections and services will still be needed


How Can HathiTrust Make a Difference?

• Future not yet known precisely, but…

• For the first time in history, HathiTrust has:

– defined a large-scale partnership to achieve a large-scale goal

– built the first version of a very large, high-quality shared repository

• These are building blocks for ensuring that research collections, print and digital:

– are preserved, curated, highly discoverable and accessible

– retain their research value in a digital platform


Some lessons learned so far

• HathiTrust can serve as shared repository for mass digitized library collections

• HathiTrust can provide organizational structure for other collaborations

– Shared print collection management

– Bibliographic integration

• The research library community is able to collaborate deeply to attain shared goals


HathiTrust Mission - redux

Contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.


Credits

Our thanks to colleagues who generously granted us permission to use their slides for this presentation:

John Wilkin, HathiTrust Executive Director

Jeremy York, HathiTrust Project Librarian

Heather Christenson, Mass Digitization Project Manager, California Digital Library

Also, many of the ideas for this presentation are based on:

Courant, Paul N. and John Wilkin. “Building ‘Above Campus’ Library Services.” Educause Review, July/August 2010, 74-75.
