Brenda Johnson, Dean of University Libraries
Gary Charbonneau, Systems Librarian
Julie Bobay, Associate Dean for Collection Development and Scholarly Communication
Statewide IT Conference, Indiana University
Sept. 27, 2010
A Big Idea
• Mission and Goals; Partners; Governance
Content and Use
• Relationship to Google Books and Internet Archive
• Size, characteristics of content
• A few words about technology
Bold Plans
• Hathi (pronounced hah-tee)
Hindi word for elephant, an animal highly regarded for its memory, wisdom, and strength
• Trust
A core value of research libraries and one of their greatest assets. In combination, the words convey the key benefits researchers can expect from a first-of-its-kind shared digital repository
• There’s an elephant in the library.
• Started in 2008 as a partnership among research libraries, HathiTrust is an open web resource that aggregates, preserves and provides access to the collections of member libraries.
• Initial purpose was to provide a trusted shared repository for books and journals digitized by and available through Google Books and Internet Archive
• In 2004, Google began digitizing the books and journals from many major research libraries in the U.S., including, starting in 2008, IU’s
• Some libraries, including the University of California, had similar digitization projects with the Internet Archive
• Books and journals digitized from these projects were deposited in HathiTrust
Columbia University
Dartmouth College
University of California system (11 libraries)
CIC (Committee on Institutional Cooperation) (12 libraries)
University of Chicago
University of Minnesota
University of Illinois
Indiana University
University of Iowa
University of Michigan
Michigan State University
Northwestern University
Ohio State University
Pennsylvania State University
Purdue University
University of Wisconsin, Madison
New York Public Library
Princeton University
University of Virginia
Yale University
If Google and Internet Archive have these books, why do we need HathiTrust?
HathiTrust’s mission is much broader than simply to replicate Google Books:
Contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.
Preservation…For The Long Term
• Better entrusted to research libraries than to a private corporation, even a benevolent one
• Not just preserving bits
• Full preservation program, including active curation, metadata, migration, management plans, etc.
• Seeking TRAC certification (Trustworthy Repositories Audit & Certification)
Expanded access and discoverability
• Full-text access to pre-1923 books and journals, plus those whose rights have been cleared
• Beyond full-text keyword search: enhanced discoverability options
Focus on scholarly values and needs
• Develop content, access and functionality that meet the needs of researchers
• Share expertise and the cost of preserving and providing access to the scholarly record among institutions that share this fundamental mission
• Initial development responsibility: University of Michigan, with mirror site at IUPUI, administered by UITS Enterprise Infrastructure
• Much future development will be distributed among partner institutions under direction of HathiTrust Executive Committee
• HathiTrust is library work at scale; an early example of an “above-campus” service
• A new experiment in collaboration
Not a separate entity; not a 501(c)(3) like Sakai, Kuali, DuraSpace or many open source software projects
Instead, a jointly funded, jointly governed, jointly developed partnership.
• Together, we are HathiTrust.
• Executive Committee
Budget, finances, decision making
• Strategic Advisory Board
Guidance on policy and planning
• HathiTrust staff
• Working groups and committees
• Discovery Interface
• Collections
• Quality
• Communication
• Usability
• Storage
• Development Environment
• Research Center
HathiTrust Functional Framework
Governance
Budget, Finances
Decision-making
Policy
Planning
Enterprise Management
Communication and Coordination with partner institutions
Project management
e-Commerce
Print on Demand
Content Ingest
Financial contributions of partners
Transformation
Validation
Repository Administration
Hardware configuration and maintenance
Web and application server configuration and maintenance
Security
Permissions
Logging
Repository Administration
Data management (content storage, backup, integrity checks, deletion)
Hardware selection and replacement
Content and Metadata specifications
Disaster Recovery
Processes for ensuring content integrity
Quality Assurance
Rights Management
Copyright determination
Copyright review
Copyright information management (database)
Rightsholder permissions
Bibliographic Data Management
Entity description (record-level)
Object identification (item-level)
Data availability
Content Access
User Services
Outreach
PageTurner
Quality Review
Usability
Project website
Collection Builder
Content Certification
User support (helpdesk)
Monthly newsletter
Large-scale Search
Research Center
Bibliographic Catalog
APIs
Papers and presentations
Communication with potential partners
Surveys, general inquiries
Repository evaluation and audit (e.g., DRAMBORA, TRAC)
Collection Development
Digital
• Expansion beyond books and journals (born-digital, images and maps, audio)
• Selection of content (for non-Google volume ingest and pilot projects)
• Cloud Library (effect of digital on print)
Legal
Risk management (use of materials)
Partner agreements
Advocacy
• 5-year agreements, reviewed in the third year of every term
• First Constitutional Convention will be in 2012
• Partners will determine governance structures and partnership models, effective 2013
• Preservation…with access
• Benefits to IU researchers and their colleagues around the world:
– Ensure long-term preservation and access
– Increase discoverability
– Create scholarly tools
– Expand content beyond Google and Internet Archive
• Rapid growth and development; fluid environment
• The next few slides describe HathiTrust as it is today
• A discussion of future plans follows
• The vast majority of what is currently in HathiTrust consists of files received from Google from volumes digitized by Google for Google Book Search
• Almost all of the remainder consists of files received from Internet Archive. Much of the content from University of California comes by way of Internet Archive
• Since not all of Google’s “library partners” are members of HathiTrust, and none of Google’s publisher partners are, HathiTrust is still (mostly) a subset of what is in Google Book Search.
However…
• Because of HathiTrust’s copyright clearance project, there are some things available in full text in HathiTrust that are only available in “snippet view” in Google.
• Because of Internet Archive, there are probably some things in HathiTrust that are not available in Google at all.
• HathiTrust is about collections, not simply Google digitization
• For example:
– access for persons with print disabilities
– opening access for public domain volumes
– collection building tool
– high-quality bibliographic data necessary for scholarly work
• Using Isilon Clustered Storage System
• Similar principles to a datagrid using WAFS (OneFS)
– Wide Area File System (2.3 PB per file system)
– Automated data replication among nodes
– Currently Two Nodes
• Ann Arbor - University of Michigan
• Indianapolis – Indiana University NOC
• Connected via I-Light and Michigan Lambda Rail
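The slides do not describe how the two storage nodes are verified against each other. As an illustration only, the general idea behind an integrity check across replicas can be sketched in Python as below. This is not HathiTrust's or Isilon's actual mechanism; the function names and the directory-per-site layout are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_replicas(primary: Path, mirror: Path) -> dict[str, list[str]]:
    """Hypothetical helper: list files missing, extra, or different on the mirror."""
    a, b = manifest(primary), manifest(mirror)
    return {
        "missing_on_mirror": sorted(a.keys() - b.keys()),
        "extra_on_mirror": sorted(b.keys() - a.keys()),
        "checksum_mismatch": sorted(
            k for k in a.keys() & b.keys() if a[k] != b[k]
        ),
    }
```

Calling `compare_replicas(Path("site_a"), Path("site_b"))` on two local directories (placeholder names) returns three sorted lists of relative paths. An empty report means the two copies agree file for file.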
[Diagram: Isilon OneFS storage spanning Indianapolis and Ann Arbor; currently supports up to 2.3 PB between two nodes]
http://www.hathitrust.org/technology
• IUB scholar needed quick access to a definitive 52-volume set of Voltaire’s work published in the late 1800s; deadline approaching
• Had been transferred to the Auxiliary Library Facility
• Available in HathiTrust and Google Books
• Google Books not usable for this scholarly purpose
• Able to do work much more efficiently and quickly in HathiTrust
• We believe the HathiTrust of tomorrow will look very different from the HathiTrust of today
• Volumes digitized by Google and Internet Archive are just the beginning
• The sky’s the limit (or, more accurately, the combined will and resources of the partnership are the limit)
• Current and backlist scholarly monographs
• Born-digital materials
• Some locally-digitized collections
• Some non-book/non-journal resources
…anything that is appropriate for a research library collection AND IS A SHARED PRIORITY FOR PARTNERS
• More full-text:
Google Book Settlement - if approved:
– could receive all Google-digitized files to preserve
– could make much more full-text available
• Rights-clearing project - open access to public domain materials
• Research tools
– Computational research
– Advanced collection builders
– Advanced discovery
• Expanded quality processes
• Rigorous preservation guarantees
• Defining paths for fair uses
• Tools for shared print collection management
• Not just keyword searching of full-text
• Highly-functional bibliographic access
- HathiTrust catalog
- Integration into other discovery tools: IUCAT, WorldCat, Discovery Services
• HathiTrust is a solution for large-scale, shared high-priority needs of partners; currently optimized for digitized monographs and journals
• Partners will identify priorities for content and functionality development
• HathiTrust will not supplant all institutionally based digital library initiatives
• Local digital library collections and services will still be needed
• Future not yet known precisely, but…
• For the first time in history, HathiTrust has:
- defined a large-scale partnership to achieve a large-scale goal
- built the first version of a very large, high-quality shared repository
• Building blocks for ensuring that research collections, print and digital:
• are preserved, curated, highly discoverable and accessible
• retain their research value in a digital platform
• HathiTrust can serve as shared repository for mass digitized library collections
• HathiTrust can provide organizational structure for other collaborations
– Shared print collection management
– Bibliographic integration
• The research library community is able to collaborate deeply to attain shared goals
Contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.
Our thanks to colleagues who generously granted us permission to use their slides for this presentation:
John Wilkin, HathiTrust Executive Director
Jeremy York, HathiTrust Project Librarian
Heather Christenson, Mass Digitization Project Manager, California Digital Library
Also, many of the ideas for this presentation are based on:
Courant, Paul N. and John Wilkin. “Building ‘Above Campus’ Library Services.” Educause Review, July/August 2010, 74-75.