The HathiTrust Research Center: Building Shared Computational Resources to Mine the Largest Digital Library
John Unsworth – Brandeis University
Room 421, Snell Library, Northeastern University
DPLAFest 2013, 11:00-12:30, October 25, 2013

HathiTrust Partnership
Allegheny College; Arizona State University; Baylor University; Boston College; Boston University; Brandeis University; California Digital Library; Carnegie Mellon University; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Florida State University; Getty Research Institute; Harvard University Library; Indiana University; Iowa State University; Johns Hopkins University; Lafayette College; Library of Congress; Massachusetts Institute of Technology; McGill University; Michigan State University; New York Public Library; New York University; North Carolina Central University; North Carolina State University; Northwestern University; The Ohio State University; The Pennsylvania State University; Princeton University; Purdue University; Stanford University; Syracuse University; Texas A&M University; Tufts University; Universidad Complutense de Madrid; University of Alabama; University of Arizona; University of Calgary; University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz); The University of Chicago; University of Connecticut; University of Delaware; University of Florida; University of Illinois at Chicago; The University of Iowa; University of Maryland; University of Miami; University of Michigan; University of Minnesota; University of Missouri; University of Nebraska-Lincoln; The University of North Carolina at Chapel Hill; University of Notre Dame; University of Oklahoma; University of Pennsylvania; University of Pittsburgh; University of Utah; University of Virginia; University of Washington; University of Wisconsin-Madison; Utah State University; Virginia Tech; Wake Forest University; Washington University; Yale University Library

Tweet Us: #HTRC #SESS037 #EDU13
http://www.hathitrust.org/htrc

HathiTrust Mission
To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.

HathiTrust Services
• Long-term preservation
  – Bit-level and migration
• Bibliographic search
• Full-text search
• Reading and download capabilities
• Print on demand
• Collections
• Datasets
• HathiTrust Research Center

HathiTrust By The Numbers
• 10,819,596 total volumes
• 5,672,046 book titles
• 281,890 serial titles
• 3,786,858,600 pages
• 485 terabytes
• 128 miles
• 8,791 tons
• 3,469,225 volumes (~32% of total) in the public domain

Discovery and Use
• Search, collections, online access
• APIs and data feeds
  – Data API
  – Bibliographic API
  – “Hathifiles” inventory files
  – OAI
• Computational Research
  – Distribution of datasets
  – Protocol-based access
  – Research Center

Research Center in Context
[Diagram]

Goals for HTRC
• Provide a persistent and sustainable structure to enable scholars to ask and answer new questions.
  – Leverage data storage and computational infrastructure at Indiana & Illinois
  – Stimulate community development of new functionality and tools
  – Use tools to enable discoveries that would not be possible without the HTRC
• Enable scholars to fully utilize the content of the HathiTrust Library while preventing intellectual property misuse within U.S. copyright law.
  – Provide a secure computational and data environment for scholars to perform research using the HathiTrust Digital Library.

HTRC Governance
• Reports to the HathiTrust Board of Governors
• HTRC Executive Committee
  – J. Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois GSLIS
  – Beth Plale (Co-director and Chair), Director, Data to Insight Center, and Professor in the School of Informatics and Computing at Indiana University
  – Robert H. McDonald, Associate Dean of Libraries/Deputy Director, Data to Insight Center at Indiana University
  – Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the University of Illinois
  – John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis University
• HTRC Advisory Board (see members on next slide)
• Google Public Domain agreement – in place for IU and UIUC

HTRC Advisory Board
• Cathy Blake, University of Illinois, Urbana-Champaign
• Beth Cate, Indiana University
• Greg Crane, Tufts University
• Laine Farley, California Digital Library
• Brian Geiger, University of California at Riverside
• David Greenbaum, University of California at Berkeley
• Fotis Jannidis, University of Würzburg, Germany
• Matthew Jockers, Stanford University
• Jim Neal, Columbia University
• Bill Newman, Indiana University
• Bethany Nowviskie, University of Virginia
• Andrey Rzhetsky, University of Chicago
• Pat Steele, University of Maryland
• Craig Stewart, Indiana University
• David Theo Goldberg, University of California at Irvine
• John Towns, National Center for Supercomputing Applications
• Madelyn Wessel, University of Virginia

Data Overview

Hathifiles
• Tab-delimited inventory files
• Aggregated monthly
• Daily incremental files
• Contain
  – Identifiers
  – Limited bibliographic information
  – Rights, language, gov docs status information

Content Distribution
• In-copyright or undetermined: 70%
• “Public Domain”: 30%
  – Public Domain (worldwide): 15%
  – Public Domain (US): 10%
  – U.S. Federal Government Documents (worldwide): 4%
  – Open Access: 0.1%
  – Creative Commons: 0.01%

Content Sources
[Chart of contributing institutions. Michigan 45%; California 33%. The remaining sources shown (Wisconsin, Cornell, NYPL, Princeton, Harvard, Columbia, Illinois, Indiana, LC, Minnesota, Yale, UNC-Chapel Hill, Madrid, Virginia, Utah State, Chicago, NCSU, Northwestern, Duke, Penn State, Purdue) each account for 5% or less.]

Dates
• 0-1500: 0%
• 1500-1599: 0%
• 1600-1699: 0%
• 1700-1799: 1%
• 1800-1849: 3%
• 1850-1899: 8%
• 1900-1909: 4%
• 1910-1919: 4%
• 1920-1929: 4%
• 1930-1939: 4%
• 1940-1949: 4%
• 1950-1959: 6%
• 1960-1969: 11%
• 1970-1979: 13%
• 1980-1989: 15%
• 1990-1999: 14%
• 2000-2009: 10%

Language Distribution
The top 10 languages make up ~86% of all content.
• English: 48%
• German: 9%
• French: 7%
• Spanish: 5%
• Chinese: 4%
• Russian: 4%
• Japanese: 3%
• Italian, Arabic, Latin: 1% to 3% each
• Remaining languages: 14%

Data Availability
[Diagram with three stages (Source, Data Management, Access) and these components: Catalog, Bib Data, Ingest, Bibliographic Data, Rights Data, Holdings Data, Content Package, Storage, Full-text Search, PageTurner, Collections, APIs, Datasets. Sites shown: Indiana, Michigan.]

How is it available?
• Web interfaces
• APIs
  – Data API
  – Bib API
• Data feeds and distribution
  – Hathifiles
  – OAI
  – Datasets
• Soon: Virtual Machines

Copyright
• Balancing copyright with fair use:
“I cannot imagine a definition of fair use that would not encompass the transformative uses made by Defendants' MDP [Mass Digitization Project] and would require that I terminate this invaluable contribution to the progress of science and cultivation of the arts that at the same time effectuates the ideals espoused by the ADA [Americans With Disabilities Act].”
–Judge Harold Baer, Jr., U.S. District Court for the Southern District of New York, ruling in Authors Guild, Inc. et al. v. HathiTrust et al.

Automatic Rights Determination
• Conducted on all works at time of ingest and when records are modified
  – Public domain worldwide
    • US works published before 1923, US federal government publications, non-US works published prior to 1872
  – Public domain in the United States
    • Non-US works published prior to 1923

Manual Rights Determination
• IMLS-funded CRMS project
  – US-published works 1923-1963
  – Conformance with formalities
  – Expanding to non-US works
  – Double-blind review with expert review for conflicts
  – Staff at 4 HathiTrust partner institutions (15 will take part in non-US)
  – As of February 2012, ~190,000 reviewed, more than 100,000 opened
• Rights Holder Permissions

Rights Attributes
id  name         type       dscr
1   pd           copyright  public domain
2   ic           copyright  in-copyright
3   opb          copyright  out-of-print and brittle (implies in-copyright)
4   orph         copyright  copyright-orphaned (implies in-copyright)
5   und          copyright  undetermined copyright status
6   umall        access     available to UM affiliates and walk-in patrons (all campuses)
7   world        access     available to everyone in the world
8   nobody       access     available to nobody; blocked for all users
9   pdus         copyright  public domain only when viewed in the US
10  cc-by        copyright  Creative Commons Attribution
11  cc-by-nd     copyright  Creative Commons Attribution-NoDerivatives
12  cc-by-nc-nd  copyright  Creative Commons Attribution-NonCommercial-NoDerivatives
13  cc-by-nc     copyright  Creative Commons Attribution-NonCommercial
14  cc-by-nc-sa  copyright  Creative Commons Attribution-NonCommercial-ShareAlike
15  cc-by-sa     copyright  Creative Commons Attribution-ShareAlike
16  orphcand     copyright  orphan candidate - in 90-day holding period (implies in-copyright)
17  cc-zero      copyright  Creative Commons Zero license (implies pd)
18  und-world    copyright  undetermined copyright status and permitted as world-viewable by the depositor
19  ic-us        copyright  in copyright in the US

Rights Determination Reason Codes
id  name  dscr
1   bib   bibliographically-derived by automatic processes
2   ncn   no printed copyright notice
3   con   contractual agreement with copyright holder on file
4   ddd   due diligence documentation on file
5   man   manual access control override; see note for details
6   pvt   private personal information visible
7   ren   copyright renewal research was conducted
8   nfi   needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9   cdpp  title page or verso contain copyright date and/or place of publication information not in bib record
10  cip   condition review and in-print status research was conducted
11  unp   unpublished work
12  gfv   Google viewability set at VIEW_FULL
13  crms  derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
14  add   author death date research was conducted or notification was received from authoritative source
15  exp   expiration of copyright term for non-US work with corporate author
16  del   deleted from repository; see note for details
17  gatt  Non-US public
domain work restored to in-copyright in the US by GATT

Type of work; Searchable (bibliographic and full-text); Viewable*; Full-PDF download (Data API); Print on Demand; Print disabilities*; Preservation uses (Section 108)*
• Public domain worldwide
  – Searchable: Worldwide
  – Viewable: Worldwide
  – Full-PDF download: Partners only if scanned by Google; if not, worldwide
  – Print on Demand: Worldwide
  – Print disabilities: Partners worldwide
  – Preservation uses: N/A
• Public domain (US) – Non-US works published between 1872 and 1923
  – Searchable: Worldwide
  – Viewable: When accessed from within the United States
  – Full-PDF download: Partners in the US if scanned by Google; if not, anyone in the US
  – Print on Demand: Available within the United States
  – Print disabilities: Partners in the US; partners worldwide where similar laws in effect
  – Preservation uses: N/A
• Works that rights holders have opened access to in HathiTrust
  – Searchable: Worldwide
  – Viewable: Worldwide
  – Full-PDF download: Worldwide (if digitized by Google, full-PDF only available if opened with CC license)
  – Print on Demand: Worldwide with permission
  – Print disabilities: Partners worldwide
  – Preservation uses: N/A
• Works that are in-copyright or of undetermined status
  – Searchable: Worldwide
  – Viewable: Not available
  – Full-PDF download: Not available
  – Print on Demand: Not available
  – Print disabilities: Partners in the US; partners worldwide where similar laws in effect
  – Preservation uses: Partners in the US; partners worldwide where similar laws in effect
• Orphan works
  – Searchable: Worldwide
  – Viewable: To participating US partners
  – Full-PDF download: Not available
  – Print on Demand: Not available
  – Print disabilities: Partners in the US; partners worldwide where similar laws in effect
  – Preservation uses: Partners in the US; partners worldwide where similar laws in effect
* Note: Access to in-copyright works is subject to conditions on the Terms of Access slide.

HTRC Research Paradigm
Bring the COMPUTATION to the DATA!
• Web services architecture and protocols
• Registry of services and algorithms
• Solr full text indexes
• noSQL store as volume store
• openID authentication
• Portal front-end, programmatic access
• Data mining algorithms

[Architecture diagram. Components shown: Portal; Blacklight; Agent framework with agent instances; SEASR analytics service; WSO2 registry (services, collections, data capsule images); HTRC Data API v0.1; WSO2 Identity Server; Solr index; task deployment; Meandre orchestration; non-consumptive data capsules; NCSA local resources; volume stores (Cassandra); rsync from the HathiTrust corpus (e.g., University of Michigan); page/volume tree (file system); NSF XSEDE; Big Red II and IU Quarry; programmatic access.]

[Diagram: a request enters HTRC through a complexity-hiding interface; inside, "all the complexity" covers subsets of the corpus, other data (dictionaries, wiki data), and text mining algorithms; results return through a complexity-hiding interface as spatial plots, statistical plots, and tabular info.]

HTRC Research Access
[Diagram. Components shown: Researcher; Request for VM; VM Manager; VM Image Manager; VM Image Builder; VM Image Store; Secure Virtual Cloud with a VM instance reached over SSH; Non-consumptive Output Storage.]

1. Select volumes for analysis
2. Select algorithm
3. View/download results
Examples shown: named entities, word frequencies, topic models.

Research Engagements
Colin Allen, Professor, Cognitive Science, Indiana University
https://inpho.cogs.indiana.edu/
1315 volumes selected using a keyword search for ‘Darwin’, ‘Romanes’, ‘anthropomorphism’, and ‘comparative psychology’. This set contains many books that are not of particular interest (e.g., books on theology, college course catalogs).
Challenge: Find the philosophical arguments in a haystack of sentences.
Digging into Data 2011

Ted Underwood, Dept of English, UIUC
http://goo.gl/hVbNfZ
Yearly values of the ratio between two wordlists in three different genres. 4,275 volumes, 1700-1899.

Phenotypes implemented at level of genes
Andrey Rzhetsky, Professor, Department of Medicine, University of Chicago
http://www.ci.uchicago.edu/research/rzhetsky/
General study: understanding of how phenotypes, such as human healthy diversity and maladies, are implemented at the level of genes.
Why HTRC: capture properties of language automatically, for text transformations and information extraction. Generalize grammatical and idiomatic patterns as related to systems biology.

Other Grants and Proposals involving HTRC
• Zdenek Zdrahal, “DiscoveryCORE, Discovering Hidden Relationships in Semantically Connected Resources”, NEH Digging Into Data Challenge.
• Matthew Wilkens, Notre Dame, “Literary Geography at Scale”, American Council of Learned Societies (ACLS).
• Ichiro Fujinaga, “Single Interface for Music Score Searching and Analysis (SIMSSA)”, to SSHRC, Canada. Pending.
• Andrew Piper, “Text Mining the Novel: Establishing the Foundations of a New Discipline”, SSHRC, Canada.
• Robert Liffe, University of Sussex, Textual Genomics Project (TTGP), United Kingdom Arts and Humanities Research Council.
• Edie Rasmussen, “From Indexer’s Legacy to Scholar’s Desktop”.
• Adam Farquhar, The British Library, IRIS, Arts and Humanities Research Council grant.

Workset Creation for Scholarly Analysis
Funded at $493,000 by the Andrew W. Mellon Foundation; Co-PIs: J. Stephen Downie, Tim Cole, Beth Plale; 1 July 2013 to 30 June 2015.
Goals:
1) enriching the metadata in the HathiTrust corpus,
2) augmenting string-based metadata with URIs to leverage discovery and sharing through external services, and
3) formalizing the notion of collections and worksets in the context of the HathiTrust Research Center.
Includes an open, competitive Request for Proposals in November 2013, with the intent to fund four prototyping projects that will build tools for enriching and augmenting metadata for the HathiTrust corpus.

HTRC Sloan Cloud for Secure Text Mining at Scale
Funded at $606,000 by The Alfred P. Sloan Foundation; Beth Plale, Indiana University, PI; Atul Prakash, University of Michigan, Co-PI; Fall 2011 - Spring 2013.
Goal: Prototype a system that enables secure text mining to be carried out at scale using public cloud resources, including:
1. a software cloud infrastructure based on OpenStack
2. mechanisms for managing a secure virtual machine
The Sloan Cloud will provide users with dedicated virtual machines that are pre-configured with appropriate tools and provide secure access to remote data that cannot be funneled through the VM to outside filesystems.

Thank You
• This presentation was made possible with content provided by many HTRC colleagues: John Unsworth, J. Stephen Downie, Beth Plale, Robert H. McDonald, Beth Sandore, Yiming Sun, Miao Chen, Guangchen Ruan, Loretta Auvil, Kirk Hess, and many others…
• The HTRC Non-Consumptive Research Grant is graciously funded by the Alfred P. Sloan Foundation
• IU D2I-PTI is graciously funded by The Lilly Endowment, Inc.
• HTRC - http://www.hathitrust.org/htrc
• IU D2I Center - http://d2i.indiana.edu/
• UIUC GSLIS - http://www.lis.illinois.edu/

Contact Information
Speaker: John Unsworth, Brandeis University
unsworth@brandeis.edu | @unsworth
Requests for assistance: Miao Chen, HTRC Education and Outreach
miaochen@indiana.edu
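
Supplementary sketch: the date rules on the "Automatic Rights Determination" slide can be written as a small Python function. This is only an illustration of the rules as stated on the slide, not HathiTrust's actual implementation. The function name `auto_rights`, its parameters, and the use of `None` for "no automatic public-domain determination" are assumptions made for this example; the returned strings `pd` and `pdus` are the attribute names from the Rights Attributes table.

```python
from typing import Optional


def auto_rights(pub_year: int, us_published: bool,
                us_federal_doc: bool = False) -> Optional[str]:
    """Return 'pd', 'pdus', or None following the slide's date rules.

    Hypothetical helper: the real process also uses bibliographic
    data and record changes that are not modeled here.
    """
    # US federal government publications: public domain worldwide.
    if us_federal_doc:
        return "pd"
    # US works published before 1923: public domain worldwide.
    if us_published:
        return "pd" if pub_year < 1923 else None
    # Non-US works published prior to 1872: public domain worldwide.
    if pub_year < 1872:
        return "pd"
    # Non-US works published prior to 1923: public domain in the US only.
    if pub_year < 1923:
        return "pdus"
    # Everything else falls outside the automatic rules
    # (in-copyright or undetermined until reviewed).
    return None


if __name__ == "__main__":
    for args in [(1900, True), (1900, False), (1850, False), (1950, True)]:
        print(args, "->", auto_rights(*args))
```

Works for which the sketch returns `None` correspond to the cases the deck routes to manual review (for example, the CRMS review of US works published 1923-1963).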