Manuscript of paper: Journal of Electronic Publishing, January 2008

In April 2007, the US National Science Foundation (NSF) and the British Joint Information Systems Committee (JISC) held an invitational workshop on data-driven science and data-driven scholarship, co-chaired by Ronald Larsen and William Arms, who jointly authored the final report. The report used the term cyberscholarship to describe new forms of research that become possible when high performance computing meets digital libraries [1]. Elsewhere in this issue of the Journal of Electronic Publishing, Ronald Larsen describes the workshop and its conclusions. In this article, William Arms gives a personal view of the motivation behind the workshop and the roles of libraries and publishing in achieving its goals.

Cyberscholarship: High Performance Computing meets Digital Libraries

William Y. Arms

November 27, 2007

In a recent seminar, Gregory Crane of Tufts University made the simple but profound statement, "When collections get large, only the computer reads every word" [2]. Thomas Jefferson's library, which formed the nucleus of the Library of Congress, had 6,487 volumes – a large private collection in its day [3]. Edward Gibbon's famous library catalog had 2,675 records (which he wrote on the back of playing cards) [4]. Avid readers, like Jefferson and Gibbon, can read and digest thousands of books, but modern research libraries are so large that nobody can read everything in them.

The marriage of high performance computing with digital libraries introduces a completely new notion of scale. High performance computing can bring together vast quantities of material – datasets, manuscripts, reports, etc. – which might never make their way into a traditional library. A scholar reads only a few hundred documents; a supercomputer can analyze millions. Of course, a person has a rich understanding of what is being read while a computer works at a very superficial level, but profound research is possible by simple analysis of huge amounts of information. Computer programs can identify latent patterns of information or relationships that will never be found by human searching and browsing. This new form of research can be called cyberscholarship.

Cyberscholarship becomes possible only when there are digital libraries with extensive collections in a broad domain. Such collections are large even by supercomputing standards. For instance, the USC Shoah Foundation's video collection of 52,000 interviews with survivors of genocide, notably the Nazi holocaust, is about 400 terabytes [5]. The Library of Congress's National Digital Information Infrastructure and Preservation Program (NDIIPP) supports collaborative projects to preserve digital content [6]. There is no published estimate for the aggregate size of these projects, but a rough estimate is at least one petabyte, or 1,000 terabytes. Many research libraries have begun to digitize the books in their collections, working with organizations such as the Open Content Alliance, Google, or Microsoft. These collections are becoming very large. Digitized copies of the eight million books in the Cornell library alone are estimated at four petabytes of data.

A plausible view of today's developments sees these digital collections as the future of research libraries. Already, many scientists never enter a library building or read the printed copy of a journal.
Their reports, working papers, conference proceedings, datasets, images, audio, videos, software, and journal articles are created in digital formats and accessed via networks, and a large proportion of older content has been digitized. With books available in digital formats – either from the publishers or by scanning older materials – scholars from other disciplines, including many social scientists and humanities scholars, will also be able to work entirely from digital materials. Perhaps the university library will cease to be the largest building on campus and become the largest computing center.

Examples of cyberscholarship

How will scholars use these huge collections? How can high-performance computing support novel research on large bodies of text, data, and other digital materials? New forms of research need new types of libraries. To understand why, here are three examples from different domains. The first looks at data-driven science, the second focuses on information discovery from heterogeneous sources, and the third is our own work at Cornell on mining the history of the Web.

The National Virtual Observatory

The National Virtual Observatory has been a pioneer in data-driven science [7]. Its goal is to bring together previously disjoint sets of astronomical data, in particular digital sky surveys that have made observations at various wavelengths. Important astronomical results that are not observable in a single dataset can be revealed by combined analysis of data from these different surveys. The datasets are archived at many locations, and therefore the National Virtual Observatory provides coordinated access to distributed data. Because the analyses are complex and often computationally intensive, the virtual observatory expects researchers to select data extracts from the archives and download them to their own computers. For this purpose, the virtual observatory has developed an applications programming interface (API), an XML encoding scheme for astronomical data, and a number of applications that work with the API. Protocols are provided for specific tasks, such as a simple image access protocol.

The National Virtual Observatory has been funded by a major NSF grant to astronomers Alex Szalay and Roy Williams. They worked closely with the late Jim Gray, a leading computer scientist who specialized in very large databases.

Entrez

The Entrez system from the National Center for Biotechnology Information (NCBI) is another early example of cyberscholarship [8]. NCBI is a division of the National Library of Medicine within the National Institutes of Health (NIH). Entrez provides a unified view of biomedical information from a wide variety of sources, including PubMed citations and abstracts, the Medical Subject Headings, full text of journal articles and books, databases such as the protein sequence database and GenBank, and computer programs such as the Basic Local Alignment Search Tool (BLAST) for comparing gene and protein sequences.

Entrez provides very powerful searching across the 23 databases in the system. For example, somebody who is interested in a particular genetic sequence can extract every occurrence of that sequence from all the information sources. Genetic sequences are particularly convenient for information discovery because they are encoded by four letters, A, C, G, and T, representing the four nucleotides of a DNA strand, but Entrez is more than a cross-domain search service. It provides an applications programming interface (API), so that researchers can use their own computers to explore this information, and the Entrez Programming Utilities, which provide a structured interface to all the databases. Thus researchers have a flexible set of tools to discover unexpected patterns that are buried in the datasets.
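As an illustration of what such an interface makes possible, the following sketch uses the Entrez Programming Utilities over plain HTTP to search PubMed for a topic and then retrieve the matching abstracts for further processing on the researcher's own computer. It is a minimal sketch rather than NCBI's recommended client: the query term is arbitrary, the XML handling is deliberately naive, and a production script would add error handling and respect NCBI's usage guidelines.

    # Minimal sketch of programmatic access to Entrez via the E-utilities:
    # search PubMed for a term, then fetch the matching abstracts as text.
    import re
    from urllib.parse import urlencode
    from urllib.request import urlopen

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

    def search_pubmed(term, retmax=20):
        """Return the PubMed IDs of records matching the query term."""
        query = urlencode({"db": "pubmed", "term": term, "retmax": retmax})
        with urlopen(f"{EUTILS}/esearch.fcgi?{query}") as reply:
            xml = reply.read().decode("utf-8")
        return re.findall(r"<Id>(\d+)</Id>", xml)   # naive parse, sufficient here

    def fetch_abstracts(pubmed_ids):
        """Fetch the abstracts for the given PubMed IDs as plain text."""
        query = urlencode({"db": "pubmed", "id": ",".join(pubmed_ids),
                           "rettype": "abstract", "retmode": "text"})
        with urlopen(f"{EUTILS}/efetch.fcgi?{query}") as reply:
            return reply.read().decode("utf-8")

    if __name__ == "__main__":
        ids = search_pubmed("BRCA1 AND protein binding")
        print(fetch_abstracts(ids[:5]))

The pattern, a search that returns identifiers followed by a bulk fetch of the corresponding records, applies to the other databases behind Entrez as well, which is what makes the interface attractive for automated analysis.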
The Cornell Web Lab

The final example is the Cornell Web Lab [9]. For many years, a group of us at Cornell University have been interested in analyzing the Web collection of the Internet Archive [10]. Approximately every two months since 1996, the Internet Archive, led by Brewster Kahle, has collected a snapshot of the Web and preserved it for future generations. The collection has more than 110 billion Web pages, or about 1.9 petabytes of compressed data.

The Wayback Machine provides basic access to the collection. To use it, a user submits a URL and receives a list of all the dates for which there are pages with that URL. If the user clicks on a date, the page is returned in a form as close to the original as possible. This is a wonderful system, but it only goes so far. It is organized for human retrieval of single pages, not for computer programs that analyze millions.

Researchers want more flexibility. They want to ask questions such as, "What foreign government sites link to www.law.cornell.edu?" or "As the Web grows, how has its structure changed?" or "What was the origin of the myth that the Internet was doubling every six months?" To answer such questions, researchers need many more ways to select pages than by URL and date, as provided by the Wayback Machine, or by textual queries, as provided by search engines such as Google.

They also need powerful tools to transform raw data into the form that is used for analysis. A colleague whose research is on Web clustering algorithms mentioned that his team spends 90 percent of its effort obtaining test data and 10 percent using that data for research. Much of the preliminary effort is spent finding a suitable set of pages to study; the remaining effort is in preparing the data for analysis: removing duplicates, clarifying ambiguous URLs, extracting terms and links, matching links to URLs, reconciling broken links or missing data, and much more. Since the test collections are very large, these preparatory tasks and the subsequent analysis require sophisticated programming and large amounts of computation.

The Web Lab's strategy to meet such needs is to copy a large portion of the Internet Archive's Web collection to Cornell, mount it on a powerful computer system, organize it so that researchers have great flexibility in how they use it, and provide tools and services that minimize the effort required to use the data in research.
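To give a flavor of the preparatory work described above, the sketch below shows two of the routine steps in a few lines of Python: reducing common URL variants to a canonical form so that links can be matched, and discarding exact duplicate pages by hashing their contents. It is an illustration rather than Web Lab code; the normalization rules are greatly simplified, and real crawl data raises many more cases (session identifiers, redirects, near-duplicates) than are handled here.

    # Sketch of two routine preparation steps for Web data:
    # canonicalizing URL variants and removing exact duplicate pages.
    import hashlib
    from urllib.parse import urlsplit, urlunsplit

    def normalize_url(url):
        """Reduce common variations of a URL to one canonical form."""
        scheme, netloc, path, query, _fragment = urlsplit(url.strip())
        netloc = netloc.lower()
        if netloc.endswith(":80"):      # the default HTTP port is redundant
            netloc = netloc[:-3]
        if path in ("", "/index.html", "/index.htm"):
            path = "/"                  # treat aliases of the site root alike
        return urlunsplit((scheme.lower(), netloc, path, query, ""))

    def deduplicate(pages):
        """Keep one copy of each distinct page body.

        pages is an iterable of (url, body) pairs, where body is bytes.
        """
        seen, unique = set(), []
        for url, body in pages:
            digest = hashlib.md5(body).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append((normalize_url(url), body))
        return unique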
Implications for publishers and libraries

If these three examples are indeed representative of a new form of scholarship, the implications for publishers and libraries are profound, but there is limited experience to guide the developments. Nobody is sure what services will be needed or how they will be provided. Often it is easier to state the challenges than to articulate the solutions. Here are some of the lessons that we have learned, based primarily on our experience in developing the Web Lab.

Market research and incremental development

The first step in building the Web Lab was market research. In fall 2004, a group of Cornell students interviewed 15 faculty members and graduate students from social sciences and computer science. From these interviews, we gained an appreciation of potential areas of research and some of the methodologies that might be used by different disciplines.

Computer scientists study the structure and evolution of the Web itself. A series of classic experiments has explored the structure of the Web, but the experiments require so much data that many have never been repeated; at most they have been repeated a few times, and rarely at different dates in the evolution of the Web. Social scientists study the Web as a social phenomenon, of interest in its own right and for the evidence it provides of current social events, such as the spread of urban legends and the development of legal concepts across time. In one experiment, computer scientists with strong ties to sociology analyzed a billion items of data from LiveJournal to see how the tendency of an individual to join a community is influenced by the number of friends within the community and by how those friends are connected to each other [11].

Figure 1: The research dialog

Market research gives general guidance, but the detailed requirements can only be developed incrementally. The Web Lab's strategy is to build a few generic tools and services, work with researchers to discover what they need, and implement new services steadily over time. Figure 1 shows the dialog between researchers and computer scientists. The dialog begins with the researchers, e.g., sociologists studying the diffusion of ideas. They have general ideas of studies they would like to do, based on concepts that are rooted in their discipline. Those of us who are building the lab may not know how to create the precise data structures, algorithms, and computing systems that they would like to have, but we can suggest alternative approaches. These suggestions may stimulate the researchers to devise new methods. Eventually, an experiment emerges that is interesting from the discipline point of view and computationally feasible. The experiment is carried out, both parties learn more, and the dialog continues.

Organizational implications

This incremental development requires a flexible organization and close collaboration between researchers, publishers, and librarians. Each of the three examples – the National Virtual Observatory, Entrez, and the Web Lab – has a large and complex computer system under the control of one organization. In theory it is possible to provide comparable functionality in systems with distributed management, but, even when the data and services are distributed, the practical challenges of coordination are so great that it is wiser to build a centralized organization specializing in an area of cyberscholarship and serving all universities. The NSF has recently announced a program to create several centers, known as Datanets, where the collection and preservation of scientific data are closely integrated with the research that will make use of it. From an earlier generation of libraries, OCLC provides an example of a centralized organization that manages data on behalf of a large number of members.

Content
Cyberscholarship requires that the data exist and be accessible. One reason that the NCBI and the National Virtual Observatory were able to make rapid progress is that molecular biology and astronomy have traditions of early and complete publication of research data, and well-managed archives to look after it. For the Web Lab, the data is already collected and preserved by an established organization, the Internet Archive; our role is to provide access for researchers. In other fields, however, the information is widely scattered, or never published; experimental data is poorly archived or even discarded. In a well-known example, scholars were denied access to large portions of the Dead Sea Scrolls from the 1950s, when they were discovered, until 1991.

Once the information has been captured, access in machine-readable formats is crucial. All the information in a field must be available to the cyber scholar under reasonable terms and conditions. Open access is ideal but not the only possibility.

Policy issues are closely tied to access. Even though the information in the Web Lab was originally placed on the Web with open access, there are important policy issues in using it for research. The copyright issues are well known and outside the scope of this article. Privacy considerations may be more daunting. Data mining can violate privacy by bringing together facts about an individual. Universities have strict policies on experiments that involve human subjects, and Cornell applies these policies to data mining of the Web, but the situation becomes complicated when a digital library provides services to many universities or to non-academic researchers.

Finally, there is the question of custodianship. These large digital library collections are important historical records. The custodial duty is to preserve both the content and the identifying metadata. It would be highly improper to destroy inconvenient data, or to index it in such a manner that damaging information is never found.

Tools and services

Libraries provide tools and services to make the collections useful to their patrons, e.g., catalogs, bibliographies, classification schemes, abstracting and indexing services, and knowledgeable reference librarians. These services have evolved over many years, based on a collective view of how libraries are used. Conversely, the services that are available have shaped the type of research that is carried out in libraries. What tools and services are required for cyberscholarship?

As the Web Lab develops, we continually see new needs. An early request was for focused Web crawling, which is used to find Web pages fitting specified criteria. The request came from work in educational digital libraries to automate the selection of materials for a library [12]. Other requests came from a study of how opinions about a company change across time. One hypothesis is that anchor text (the linked text in Web pages) reflects such opinions. An informal description of the experiment is to begin with the Web in 2005 and extract the anchor text from all external links that point to the company's Web site. The text is processed to eliminate navigational terms, such as "click here" or "back". The frequency with which each word appears is counted, and the process is repeated with data from subsequent years. Trends in the frequencies provide evidence about how the company is perceived.
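The core of such an analysis can be expressed in a few lines. The sketch below removes navigational phrases from the anchor text gathered for each year and counts the remaining words, so that term frequencies can be compared from year to year. The stop list and the input format are invented for illustration; in the Web Lab the anchor text itself comes from the link tables described below.

    # Sketch of the anchor-text analysis: count non-navigational words in
    # the anchor text collected for each year. anchors_by_year is assumed
    # to map a year to a list of anchor-text strings for that year.
    import re
    from collections import Counter

    NAVIGATIONAL = {"click", "here", "back", "home", "next", "more", "link"}

    def word_frequencies(anchor_texts):
        """Count content words in a collection of anchor-text strings."""
        counts = Counter()
        for text in anchor_texts:
            for word in re.findall(r"[a-z]+", text.lower()):
                if word not in NAVIGATIONAL:
                    counts[word] += 1
        return counts

    def frequency_trends(anchors_by_year):
        """Map each year to its word frequencies for year-on-year comparison."""
        return {year: word_frequencies(texts)
                for year, texts in sorted(anchors_by_year.items())}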
This simple example uses several tools that process massive quantities of data. The first is a way to select pages by attributes such as domain name, format type, date, anchor text, and links to and from each page. For this purpose, the Web Lab has created a relational database with tables for all pages, URLs, and links. Currently it contains metadata of the pages from four complete crawls and is about 50 terabytes in size. Other tools are used to extract a clean set of links from a collection of pages and remove duplicates. This is a messy task: many of the URLs embedded in Web pages refer to pages that were never collected or never existed; URLs have many variations; and, because each snapshot of the Web was collected over a period of weeks or months, a URL may refer to a page that changed before it was collected.

Some tools and services are specific to a field of research, but there are two requirements that apply to every domain. The first is an application program interface (API), so that computer programs can interact directly with the collections. Conventional digital libraries have been developed on the assumption that the user is a person. Much effort is placed in the design of user interfaces, but the computer interface is usually given low priority. In contrast, the three examples in this article all assume that the primary form of access will be via computer programs. Researchers want computer programs to act as their agents, searching billions of items for patterns of information that are only vaguely guessed at.

The next requirement is to be able to identify and download sub-collections. Often, the scale of the full collection inhibits complex analyses. Instead, the methodology is to select part of the collection and download it to another computer for analysis. For this purpose, digital library collections must be designed so that programs can extract large parts of them. For example, Wikipedia encourages researchers to download everything: the software, the content, the discussions, and all the historical trails. Currently, it is about one terabyte, a size that can be analyzed on a smallish computer.

Using high-performance computing

Very large collections require high-performance computing. Few scholars would recognize the Cornell Web Lab as a library, but the computer system looks very familiar to the supercomputing community. An early version is described in [13]. Powerful computer systems at the Internet Archive in San Francisco and Cornell University in upstate New York are connected by research networks – Internet2 and the National LambdaRail – which have so much capacity that complete snapshots of the Web are routinely transferred across the country. The Internet Archive operates very large clusters of Linux computers, while Cornell has several smaller computer clusters and a large database server dedicated to the Web Lab.

One of the most interesting challenges is to help researchers who are not computing specialists use these systems. Until very recently, high-performance computing was a very specialized area. Even today, experienced computing professionals are often discouraged by the complexities of parallel programming and managing huge datasets. Fortunately this situation is changing. In the 1960s, the Fortran programming language first gave scientists a simple way to translate their mathematical problems into efficient computer codes. Cyber scholars need a similarly simple way to express massive data analysis as programs that run efficiently on large computer clusters.

The map/reduce programming paradigm, developed by the Lisp community and refined and made popular by Google, appears to fulfill this need [14]. There is an excellent open-source software system called Hadoop that combines map/reduce programming with a distributed file system [15]. In the Web Lab, we have used it for tasks such as full-text indexing, the anchor text study described above, and extraction of the graph of the links between Web pages. While we are still inexperienced in running large jobs, the programming required of the researchers is reassuringly straightforward.
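To indicate how straightforward such programs can be, here is a word count written as a map/reduce job in the style of Hadoop's streaming interface, which runs ordinary programs that read records from standard input and write tab-separated key-value pairs to standard output. It is a generic sketch of the paradigm, not code from the Web Lab, and it omits the configuration needed to submit the job to a cluster.

    # A generic map/reduce word count in the style of Hadoop Streaming.
    # Run with "map" or "reduce" as the single command-line argument.
    import sys

    def mapper():
        # Each input line is a fragment of text; emit a (word, 1) pair per word.
        for line in sys.stdin:
            for word in line.lower().split():
                print(f"{word}\t1")

    def reducer():
        # The framework delivers the pairs sorted by key, so occurrences of
        # the same word arrive together and can be summed in a single pass.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

Because streaming programs are ordinary filters, the same two functions can be tested on a single machine with a Unix pipeline, piping a sample file through the mapper, then sort, then the reducer, before the job is submitted to a cluster.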
Computational limits

Modern computers are so powerful that it is easy to forget that they have limits, but techniques that are excellent for moderately large digital libraries fail with very large collections. The Web Lab tries to isolate cyber scholars from the complex systems that are serving them, but they have to recognize that, with billions of records and hundreds of terabytes of data, apparently simple operations can run forever. We need better ways to help researchers recognize these limits, design experiments that do not exceed them, and off-load appropriate tasks to more powerful computers.

For instance, users of the Web Lab want full-text indexes of the collections that they are studying. The Internet Archive and Cornell University have considerable computing resources, but not enough to index a hundred billion Web pages. Fortunately, no researcher wants to analyze every page in the history of the entire Web. In one set of experiments, researchers are analyzing a sub-collection of 30 million pages extracted from the amazon.com domain. This collection would be considered large in other contexts, but there is little difficulty in indexing it on a medium-sized computer cluster. We are experimenting to see if we can use one of the new NSF supercomputing centers to index the entire collection, but meanwhile indexes of sub-collections are sufficient for most purposes.

Expertise

The limiting factor in the development of cyberscholarship is likely to be a shortage of expertise. The convergence of high performance computing and large-scale digital libraries requires technical skills that few universities possess. During the dot-com boom, universities lost most of their expertise to start-up companies. In particular, the development of automatic indexing moved from universities such as Cornell, Carnegie Mellon, and Stanford to the private sector. Skill in managing large collections of semi-structured data, such as the Web, is concentrated in a few commercial organizations, such as Google, Yahoo, Amazon, and Microsoft, and a single not-for-profit, the Internet Archive. Recently Google, IBM, and others have begun initiatives to encourage universities to educate more students in the computing methods of massive data analysis. Hopefully, some of this expertise will remain at the universities and become part of the academic mainstream.

Opportunities for change

The underlying theme of this paper is that there is a symbiosis between the organization of library collections and the types of research that they enable. When large digital collections meet high performance computing, opportunities for new kinds of scholarship emerge, but these opportunities need new approaches to how collections are organized and how they are used. In a thought-provoking article on the future of academic libraries, David Lewis writes, "Real change requires real change" [16].
In universities and libraries some people are cautious in accepting change, while others welcome it. Some individuals were early adopters of Web searching, but others would not accept that automatic indexing could be more effective than indexing by skilled individuals [17]. Some scholars have been outspoken about the evils of Wikipedia [18], or have written articles that are critical of digitized books [19], while their colleagues are enthusiastic about the opportunities that new approaches offer. Overall, the academic community has benefited from early adoption of new technology, e.g., campus networks, electronic mail, and the Web. Online library catalogs, led by the Library of Congress and OCLC, were pioneering computing systems in their day. Information companies, such as Cisco and Google, have their roots in university projects.

Cyberscholarship provides similar opportunities. For the academic community it provides new methodologies for research. For libraries and entrepreneurs the potential is almost unlimited.

Acknowledgements

Parts of this article come from a position paper for the NSF/JISC workshop in Phoenix, Arizona, April 2007: http://www.sis.pitt.edu/~repwkshop/papers/arms.html.

The Cornell Web Lab is funded in part by National Science Foundation grants CNS-0403340, SES-0537606, IIS-0634677, and IIS-0705774.

Notes and references

[1] William Y. Arms and Ronald L. Larsen (co-chairs). The future of scholarly communication: building the infrastructure for cyberscholarship. NSF/JISC workshop, Phoenix, Arizona, April 17 to 19, 2007. http://www.sis.pitt.edu/~repwkshop/NSF-JISCreport.pdf.
[2] Gregory Crane. What do you do with a million books? Information science colloquium, Cornell University, March 2006.
[3] See: Jefferson's Legacy, A brief history of the Library of Congress. Library of Congress Web site. http://www.loc.gov/loc/legacy/loc.html.
[4] Gibbon's Library. New York Times, June 19, 1897.
[5] USC Shoah Foundation Institute for Visual History and Education. http://www.usc.edu/schools/college/vhi/.
[6] The National Digital Information Infrastructure and Preservation Program (NDIIPP) is described on the Library of Congress's digital preservation Web site: http://www.digitalpreservation.gov/.
[7] Alex Szalay and Jim Gray. The World-Wide Telescope. Science, 293, 2037, 2001. The National Virtual Observatory's home page is: http://www.us-vo.org/.
[8] The Entrez home page is: http://www.ncbi.nlm.nih.gov/Entrez. The best description of Entrez is on Wikipedia (November 11, 2007): http://en.wikipedia.org/wiki/Entrez.
[9] William Arms, Selcuk Aya, Pavel Dmitriev, Blazej Kot, Ruth Mitchell, and Lucia Walle. A research library based on the historical collections of the Internet Archive. D-Lib Magazine, 12 (2), February 2006. http://www.dlib.org/dlib/february06/arms/02arms.html.
[10] The Internet Archive's home page is: http://www.archive.org/.
[11] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: membership, growth, and evolution. Twelfth International Conference on Knowledge Discovery and Data Mining, 2006.
[12] Donna Bergmark. Collection synthesis. ACM/IEEE Joint Conference on Digital Libraries, July 2002.
[13] William Arms, Selcuk Aya, Pavel Dmitriev, Blazej Kot, Ruth Mitchell, and Lucia Walle. Building a research library for the history of the Web. ACM/IEEE Joint Conference on Digital Libraries, 2006.
[14] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Usenix OSDI '04, 2004. http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf.
[15] Hadoop is part of the Lucene-Apache suite of open-source software. See: http://lucene.apache.org/hadoop/.
[16] David W. Lewis. A model for academic libraries 2005 to 2025. Visions of Change, California State University at Sacramento, January 26, 2007.
[17] The letter by Weinheimer in D-Lib Magazine, September 2000, is a good example of judging the new by the standards of the old. http://www.dlib.org/dlib/september00/09letters.html#Weinheimer.
[18] See, for example: Noam Cohen. A history department bans citing Wikipedia as a research source. New York Times, February 21, 2007.
[19] See, for example: Robert B. Townsend. Google Books: Is it good for history? American Historical Association, September 12, 2007. http://www.historians.org/perspectives/issues/2007/0709/0709vie1.cfm.