Manuscript of paper: Journal of Electronic Publishing, January 2008
In April 2007, the US National Science Foundation (NSF) and the British Joint
Information Systems Committee (JISC) held an invitational workshop on data-driven science and data-driven scholarship, co-chaired by Ronald Larsen and
William Arms, who jointly authored the final report. The report used the term
cyberscholarship to describe new forms of research that become possible when
high performance computing meets digital libraries [1]. Elsewhere in this issue of
the Journal of Electronic Publishing, Ronald Larsen describes the workshop and
its conclusions. In this article, William Arms gives a personal view of the
motivation behind the workshop and the roles of libraries and publishing in
achieving its goals.
Cyberscholarship: High Performance Computing meets Digital
Libraries
William Y. Arms
November 27, 2007
In a recent seminar, Gregory Crane of Tufts University made the simple but profound
statement, "When collections get large, only the computer reads every word" [2].
Thomas Jefferson's library, which formed the nucleus of the Library of Congress, had
6,487 volumes – a large private collection in its day [3]. Edward Gibbon's famous library
catalog had 2,675 records (which he wrote on the back of playing cards) [4]. Avid
readers, like Jefferson and Gibbon, can read and digest thousands of books, but modern
research libraries are so large that nobody can read everything in them.
The marriage of high performance computing with digital libraries introduces a
completely new notion of scale. High performance computing can bring together vast
quantities of material – datasets, manuscripts, reports, etc. – which might never make
their way into a traditional library. A scholar reads only a few hundred documents; a
supercomputer can analyze millions. Of course, a person has a rich understanding of
what is being read while a computer works at a very superficial level, but profound
research is possible by simple analysis of huge amounts of information. Computer
programs can identify latent patterns of information or relationships that will never be
found by human searching and browsing. This new form of research can be called
cyberscholarship.
Cyberscholarship becomes possible only when there are digital libraries with extensive
collections in a broad domain. Such collections are large even by supercomputing
standards. For instance, the USC Shoah Foundation's video collection of 52,000
interviews with survivors of genocide, notably the Nazi Holocaust, is about 400 terabytes
[5]. The Library of Congress's National Digital Information Infrastructure and
Preservation Program (NDIIPP) supports collaborative projects to preserve digital
content [6]. There is no published estimate for the aggregate size of these projects, but a
rough estimate is at least one petabyte or 1,000 terabytes. Many research libraries have
begun to digitize the books in their collections, working with organizations such as the
Open Content Alliance, Google, or Microsoft. These collections are becoming very
large. Digitized copies of the eight million books in the Cornell library alone are estimated to amount to four petabytes of data.
A plausible view of today's developments sees these digital collections as the future of
research libraries. Already, many scientists never enter a library building or read the
printed copy of a journal. Their reports, working papers, conference proceedings,
datasets, images, audio, videos, software, and journal articles are created in digital
formats and accessed via networks, and a large proportion of older content has been
digitized. With books available in digital formats – either from the publishers or by
scanning older materials – scholars from other disciplines, including many social
scientists and humanities scholars, will also be able to work entirely from digital
materials. Perhaps the university library will cease to be the largest building on campus
and become the largest computing center.
Examples of cyberscholarship
How will scholars use these huge collections? How can high-performance computing
support novel research on large bodies of text, data, and other digital materials? New
forms of research need new types of libraries. To understand why, here are three
examples from different domains. The first looks at data-driven science, the second focuses on information discovery from heterogeneous sources, and the third is our own
work at Cornell on mining the history of the Web.
The National Virtual Observatory
The National Virtual Observatory has been a pioneer in data-driven science [7]. Its goal
is to bring together previously disjoint sets of astronomical data, in particular digital sky
surveys that have made observations at various wavelengths. Important astronomical
results that are not observable in a single dataset can be revealed by combined analysis of
data from these different surveys.
The datasets are archived at many locations and therefore the National Virtual
Observatory provides coordinated access to distributed data. Because the analyses are
complex and often computationally intensive, the virtual observatory expects researchers
to select data extracts from the archives and download them to their own computers. For
this purpose, the virtual observatory has developed an applications programming
interface (API), an XML encoding scheme for astronomical data, and a number of
applications that work with the API. Protocols are provided for specific tasks, such as a
simple image access protocol. The National Virtual Observatory has been funded by a
major NSF grant to astronomers Alex Szalay and Roy Williams. They worked closely
with the late Jim Gray, a leading computer scientist who specialized in very large
databases.
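
As a rough illustration of this style of access, the sketch below shows what a query against a Simple Image Access service might look like in Python. The endpoint URL is a placeholder rather than a specific National Virtual Observatory service, and the response handling is deliberately minimal.

    # Sketch: querying a Simple Image Access (SIA) service for images that
    # cover a position on the sky. The endpoint is a placeholder; a real
    # survey's SIA base URL and response columns will differ.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    SIA_ENDPOINT = "http://example.org/sia"  # placeholder base URL

    def sia_query(ra_deg, dec_deg, size_deg):
        """Return the VOTable (XML) describing images near a sky position."""
        params = urllib.parse.urlencode({
            "POS": f"{ra_deg},{dec_deg}",  # right ascension, declination (degrees)
            "SIZE": str(size_deg),         # angular size of the search region
            "FORMAT": "image/fits",
        })
        with urllib.request.urlopen(f"{SIA_ENDPOINT}?{params}") as resp:
            return ET.fromstring(resp.read())

    # Example: a half-degree region at RA 180.0, Dec +2.5
    # votable = sia_query(180.0, 2.5, 0.5)

The point is not the particular protocol but the pattern: a documented programmatic interface lets a researcher's own programs, not just a Web browser, work with the archives.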
Entrez
The Entrez system from the National Center for Biotechnology Information (NCBI) is
another early example of cyberscholarship [8]. NCBI is a division of the National
Library of Medicine within the National Institutes of Health (NIH). Entrez provides a
unified view of biomedical information from a wide variety of sources including the
PubMed citations and abstracts, the Medical Subject Headings, full text of journal articles
and books, databases such as the protein sequence database and GenBank, and computer
programs such as the Basic Local Alignment Search Tool (BLAST) for comparing gene
and protein sequences.
Entrez provides very powerful searching across the 23 databases in the system. For
example, somebody who is interested in a particular genetic sequence can extract every
occurrence of that sequence from all the information sources. Genetic sequences are
particularly convenient for information discovery because they are encoded by four
letters, A, C, G, and T, representing the four nucleotides of a DNA strand, but Entrez is
more than a cross-domain search service. It provides an applications programming
interface (API), so that researchers can use their own computers to explore this
information, and the Entrez Programming Utilities, which provide a structured interface
to all the databases. Thus researchers have a flexible set of tools to discover unexpected
patterns that are buried in the datasets.
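
To give a feel for what programmatic access looks like, here is a minimal sketch that calls the Entrez Programming Utilities (E-utilities) over HTTP. The ESearch service returns the identifiers of matching records as XML; the query term shown is purely illustrative.

    # Sketch: finding record IDs across an Entrez database with ESearch.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

    def esearch(db, term, retmax=20):
        """Return up to retmax record IDs in database `db` matching `term`."""
        params = urllib.parse.urlencode({"db": db, "term": term, "retmax": retmax})
        with urllib.request.urlopen(f"{EUTILS}/esearch.fcgi?{params}") as resp:
            tree = ET.fromstring(resp.read())
        return [e.text for e in tree.findall(".//Id")]

    # Illustrative query: PubMed records that mention a gene of interest.
    # ids = esearch("pubmed", "BRCA1 AND breast cancer")

A companion service, EFetch, retrieves the full records for a list of identifiers, so a researcher's program can move from a search to the underlying data without human intervention.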
The Cornell Web Lab
The final example is the Cornell Web Lab [9]. For many years, a group of us at Cornell
University have been interested in analyzing the Web collection of the Internet Archive
[10]. Approximately every two months since 1996, the Internet Archive, led by Brewster
Kahle, has collected a snapshot of the Web and preserved it for future generations. The
collection has more than 110 billion Web pages, or about 1.9 petabytes of compressed
data.
The Wayback Machine provides basic access to the collection. To use it, a user submits a
URL and receives a list of all the dates for which there are pages with that URL. If the
user clicks on a date, the page is returned in a form as close to the original as possible.
This is a wonderful system, but it only goes so far. It is organized for human retrieval of
single pages, not for computer programs that analyze millions. Researchers want more
flexibility. They want to ask questions such as, "What foreign government sites link to
www.law.cornell.edu?" or "As the Web grows, how has its structure changed?" or "What
was the origin of the myth that the Internet was doubling every six months?"
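
A snapshot in the Wayback Machine is addressed by the original URL plus a fourteen-digit timestamp, which is exactly the granularity of access it supports: one page at one date. A minimal sketch (the date shown is illustrative):

    # Sketch: fetching one archived copy of one page from the Wayback Machine.
    import urllib.request

    def wayback_snapshot_url(original_url, timestamp):
        """timestamp is YYYYMMDDhhmmss; returns the address of that snapshot."""
        return f"http://web.archive.org/web/{timestamp}/{original_url}"

    url = wayback_snapshot_url("http://www.law.cornell.edu/", "19970101000000")
    # html = urllib.request.urlopen(url).read()

Answering the research questions above requires selecting and processing millions of such pages, far beyond what page-at-a-time retrieval can support.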
To answer such questions, researchers need many more ways to select pages than by
URL and date, as provided by the Wayback Machine, or by textual queries, as provided
by search engines such as Google. They also need powerful tools to transform raw data
into the form that is used for analysis. A colleague whose research is on Web clustering
algorithms mentioned that his team spends 90 percent of its effort obtaining test data, and
10 percent using that data for research. Much of the preliminary effort is spent finding a
suitable set of pages to study; the remaining effort is in preparing the data for analysis:
removing duplicates, clarifying ambiguous URLs, extracting terms and links, matching
links to URLs, reconciling broken links or missing data, and much more. Since the test
collections are very large, these preparatory tasks and the subsequent analysis require
sophisticated programming and large amounts of computation.
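
Two of these preparatory steps, extracting links and normalizing URLs, are sketched below to give a sense of the work involved. A production pipeline handles far more cases (redirects, session identifiers embedded in URLs, malformed markup); this only shows the flavor.

    # Sketch of two data-preparation steps: pulling link targets out of a
    # crawled page and normalizing URLs so that trivial variants (upper- or
    # lower-case host names, trailing fragments) compare equal.
    import re
    from urllib.parse import urljoin, urlsplit, urlunsplit

    HREF_RE = re.compile(r'<a[^>]+href=["\']([^"\']+)["\']', re.IGNORECASE)

    def extract_links(base_url, html):
        """Return absolute link targets found in one page's HTML."""
        return [urljoin(base_url, href) for href in HREF_RE.findall(html)]

    def normalize_url(url):
        """Lower-case the scheme and host, drop fragments and empty paths."""
        parts = urlsplit(url)
        host = parts.hostname or ""
        if parts.port and parts.port not in (80, 443):
            host = f"{host}:{parts.port}"
        return urlunsplit((parts.scheme.lower(), host, parts.path or "/", parts.query, ""))

Multiplied by billions of pages, even such simple steps require careful, distributed implementations.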
The Web Lab's strategy to meet such needs is to copy a large portion of the Internet
Archive's Web collection to Cornell, mount it on a powerful computer system, organize it
so that researchers have great flexibility in how they use it, and provide tools and services
that minimize the effort required to use the data in research.
Implications for publishers and libraries
If these three examples are indeed representative of a new form of scholarship, the
implications for publishers and libraries are profound, but there is limited experience to
guide the developments. Nobody is sure what services will be needed or how they will
be provided. Often it is easier to state the challenges than to articulate the solutions.
Here are some of the lessons that we have learned, based primarily on our experience in
developing the Web Lab.
Market research and incremental development
The first step in building the Web Lab was market research. In fall 2004, a group of
Cornell students interviewed 15 faculty members and graduate students from social
sciences and computer science. From these interviews, we gained an appreciation of
potential areas of research and some of the methodologies that might be used by different
disciplines. Computer scientists study the structure and evolution of the Web itself. A
series of classic experiments has explored the structure of the Web, but these experiments require so much data that many have never been repeated; those that have were repeated only a few times, and rarely at different points in the Web's evolution. Social
scientists study the Web as a social phenomenon of interest in its own right and for the
evidence it provides of current social events, such as the spread of urban legends and the
development of legal concepts across time. In one experiment, computer scientists with
strong ties to sociology analyzed a billion items of data from LiveJournal, to see how the
tendency of an individual to join a community is influenced by the number of friends
within the community and by how those friends are connected to each other [11].
Figure 1: The research dialog
Market research gives general guidance, but the detailed requirements can only be
developed incrementally. The Web Lab's strategy is to build a few generic tools and
services, work with researchers to discover what they need, and implement new services
steadily over time. Figure 1 shows the dialog between researchers and computer
scientists. The dialog begins with the researchers, e.g., sociologists studying the
diffusion of ideas. They have general ideas of studies they would like to do, based on
concepts that are rooted in their discipline. Those of us who are building the lab may not
know how to create the precise data structures, algorithms, and computing systems that
they would like to have, but we can suggest alternative approaches. These suggestions
may stimulate the researchers to devise new methods. Eventually, an experiment
emerges that is both interesting from the discipline's point of view and computationally feasible.
The experiment is carried out, both parties learn more, and the dialog continues.
Organizational implications
This incremental development requires a flexible organization and close collaboration
between researchers, publishers, and librarians. Each of the three examples – the
National Virtual Observatory, Entrez, and the Web Lab – has a large and complex
computer system under the control of one organization. In theory it is possible to
provide comparable functionality in systems with distributed management, but, even
when the data and services are distributed, the practical challenges of coordination are so
great that it is wiser to build a centralized organization specializing in an area of
cyberscholarship and serving all universities. The NSF has recently announced a
program to create several centers, known as Datanets, where the collection and
preservation of scientific data is closely integrated with the research that will make use of
it. From an earlier generation of libraries, OCLC provides an example of a centralized
organization that manages data on behalf of a large number of members.
Content
Cyberscholarship requires that the data exists and is accessible. One reason that the
NCBI and the National Virtual Observatory were able to make rapid progress is that
molecular biology and astronomy have traditions of early and complete publication of
research data, and well-managed archives to look after it. For the Web Lab, the data is
already collected and preserved by an established organization, the Internet Archive; our
role is to provide access for researchers. In other fields, however, the information is
widely scattered, or never published; experimental data is poorly archived or even
discarded. In a well-known example, scholars were denied access to large portions of the
Dead Sea Scrolls from the 1950s, when they were discovered, until 1991.
Once the information has been captured, access to it in machine-readable formats is crucial. All the
information in a field must be available to the cyber scholar under reasonable terms and
conditions. Open access is ideal but not the only possibility.
Policy issues are closely tied to access. Even though the information in the Web Lab was
originally placed on the Web with open access, there are important policy issues in using
it for research. The copyright issues are well known and outside the scope of this article.
Privacy considerations may be more daunting. Data mining can violate privacy by
bringing together facts about an individual. Universities have strict policies on
experiments that involve human subjects and Cornell applies these policies to data
mining of the Web, but the situation becomes complicated when a digital library provides
services to many universities or to non-academic researchers. Finally, there is the
question of custodianship. These large digital library collections are important historical
records. The custodial duty is to preserve both the content and the identifying metadata.
It would be highly improper to destroy inconvenient data, or to index it in such a manner
that damaging information is never found.
Tools and services
Libraries provide tools and services to make the collections useful to their patrons, e.g.,
catalogs, bibliographies, classification schemes, abstracting and indexing services, and
knowledgeable reference librarians. These services have evolved over many years, based
on a collective view of how libraries are used. Conversely, the services that are available
have shaped the type of research that is carried out in libraries.
What tools and services are required for cyberscholarship? As the Web Lab develops, we
continually see new needs. An early request was for focused Web crawling, which is
used to find Web pages fitting specified criteria. The request came from work in
educational digital libraries to automate the selection of materials for a library [12].
Other requests came from a study of how opinions about a company change across time.
One hypothesis is that anchor text (the linked text in Web pages) reflects such opinions.
An informal description of the experiment is to begin with the Web in 2005 and extract
the anchor text from all external links that point to the company's Web site. The text is
processed to eliminate navigational terms, such as "click here", or "back". The frequency
with which each word appears is counted and the process is repeated with data from
subsequent years. Trends in the frequencies provide evidence about how the company is
perceived.
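
A sketch of the core of this computation is shown below. The input format, an iterable of (source URL, target URL, anchor text) triples for one year's crawl, is an assumption made for illustration; in the Web Lab the links come from the database and tools described immediately below.

    # Sketch: word frequencies of anchor text in external links to a site,
    # with navigational phrases removed. The stop list is illustrative.
    from collections import Counter
    from urllib.parse import urlsplit

    NAVIGATIONAL = {"click", "here", "back", "next", "home", "more"}

    def anchor_text_profile(links, target_host):
        counts = Counter()
        for source_url, target_url, anchor in links:
            if urlsplit(target_url).hostname != target_host:
                continue   # not a link to the site being studied
            if urlsplit(source_url).hostname == target_host:
                continue   # internal link, not an outside opinion
            counts.update(w for w in anchor.lower().split() if w not in NAVIGATIONAL)
        return counts

    # profile_2005 = anchor_text_profile(links_2005, "www.example.com")
    # profile_2006 = anchor_text_profile(links_2006, "www.example.com")

Comparing the profiles from successive years reveals the trends described above.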
This simple example uses several tools that process massive quantities of data. The first
is a way to select pages by attributes such as a domain name, format type, date, anchor
text, and links to and from each page. For this purpose, the Web Lab has created a
relational database with tables for all pages, URLs, and links. Currently it contains
metadata of the pages from four complete crawls and is about 50 terabytes in size. Other
tools are used to extract a clean set of links from a collection of pages and remove
duplicates. This is a messy task: many of the URLs embedded in Web pages refer to
pages that were never collected or never existed; URLs have many variations; because
each snapshot of the Web was collected over a period of weeks or months, a URL may refer to a page that changed before it was collected.
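
To make the idea of attribute-based selection concrete, the sketch below runs a query of this kind over a small pages/urls/links schema. The table and column names are illustrative; they are not the Web Lab's actual schema.

    # Sketch: selecting pages by attribute from a pages/urls/links schema.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pages (page_id INTEGER PRIMARY KEY, url_id INTEGER,
                            crawl_date TEXT, mime_type TEXT);
        CREATE TABLE urls  (url_id INTEGER PRIMARY KEY, url TEXT, host TEXT);
        CREATE TABLE links (from_page INTEGER, to_url INTEGER, anchor_text TEXT);
    """)

    # "Which pages outside cornell.edu link to www.law.cornell.edu?"
    query = """
        SELECT src.url, l.anchor_text
        FROM links AS l
        JOIN urls  AS dst ON l.to_url = dst.url_id
        JOIN pages AS p   ON l.from_page = p.page_id
        JOIN urls  AS src ON p.url_id = src.url_id
        WHERE dst.host = 'www.law.cornell.edu'
          AND src.host NOT LIKE '%cornell.edu'
    """
    for url, anchor in conn.execute(query):
        print(url, anchor)

At 50 terabytes the real database demands careful indexing and query planning, but the style of question is the same.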
Some tools and services are specific to a field of research, but there are two requirements
that apply to every domain. The first is an application program interface (API), so that
computer programs can interact directly with the collections. Conventional digital
libraries have been developed on the assumption that the user is a person. Much effort is
placed in the design of user interfaces, but the computer interface is usually given low
priority. In contrast, the three examples in this article all assume that the primary form of
access will be via computer programs. Researchers want computer programs to act as
their agents, searching billions of items for patterns of information that are only vaguely
guessed at.
The next requirement is to be able to identify and download sub-collections. Often, the
scale of the full collection inhibits complex analyses. Instead, the methodology is to
select part of the collection and download it to another computer for analysis. For this
purpose, digital library collections must be designed so that programs can extract large
parts of them. For example, Wikipedia encourages researchers to download everything:
the software, the content, the discussions, and all the historical trails. Currently, it is
about one terabyte, a size that can be analyzed on a smallish computer.
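
As a rough illustration, the sketch below streams through a downloaded Wikipedia pages-articles XML dump without loading it into memory; the file name is a placeholder, and only article titles are extracted.

    # Sketch: streaming over a MediaWiki XML dump one <page> at a time.
    import bz2
    import xml.etree.ElementTree as ET

    def iter_titles(dump_path):
        """Yield article titles from a pages-articles dump (.xml.bz2)."""
        with bz2.open(dump_path, "rb") as f:
            for _event, elem in ET.iterparse(f, events=("end",)):
                if elem.tag.endswith("page"):
                    for child in elem:
                        if child.tag.endswith("title"):
                            yield child.text
                    elem.clear()   # release memory; the dump is very large

    # for title in iter_titles("enwiki-pages-articles.xml.bz2"):  # placeholder
    #     ...

Because the work is a single sequential pass over the file, it fits comfortably on one machine, which is exactly why downloadable sub-collections are so valuable.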
Using high-performance computing
Very large collections require high-performance computing. Few scholars would
recognize the Cornell Web Lab as a library, but the computer system looks very familiar
to the supercomputing community. An early version is described in [13]. Powerful
computer systems at the Internet Archive in San Francisco and Cornell University in
upstate New York are connected by research networks – Internet2 and the National
LambdaRail – which have so much capacity that complete snapshots of the Web are
routinely transferred across the country. The Internet Archive operates very large
clusters of Linux computers, while Cornell has several smaller computer clusters and a
large database server dedicated to the Web Lab.
One of the most interesting challenges is to help researchers who are not computing
specialists use these systems. Until very recently, high-performance computing was a
very specialized area. Even today, experienced computing professionals are often
discouraged by the complexities of parallel programming and managing huge datasets.
Fortunately this situation is changing.
In the 1960s, the Fortran programming language first gave scientists a simple way to
translate their mathematical problems into efficient computer codes. Cyber scholars need
a similarly simple way to express massive data analysis as programs that run efficiently
on large computer clusters. The map/reduce programming paradigm developed by the
Lisp community and refined and made popular by Google appears to fulfill this need
[14]. There is an excellent, open-source software system called Hadoop that combines
map/reduce programming with a distributed file system [15]. In the Web Lab, we have
used it for tasks such as full text indexing, the anchor text study described above, and
extraction of the graph of the links between Web pages. While we are still inexperienced
in running large jobs, the programming required of the researchers is reassuringly
straightforward.
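
The sketch below shows the flavor of map/reduce programming using word counting, the standard introductory example, written in the style of Hadoop's streaming interface, where the mapper and the reducer are ordinary scripts that read lines on standard input and write tab-separated key/value pairs on standard output. It is a sketch of the paradigm, not the Web Lab's production code.

    # Minimal map/reduce word count in the Hadoop-streaming style.
    # The framework sorts mapper output by key before the reduce phase,
    # so the reducer sees identical words on adjacent lines.
    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.strip().lower().split():
                print(f"{word}\t1")

    def reducer(lines):
        pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)

The researcher writes only the two small functions; partitioning the data, scheduling the work across the cluster, and recovering from failures are handled by the framework.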
Computational limits
Modern computers are so powerful that it is easy to forget that they have limits, but
techniques that are excellent for moderately large digital libraries fail with very large
collections. The Web Lab tries to isolate cyber scholars from the complex systems that
are serving them, but they have to recognize that, with billions of records and hundreds of
terabytes of data, apparently simple operations can run forever. We need better ways to
help researchers recognize these limits, design experiments that do not exceed them, and
off-load appropriate tasks to more powerful computers.
For instance, users of the Web Lab want full-text indexes of the collections that they are
studying. The Internet Archive and Cornell University have considerable computing
resources, but not enough to index a hundred billion Web pages. Fortunately, no
researcher wants to analyze every page in the history of the entire Web. In one set of
experiments, researchers are analyzing a sub-collection of 30 million pages extracted
from the amazon.com domain. This collection would be considered large in other
contexts, but there is little difficulty in indexing it on a medium sized computer cluster.
We are experimenting to see if we can use one of the new NSF supercomputing centers to
index the entire collection, but meanwhile indexes of sub-collections are sufficient for
most purposes.
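
The data structure at the heart of such an index is simple; what makes full-text indexing expensive is the volume of data. A minimal sketch for a sub-collection:

    # Sketch: an inverted index mapping each term to the documents that
    # contain it. A production index adds positions, compression, and
    # ranking; this only shows why a bounded sub-collection is tractable.
    from collections import defaultdict

    def build_inverted_index(docs):
        """docs: iterable of (doc_id, text) pairs -> {term: set of doc_ids}."""
        index = defaultdict(set)
        for doc_id, text in docs:
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def search(index, *terms):
        """Documents containing every query term (simple Boolean AND)."""
        sets = [index.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

An index over thirty million pages is well within reach of a medium-sized cluster; the same computation over a hundred billion pages is not, which is why sub-collections matter.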
Expertise
The limiting factor in the development of cyberscholarship is likely to be a shortage of
expertise. The convergence of high performance computing and large scale digital
libraries requires technical skills that few universities possess. During the dot-com boom,
universities lost most of their expertise to start-up companies. In particular, the
development of automatic indexing moved from universities such as Cornell, Carnegie
Mellon, and Stanford to the private sector. Skill in managing large collections of semi-structured data, such as the Web, is concentrated in a few commercial organizations, such
as Google, Yahoo, Amazon, and Microsoft, and a single not-for-profit, the Internet
Archive.
Recently Google, IBM, and others have begun initiatives to encourage universities to
educate more students in the computing methods of massive data analysis. Hopefully,
some of this expertise will remain at the universities and become part of the academic
mainstream.
Opportunities for change
The underlying theme of this paper is that there is a symbiosis between the organization
of library collections and the types of research that they enable. When large digital
collections meet high performance computing, opportunities for new kinds of scholarship
emerge, but these opportunities need new approaches to how collections are organized
and how they are used.
In a thought-provoking article on the future of academic libraries, David Lewis writes,
"Real change requires real change" [16]. In universities and libraries some people are
cautious in accepting change, while others welcome it. Some individuals were early
adopters of Web searching, but others would not accept that automatic indexing could be
more effective than indexing by skilled individuals [17]. Some scholars have been
outspoken about the evils of Wikipedia [18], or have written articles that are critical of
digitized books [19], while their colleagues are enthusiastic about the opportunities that
new approaches offer.
Overall, the academic community has benefited from early adoption of new technology,
e.g., campus networks, electronic mail, and the Web. Online library catalogs, led by the
Library of Congress and OCLC, were pioneering computing systems in their day.
Information companies, such as Cisco and Google, have their roots in university projects.
Cyberscholarship provides similar opportunities. For the academic community it
provides new methodologies for research. For libraries and entrepreneurs the potential is
almost unlimited.
Acknowledgements
Parts of this article come from a position paper for the NSF/JISC workshop in Phoenix,
Arizona, April 2007: http://www.sis.pitt.edu/~repwkshop/papers/arms.html.
The Cornell Web Lab is funded in part by National Science Foundation grants CNS-0403340, SES-0537606, IIS-0634677, and IIS-0705774.
Notes and references
[1] William Y. Arms and Ronald L. Larsen (co-chairs). The future of scholarly
communication: building the infrastructure for cyberscholarship. NSF/JISC workshop,
Phoenix, Arizona, April 17 to 19, 2007. http://www.sis.pitt.edu/~repwkshop/NSF-JISC-report.pdf.
[2] Gregory Crane. What do you do with a million books? Information science
colloquium, Cornell University, March 2006.
[3] See: Jefferson's Legacy, A brief history of the Library of Congress. Library of
Congress Web site. http://www.loc.gov/loc/legacy/loc.html.
[4] Gibbon's Library, New York Times, June 19, 1897.
[5] USC Shoah Foundation Institute for Visual History and Education.
http://www.usc.edu/schools/college/vhi/.
[6] The National Digital Information Infrastructure and Preservation Program (NDIIPP)
is described on the Library of Congress's digital preservation Web site:
http://www.digitalpreservation.gov/.
[7] Alex Szalay and Jim Gray, The World-Wide Telescope, Science, 293, 2037, 2001.
The National Virtual Observatory's home page is: http://www.us-vo.org/.
[8] The Entrez homepage is: http://www.ncbi.nlm.nih.gov/Entrez. The best description
of Entrez is on Wikipedia (November 11, 2007): http://en.wikipedia.org/wiki/Entrez.
[9] William Arms, Selcuk Aya, Pavel Dmitriev, Blazej Kot, Ruth Mitchell, and Lucia
Walle, A research library based on the historical collections of the Internet Archive. D-Lib Magazine, 12 (2), February 2006.
http://www.dlib.org/dlib/february06/arms/02arms.html.
[10] The Internet Archive's home page is: http://www.archive.org/.
[11] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan, Group
formation in large social networks: membership, growth, and evolution. Twelfth Intl.
Conference on Knowledge Discovery and Data Mining, 2006.
[12] Donna Bergmark, Collection synthesis. ACM/IEEE Joint Conference on Digital
Libraries, July 2002.
[13] William Arms, Selcuk Aya, Pavel Dmitriev, Blazej Kot, Ruth Mitchell, and Lucia
Walle, Building a research library for the history of the Web. ACM/IEEE Joint
Conference on Digital Libraries, 2006.
[14] Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on
large clusters. Usenix SDI '04, 2004.
http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf.
[15] Hadoop is part of the Lucene-Apache suite of open-source software. See:
http://lucene.apache.org/hadoop/.
[16] David W. Lewis, A model for academic libraries 2005 to 2025, Visions of Change,
California State University at Sacramento, January 26, 2007.
[17] The letter by Weinheimer in D-Lib Magazine, September 2000, is a good example of
judging the new by the standards of the old.
http://www.dlib.org/dlib/september00/09letters.html#Weinheimer.
[18] See, for example: Noam Cohen, A history department bans citing Wikipedia as a
research source. New York Times, February 21, 2007.
[19] See, for example: Robert B. Townsend, Google Books: Is it good for history?
American Historical Association, September 12, 2007.
http://www.historians.org/perspectives/issues/2007/0709/0709vie1.cfm.