Data Grids: A New Computational Infrastructure for Data Intensive Science

Paul Avery
Nov. 10, 2001
Version 5
Abstract
Twenty-first century scientific and engineering enterprises are increasingly characterized by their geographic dispersion and their reliance on large data archives.
These characteristics bring with them unique challenges. First, the increasing size
and complexity of modern data collections require significant investments in information technologies to store, retrieve and analyze them. Second, the increased
distribution of people and resources in these projects has made resource sharing
and collaboration across significant geographic and organizational boundaries
critical to their success.
In this paper I explore how computing infrastructures based on Data Grids offer
data intensive enterprises a comprehensive, scalable framework for collaboration
and resource sharing. A detailed example of a Data Grid framework is presented
for a Large Hadron Collider experiment, where a hierarchical set of laboratory
and university resources comprising petaflops of processing power and a multi-petabyte data archive must be efficiently utilized by a global collaboration. The
experience gained with these new information systems, providing transparent
managed access to massive distributed data collections, will be applicable to
large-scale data-intensive problems in a wide spectrum of scientific and engineering disciplines, and eventually in industry and commerce. Such systems will be
needed in the coming decades as a central element of our information-based society.
Keywords: Grids, data, virtual data, data intensive, petabyte, petascale, petaflop, virtual organization, GriPhyN, PPDG, CMS, LHC, CERN.
Contents

1  Introduction
2  Data Intensive Activities
3  Data Grids and Data Intensive Sciences
4  Data Grid Development for the Large Hadron Collider
   4.1  The CMS Data Grid
   4.2  The Tier Concept
   4.3  Elaboration of the CMS Data Grid Model
   4.4  Advantages of the CMS Data Grid Model
5  Data Grid Architecture
   5.1  Globus-Based Infrastructure
   5.2  Virtual Data
   5.3  Development of Data Grid Architecture
6  Examples of Major Data Grid Efforts Today
   6.1  The TeraGrid
   6.2  Particle Physics Data Grid
   6.3  GriPhyN Project
   6.4  European Data Grid
   6.5  CrossGrid
   6.6  International Virtual Data Grid Laboratory and DataTAG
7  Common Infrastructure
8  Summary
References
1 Introduction
Twenty-first century scientific and engineering enterprises are increasingly characterized by their
geographic dispersion and their reliance on large data archives. These characteristics bring with
them unique challenges. First, the increasing size and complexity of modern data collections require significant investments in information technologies to store, retrieve and analyze them.
Second, the increased distribution of people and resources in these projects has made resource
sharing and collaboration across significant geographic and organizational boundaries critical to
their success.
Infrastructures known as “Grids”[1,2] are being developed to address the problem of resource
sharing. An excellent introduction to Grids can be found in the article [3], “The Anatomy of the
Grid”, which provides the following interesting description:
“The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing that we are concerned with is not primarily file exchange
but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is, necessarily,
highly controlled, with resource providers and consumers defining clearly and
carefully just what is shared, who is allowed to share, and the conditions under
which sharing occurs. A set of individuals and/or institutions defined by such
sharing rules form what we call a virtual organization (VO).”
The existence of very large distributed data collections adds a significant new dimension to enterprise-wide resource sharing, and has led to substantial research and development effort on
“Data Grid” infrastructures, capable of supporting this more complex collaborative environment.
This work has taken on more urgency for new scientific collaborations, which in some cases will
reach global proportions and share data archives with sizes measured in dozens or even hundreds
of petabytes within a decade. These collaborations have recognized the strategic importance of
Data Grids for realizing the scientific potential of their experiments, and have begun working
with computer scientists, members of other scientific and engineering fields and industry to research and develop this new technology and create production-scale computational environments. Figure 1 shows a U.S.-based Data Grid consisting of a number of heterogeneous resources.
My aim in this paper is to review Data Grid technologies and how they can benefit data intensive
sciences. Developments in industry are not included here, but since most Data Grid work is
presently carried out to address the urgent data needs of advanced scientific experiments, the
omission is not a serious one. (The problems solved while dealing with these experiments will in
any case be of enormous benefit to industry in a short time.) Furthermore, I will concentrate on
those projects which are developing Data Grid infrastructures for a variety of disciplines, rather
than “vertically integrated” projects that benefit a single experiment or discipline, and explain the
specific challenges faced by those disciplines.
2 Data Intensive Activities
The number and diversity of data intensive projects are expanding rapidly. The following survey of projects, while incomplete, shows the scope of data intensive methods and the immense interest in applying them to scientific problems.
Physics and space sciences: High energy and nuclear physics experiments at accelerator laboratories at Fermilab, Brookhaven and SLAC already generate dozens to hundreds of terabytes of
colliding beam data per year that is distributed to and analyzed by hundreds of physicists around
the world to search for subtle new interactions. Upgrades to these experiments and new experiments planned for the Large Hadron Collider at CERN will increase data rates to petabytes per
year. Gravitational wave searches at LIGO, VIRGO and GEO will accumulate yearly samples of
approximately 100 terabytes of mostly environmental and calibration data that must be correlated
and filtered to search for rare gravitational events. New multi-wavelength all-sky surveys utilizing telescopes instrumented with gigapixel CCD arrays will soon drive yearly data collection
rates from terabytes to petabytes. Similarly, remote-sensing satellites operating at multiple
wavelengths will generate several petabytes of spatial-temporal data that can be studied by researchers to accurately measure changes in our planet’s support systems.
Biology and medicine: Biology and medicine are rapidly increasing their dependence on data
intensive methods. New generation X-ray sources coupled with improved data collection methods are expected to generate copious data samples from biological specimens with ultra-short
time resolutions. Many organism genomes are being sequenced by extremely fast “shotgun”
methods requiring enormous computational power. The resulting flow of genome data is rising
exponentially, with standard databases doubling in size every few months. Sophisticated searches of these databases will require much larger computational resources as well as new statistical
methods and metadata to keep up with the flow of data. Proteomics, the study of protein structure and function, is expected to generate enormous amounts of data, easily dwarfing the data
samples obtained from genome studies. When applied to protein-protein interactions, which are of extreme importance to drug designers, these studies will require additional orders of magnitude
increases in computational capacity (to hundreds of petaflops) and storage sizes (to thousands of
petabytes). In medicine, a single three dimensional brain scan can generate a significant fraction
of a terabyte of data, while systematic adoption of digitized radiology scans will produce dozens
of petabytes of data that can be quickly accessed and searched for breast cancer and other diseases. Exploratory studies have shown the value of converting patient records to electronic form
and attaching digital CAT scans, X-Ray charts and other instrument data, but systematic use of
such methods would generate databases many petabytes in size. Medical data pose additional
ethical and technical challenges stemming from exacting security restrictions on access to this
data and patient identification.
Computer simulations: Advances in information technology in recent years have given scientists and engineers the ability to develop sophisticated simulation and modeling techniques for
improved understanding of the behavior of complex systems. When coupled to the huge processing power and storage resources available in supercomputers or large computer clusters,
these advanced simulation and modeling methods become tools of rare power, permitting detailed and rapid studies of physical processes while sharply reducing the need to conduct lengthy
and costly experiments or to build expensive prototypes. The following examples provide a hint
of the potential of modern simulation methods. High energy and nuclear physics experiments
routinely generate simulated datasets whose size (in the multi-terabyte range) is comparable to
and sometimes exceeds the raw data collected by the same experiment. Supercomputers generate enormous databases from long-term simulations of climate systems with different parameters
that can be compared with one another and with remote satellite sensing data. Environmental
modeling of bays and estuaries using fine-scale fluid dynamics calculations generates massive
datasets that permit the calculation of pollutant dispersal scenarios under different assumptions
that can be compared with measurements. These projects also have geographically distributed
user communities who must access and manipulate these databases.
3 Data Grids and Data Intensive Sciences
To develop the argument that Data Grids offer a comprehensive solution to data intensive activities, I first summarize some general features of Grid technologies. These technologies comprise
a mixture of protocols, services, and tools that are collectively called “middleware”, reflecting
the fact that they are accessed by “higher level” applications or application tools while they in
turn invoke processing, storage, network and other services at “lower” software and hardware
levels. Grid middleware includes security and policy mechanisms that work across multiple institutions; resource management tools that support access to remote information resources and
simultaneous allocation (“co-allocation”) of multiple resources; general information protocols
and services that provide important status information about hardware and software resources,
site configurations, and services; and data management tools that locate and transport datasets
between storage systems and applications.
The diagram in Figure 2 outlines in a simple way the roles played by these various Grid technologies. The lowest level Fabric contains shared resources such as computer clusters, data storage
systems, catalogs, networks, etc. that Grid tools must access and manipulate. The Resource and
Connectivity layers provide, respectively, access to individual resources and the communication
and authentication tools needed to communicate with them. Coordinated use of multiple resources – possibly at different sites – is handled by Collective protocols, APIs, and services.
Applications and application toolkits utilize these Grid services in myriad ways to provide “Grid-aware” services for members of a particular virtual organization. A much more detailed explication of Grid architecture can be found in reference [3].
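To make the layering concrete, the following small sketch (illustrative Python, not drawn from any actual Grid toolkit; the example components listed for each layer are representative placeholders) shows one way of writing down the decomposition of Figure 2.

```python
# Illustrative representation of the layered Grid architecture of Figure 2.
# Layer names follow the Fabric / Connectivity / Resource / Collective /
# Application decomposition of reference [3]; the example entries are
# hypothetical placeholders, not components of any particular toolkit.
GRID_LAYERS = {
    "Fabric":       ["compute clusters", "storage systems", "catalogs", "networks"],
    "Connectivity": ["communication protocols", "authentication"],
    "Resource":     ["access to an individual compute or storage resource"],
    "Collective":   ["co-allocation", "replica management", "community scheduling"],
    "Application":  ["experiment analysis frameworks", "Grid-aware portals"],
}

for layer, examples in GRID_LAYERS.items():
    print(f"{layer:>12}: {', '.join(examples)}")
```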
While standard Grid infrastructures provide distributed scientific communities with the ability to collaborate and share resources, additional capabilities are needed to cope with the specific challenges
associated with scientists accessing and manipulating very large distributed data collections.
These collections, ranging in size from terabytes (TB) to petabytes (PB), comprise raw (measured) and many levels of processed or refined data as well as comprehensive metadata describing, for example, under what conditions the data was generated or collected, how large it is, etc.
New protocols and services must facilitate access to significant tertiary (e.g., tape) and secondary
(disk) storage repositories to allow efficient and rapid access to primary data stores, while taking
advantage of disk caches that buffer very large data flows between sites. They also must make
efficient use of high performance networks that are critically important for the timely completion
of these transfers. Thus to transport 10 TB of data to a computational resource in a single day
requires a 1 Gigabit per second network operated at 100% utilization. Efficient use of these extremely high network bandwidths also requires special software interfaces and programs that in
most cases have yet to be developed.
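The bandwidth arithmetic behind this statement is simple and worth making explicit. The short calculation below is a generic sketch using only the figures quoted in the text; it is not part of any Grid toolkit.

```python
def required_bandwidth_gbps(data_tb: float, hours: float, efficiency: float = 1.0) -> float:
    """Bandwidth (gigabits per second) needed to move `data_tb` terabytes
    in `hours` hours at the given average link utilization."""
    bits = data_tb * 1e12 * 8      # terabytes -> bits
    seconds = hours * 3600.0
    return bits / seconds / efficiency / 1e9

# Moving 10 TB in one day at 100% utilization needs roughly 0.93 Gb/s,
# i.e. a fully used 1 gigabit per second link.
print(required_bandwidth_gbps(10, 24))        # ~0.93
# The same transfer at a more realistic 40% utilization needs ~2.3 Gb/s.
print(required_bandwidth_gbps(10, 24, 0.4))   # ~2.3
```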
The computational and data management problems encountered in data intensive research include the following challenging aspects:

• Computation-intensive as well as data-intensive: Analysis tasks are compute-intensive and data-intensive and can involve hundreds or even thousands of computer, data handling, and network resources. The central problem is coordinated management of computation and data, not just data curation and movement.

• Need for large-scale coordination without centralized control: Rigorous performance goals require coordinated management of numerous resources, yet these resources are, for both technical and strategic reasons, highly distributed and not always amenable to tight centralized control.

• Large dynamic range in user demands and resource capabilities: It must be possible to support and arbitrate among a complex task mix of experiment-wide, group-oriented, and (perhaps thousands of) individual activities, using I/O channels, local area networks, and wide area networks that span several distance scales.

• Data and resource sharing: Large dynamic communities would like to benefit from the advantages of intra- and inter-community sharing of data products and the resources needed to produce and store them.
The “Data Grid” has been introduced as a unifying concept to describe the new technologies required to support such next-generation data-intensive applications—technologies that will be
critical to future data-intensive computing in the many areas of science and commerce in which
sophisticated software must harness large amounts of computing, communication and storage
resources to extract information from data. Data Grids are typically characterized by the following elements: (1) they have large extent (national and even global) and scale (many resources
and users); (2) they layer sophisticated new services on top of existing local mechanisms and interfaces, facilitating coordinated sharing of remote resources; and (3) they provide a new dimension of transparency in how computational and data processing are integrated to provide data
products to user applications. This transparency is vitally important for sharing heterogeneous
distributed resources in a manageable way, a point to which I return later.
4 Data Grid Development for the Large Hadron Collider
I now turn to a particular experimental activity, high energy physics at the Large Hadron Collider
(LHC) at CERN, to provide a concrete example of how a Data Grid computing framework might
be implemented to meet the demanding computational and collaborative needs of a data intensive experiment. The extreme requirements of experiments at the LHC, due to commence operations in 2006, have been known to physicists for several years. Distributed architectures based
on Data Grid technologies have been proposed as solutions, and a number of initiatives are now
underway to develop implementations at increasing levels of scale and complexity. The particular architectural solution shown here is not necessarily the most effective or efficient one for other disciplines, but it does offer some valuable insights about the technical and even political merits of Data Grids when applied to large-scale problems.
For definiteness, and without sacrificing much generality, I focus on the CMS [30] experiment
at the LHC. CMS faces computing challenges of unprecedented scale in terms of data volume,
processing requirements, and the complexity and distributed nature of the analysis and simulation tasks among thousands of physicists worldwide. The detector will record events (an “event” is the collision of two energetic beam particles that produces dozens or hundreds of secondary particles) at a rate of approximately 100 Hz, accumulating 100 MB/sec of raw data, or several petabytes per year of raw and processed data, in the early years of operation. The data storage rate is expected to grow in response to the pressures of higher luminosity, new physics triggers and better storage capabilities, leading to data collections of approximately 20-30 PB by 2010, rising to several hundred petabytes over the following decade.
The computational resources required to reconstruct and simulate this data are similarly vast.
Each raw event will be approximately 1 MB in size and require roughly 3000 SI95-sec to reconstruct and 5000 SI95-sec to fully simulate, where SI95 (SpecInt95) is a standard unit of processor non-floating point speed. Estimates based on the initial 100 Hz data rate have
yielded a total computing capability of approximately 1800K SI95 in the first year of operation,
rising rapidly to keep up with the total accumulated data. Approximately one third of this capability will reside at CERN and the remaining two thirds will be provided by the member nations
[4]. To put the numbers in context, 1800K SI95s is roughly equivalent to 60,000 of today’s
high-end PCs.
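These figures can be reproduced with a back-of-the-envelope calculation. The sketch below uses only the numbers quoted above, plus two assumptions made purely for illustration: roughly 10^7 seconds of effective data taking per year, and the ~30 SI95 per high-end PC implied by the comparison of 1800K SI95 with 60,000 PCs.

```python
# Back-of-the-envelope check of the CMS numbers quoted above.
EVENT_RATE_HZ = 100          # events recorded per second
EVENT_SIZE_MB = 1.0          # approximate raw event size
RECO_SI95_SEC = 3000         # CPU cost to reconstruct one event
LIVE_SECONDS_PER_YEAR = 1e7  # assumed effective data-taking time per year
SI95_PER_PC = 30             # conversion implied by 1800K SI95 ~ 60,000 PCs

raw_rate_mb_per_sec = EVENT_RATE_HZ * EVENT_SIZE_MB                       # ~100 MB/sec
raw_data_pb_per_year = raw_rate_mb_per_sec * LIVE_SECONDS_PER_YEAR / 1e9  # ~1 PB/year of raw data

# CPU needed just to keep up with prompt reconstruction (only part of the
# 1800K SI95 total, which also covers simulation, reprocessing and analysis):
prompt_reco_si95 = EVENT_RATE_HZ * RECO_SI95_SEC                          # 300,000 SI95
prompt_reco_pcs = prompt_reco_si95 / SI95_PER_PC                          # ~10,000 PCs

print(raw_rate_mb_per_sec, raw_data_pb_per_year, prompt_reco_si95, prompt_reco_pcs)
```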
4.1 The CMS Data Grid
The challenge facing CMS is how to build an infrastructure that will provide these computational
and storage resources and permit their effective and efficient use by a scientific community of
several thousand physicists spread across the globe. Simple scaling arguments as well as more
sophisticated studies using the MONARC [5] simulation package have shown that a distributed
infrastructure based on regional centers provides the most effective technical solution, since
large geographical clusters of users are close to the datasets and resources that they employ. A
distributed configuration is also preferred from a political perspective since it allows local control of resources and some degree of autonomy in pursuing physics objectives.
As discussed in the previous section, the distributed computing model can be made even more
productive by arranging CMS resources as a Grid, specifically a Data Grid in which large computational resources and massive data collections (including cached copies and data catalogs)
linked by very high-speed networks and sophisticated middleware software form a single computational resource accessible from any location. The Data Grid framework is expected to play a
key role in realizing CMS’ scientific potential by transparently mobilizing large-scale computing
resources for large-scale computing tasks and by providing a collaboration-wide computing fabric that permits full participation in the CMS research program by physicists at their home institutes. This latter point is particularly relevant for participants in remote or distant regions. As a
result, a highly distributed, hierarchical computing infrastructure exploiting Grid technologies is
a central element of the CMS worldwide computing model.
It should be noted that while LHC data taking will not begin until 2006, CMS already has large
and highly distributed computing and software operations. These operations serve immediate
and near term needs such as test beam data analysis, detector performance and physics simulation studies that support detector design and optimization, software development and associated
scalability studies, and “Data Challenges” involving high-throughput, high-volume stress tests of
offline processing software and facilities. These early needs require Grid R&D, software development, engineering tests and staged deployment of significant computing resources (including
prototypes) to take place throughout the 2001–2006 period.
The CMS software and computing resources will be coherently shared and managed as a Data
Grid, following the ideas developed by the GriPhyN [6], Particle Physics Data Grid [7] (PPDG)
and European DataGrid [8] projects. The distributed nature of the system allows one to take advantage of the funding and manpower resources available in each region, to balance the proximity of large datasets to appropriately large centralized processing resources against the proximity of smaller, frequently accessed data to the end users, and to draw on support and expertise in different time zones for efficient data analysis.
4.2 The Tier Concept
The computing resources of CMS will be arranged in a “hierarchy” of five levels, interconnected by high-speed regional, national and international networks:

• Tier0: The central facility at CERN where the experimental data is taken, and where all the raw data is stored and initially reconstructed.

• Tier1: A major national center supporting the full range of computing, data handling and support services required for effective analysis by a community of several hundred physicists. Much of the re-reconstruction as well as the analysis of the data will be carried out at these centers, which for the most part will be located at national laboratories.

• Tier2: A smaller system supporting analysis and reconstruction on demand by a community of typically 30–50 physicists, sited at a university or research laboratory.

• Tier3: A workgroup cluster specific to a university department or a single high energy physics group, of the sort traditionally used to support local needs for data analysis.

• Tier4: An access device such as an individual user’s desktop, laptop or even mobile device.
Figure 3 shows a simplified view of the architecture of the CMS Data Grid system and the data
rates involved. The general concept of the distributed computing model is well illustrated here,
though the network bandwidths shown are approximate and are the subject of ongoing study [9].
The partitioning of resources is expected to be roughly 1:1:1 among the CERN Tier0 center, the
sum of the Tier1 centers, and the sum of the larger number of Tier2 centers. In practice this
means that each Tier1 should have approximately 15-20 percent of the capability of the Tier0
and each Tier2 should have approximately 20 percent of the capability of a Tier1. It is expected
that the worldwide CMS distributed computing system will include approximately 5-6 Tier1
laboratories.
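The arithmetic behind these fractions is shown in the short sketch below. The total capacity is the first-year estimate quoted earlier; the choice of 5 Tier1 sites and 5 Tier2 sites per Tier1 is an assumption made purely for illustration.

```python
# Rough apportionment of CMS computing capacity under the 1:1:1 split
# between the Tier0, the sum of the Tier1s, and the sum of the Tier2s.
TOTAL_SI95 = 1800e3      # first-year estimate quoted in the text
N_TIER1 = 5              # the text expects approximately 5-6 Tier1 laboratories
TIER2_PER_TIER1 = 5      # assumed number of Tier2 centers per Tier1, for illustration

tier0 = TOTAL_SI95 / 3
tier1_each = (TOTAL_SI95 / 3) / N_TIER1
tier2_each = (TOTAL_SI95 / 3) / (N_TIER1 * TIER2_PER_TIER1)

print(f"Tier0:      {tier0:,.0f} SI95")
print(f"Each Tier1: {tier1_each:,.0f} SI95 (~{tier1_each / tier0:.0%} of the Tier0)")
print(f"Each Tier2: {tier2_each:,.0f} SI95 (~{tier2_each / tier1_each:.0%} of a Tier1)")
```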
4.3 Elaboration of the CMS Data Grid Model
The purpose of the Tier1 and Tier2 centers is to provide the enabling infrastructure to allow
physicists in each world region to participate fully in the physics program of their experiment,
including when they are at their home institutions. The enabling infrastructure consists of the
software, computing and data handling hardware, and the support services needed to access, analyze, manage and understand the experimental data. The Tier1 national centers are foreseen to
provide a full range of data handling as well as computational and user support services, and
each will serve a community of approximately 200-500 physicists [10]. The Tier1 centers will
be sited in the U.S., France, Germany, Italy, the UK, and Russia. Unlike the U.S. center, the European sites will serve multiple LHC experiments.
Several models for Tier2 operation have been discussed. Typically, a Tier2 center would be configured to serve a physics community of roughly 30–50 physicists, using a cost-effective cluster
of rack-mounted computational devices with sufficient memory to run large applications, interconnected with commodity (e.g., Ethernet) networking, for simulation production, reconstruction
and analysis. Tier2 sites will be capable of contributing to the distributed production of reconstructed data on a small to medium scale, and will require relatively modest local manpower for
system operation. Data that is to be served with high availability and performance is stored in
multi-terabyte RAID arrays. A small tape library is used for tape import/export, and may also be
used to backup data and software, especially if there is no large archival tape system at the Tier 2
site.
Tier2 centers will, except in a few cases, lack the full range of data handling and support services
that are required at a Tier1 center. Each center may help to serve one region or a functional subgroup within a large country, working in partnership with the national Tier1 center, or it may
serve a small country whose national needs do not require a Tier1. A strong design consideration is that Tier2 sites should have the flexibility to respond more quickly than the production-oriented Tier1 laboratories to software tests and changes in physics priorities, and that their hardware configurations should be chosen so as to be manageable by a small support team.
A Tier3 site consists of a workgroup cluster specific to a university department or a single high
energy physics group, of the sort traditionally used to support local needs for data analysis, typically serving 5–15 physicists. The scale of these systems is expected to be such that their use for
organized production may be limited in general (though some universities may have substantial
resources that could be harnessed for large productions), and restricted to low-I/O applications
such as simulations and small analyses.
4.4 Advantages of the CMS Data Grid Model
The Tier2 centers in the distributed computing model bring with them a number of other advantages, apart from that of gathering and managing resources worldwide to meet the aggregate
computing needs. First, by balancing on-demand local use and centrally coordinated “production” use, each experiment can ensure that individual scientists and small workgroups have the
means at their disposal to develop new software and new lines of analysis efficiently. In many
cases, siting frequently accessed data close to the users leads to more rapid turnaround, as a result of the higher throughput and shorter delays achievable over local area and/or shorter wide
area network links.
Second, situating Tier2 centers at universities and multi-purpose research laboratories combines
the research and education mission, making students and young scientists part of the ongoing
process of exploration and discovery, including many who will rarely be able to visit the central
laboratory.
Third, by placing one or more appropriately-scaled centers in each world region, the scope of the
analysis will lead to an increased number of intellectual focal points for student-faculty interaction and mentoring. The distributed nature of the Tier2s and their flexible balance between on-demand use by local and regional groups and organized use coordinated by the Tier1 center will also lead to new modes of daily partnership between the universities and laboratories, with greater continuity and less reliance on people’s travel schedules, as well as more effective and creative collaborative work among the members of small teams.
Finally, CMS plans to deploy next-generation remote collaborative systems [11] at each participating institution as a way of ensuring coherency among research efforts. These systems include
videoconferencing, but also encompass a wider set of tools such as document sharing, whiteboards, remote test beam and instrument control and remote visualization. Such systems and the
protocols needed to support them are under active development by research groups and industry
and are being deployed in a variety of settings. The group-oriented analysis activities surrounding periodic “data challenges” and other experiment-wide milestones will help CMS optimize
these collaborative systems prior to the arrival of real data in 2006.
5 Data Grid Architecture
5.1 Globus-Based Infrastructure
The Globus Project [26], a joint research and development effort of Argonne National Laboratory, the Information Sciences Institute of the University of Southern California, and the University of Chicago, has over the past several years developed the most comprehensive and widely
used Grid framework available today. The widespread adoption of Globus technologies is due in
large measure to the fact that its members work closely with a variety of scientific and engineering projects while maintaining compatibility with other computer science toolkits used by these
projects. Globus tools are being studied, developed, and enhanced at institutions worldwide to
create new Grids and services, and to conduct computing research.
Globus components provide the capabilities to create Grids of computing resources and users;
track the capabilities of resources within a Grid; specify the resource needs of users’ computing
tasks; mutually authenticate both users and resources; and deliver data to and from remotely executed computing tasks. Globus is distributed in a modular and open toolkit form, facilitating the
incorporation of additional services into scientific environments and applications. For example,
Globus has been integrated with technologies such as the Condor high-throughput computing
environment [12,13] and the Portable Batch System (PBS) job scheduler [14]. Both of these integrations demonstrate the power and value of the open protocol toolkit approach. Experience
with projects such as the NSF’s National Technology Grid [15], NASA’s Information Power
Grid [16] and DOE ASCI’s Distributed Resource Management Project [17] have provided considerable experience with the creation of production infrastructures.
5.2 Virtual Data
Virtual data is of key importance for exploiting the full potential of Data Grids. The concept actually refers to two related ideas: transparency with respect to location, where data might be
stored in a number of different locations, including caches, to improve access speed and/or reliability, and transparency with respect to materialization, where the data can be computed by an
algorithm, to facilitate defining, sharing, and use of data derivation mechanisms. Both characteristics enable the definition and delivery of a potentially unlimited virtual space of data products
derived from other data. In this virtual space, requests can be satisfied via direct retrieval of materialized products and/or computation, with local and global resource management, policy, and
security constraints determining the strategy used. The concept of virtual data recognizes that all
except irreproducible raw experimental data need ‘exist’ physically only as the specification for
how they may be derived. Grid infrastructures may instantiate zero, one, or many copies of derivable data depending on probable demand and the relative costs of computation, storage, and
transport. On a much smaller scale, this dynamic processing, construction, and delivery of data
is precisely the strategy used to generate much, if not most, of the web content delivered in response to queries today. Figure 4 provides a simple illustration of how virtual data might be
computed or fetched from local, regional or large computing facilities.
Consider an astronomer using SDSS to investigate correlations in galaxy orientation due to
lensing effects by intergalactic dark matter [18,19,20]. A large number of galaxies (some 10^7)
must be analyzed to get good statistics, with careful filtering to avoid bias. For each galaxy, the
astronomer must first obtain an image, a few pixels on a side; process it in a computationally intensive analysis; and store the results. Execution of this request involves virtual data catalog accesses to determine whether the required analyses have been previously constructed. If they
have not, the catalog must be accessed again to locate the applications needed to perform the
transformation and to determine whether the required raw data is located in network cache, remote disk systems, or deep archive. Appropriate computer, network, and storage resources must
be located and applied to access and transfer raw data and images, produce the missing images,
and construct the desired result. The execution of this single request may involve thousands of
processors and the movement of terabytes of data among archives, disk caches, and computer
systems nationwide.
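The decision at the heart of such a request, namely whether to fetch an existing materialized product or re-derive it from its recipe, can be sketched in a few lines. The catalog structure, cost model and names below are hypothetical illustrations, not the interface of any existing virtual data catalog.

```python
# Illustrative resolution of a virtual-data request: deliver a materialized
# copy if one exists and is cheap to move, otherwise re-derive the product.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class VirtualDataEntry:
    name: str
    replicas: List[str]                    # sites holding a materialized copy
    derive: Optional[Callable[[], bytes]]  # recipe (transformation) to recompute the product
    transfer_cost: float                   # estimated cost to fetch an existing replica
    compute_cost: float                    # estimated cost to re-derive the product

def resolve(entry: VirtualDataEntry) -> str:
    """Decide how a request for a derived data product should be satisfied."""
    if entry.replicas and entry.transfer_cost <= entry.compute_cost:
        # A real planner would also weigh policy, site load and data locality.
        return f"fetch {entry.name} from {entry.replicas[0]}"
    if entry.derive is not None:
        entry.derive()                     # schedule the transformation on Grid resources
        return f"recompute {entry.name} from its derivation recipe"
    raise LookupError(f"{entry.name} is neither materialized nor derivable")

# Example: a processed galaxy image that is cached remotely but cheaper to recompute.
image = VirtualDataEntry(name="galaxy_1234_shear_map",
                         replicas=["tier2.example.edu"],
                         derive=lambda: b"derived bytes",
                         transfer_cost=5.0,
                         compute_cost=2.0)
print(resolve(image))                      # -> recompute, since computation is cheaper here
```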
Virtual data grid technologies are expected to be of immediate benefit to numerous other scientific and engineering application areas. For example, NSF and NIH fund scores of X-ray crystallography labs that together are generating many terabytes (soon petabytes) of molecular structure
data each year. Only a small fraction of this data is being shared via existing publication mechanisms. Similar observations can be made concerning long-term seismic data generated by geologists, data synthesized from studies of the human genome database, brain imaging data, output
from long-duration, high-resolution climate model simulations, and data produced by NASA’s
Earth Observing System.
5.3 Development of Data Grid Architecture
As described in the next section, a number of scientific collaborations are developing and deploying Data Grids, a fact that has led Grid researchers to look at general approaches to developing Data Grid architectures. A Data Grid architecture must have a highly modular structure, defining a variety of components, each with its own protocol and/or application programming interfaces (APIs). These various components also should be designed for use in an integrated fashion, as part of a comprehensive Data Grid architecture. The modular structure facilitates integration with discipline-specific systems that already have substantial investments in data management systems, a consequence of their need to provide high-performance data services to their users.
6 Examples of Major Data Grid Efforts Today
I describe in this section some of the major projects that have been undertaken to develop Data
Grids. Without exception, these efforts have been driven by the overwhelming and near-term
needs of current and planned scientific experiments, and have led to fruitful collaborations between application scientists and computer scientists. High-energy physics has been at the forefront of this activity, a fact that can be attributed to its historical standing as both a highly data
intensive and highly collaborative discipline. As noted in the last section, however, the existing
high-energy physics computing infrastructures are not scalable to upgraded experiments at
SLAC, Fermilab and Brookhaven, and new experiments at the LHC, that will generate petabytes
of data per year and be analyzed by global collaborations. Participants in other planned experiments in nuclear physics, gravitational research, large digital sky surveys, and virtual astronomical observatories, fields which also face challenges associated with massive distributed data collections and a dispersed user community, have also decided to adopt Data Grid computing infrastructures. Scientists from these and other disciplines are exploiting new initiatives and partnering with computer scientists and each other to develop production-scale Data Grids.
A tightly coordinated set of projects has been established that together are developing and applying Data Grid concepts to problems of tremendous scientific importance, in such areas as high
energy physics, nuclear physics, astronomy, bioinformatics and climate science. These projects
include (1) the Particle Physics Data Grid (PPDG [21]), which is focused on the application of
Data Grid concepts to the needs of a number of U.S.-based high energy and nuclear physics experiments; (2) the Earth System Grid project [22], which is exploring applications in climate and
specific technical problems relating to request management; (3) the GriPhyN [23] project, which
plans to conduct extensive computer science research on Data Grid problems and develop general tools that will support the automatic generation and management of derived, or “virtual”,
data for four leading experiments in high energy physics, gravitational wave searches and astronomy; (4) the European Data Grid (EDG) project, which aims to develop an operational Data
Grid infrastructure supporting high energy physics, bioinformatics, and satellite sensing; (5) the
TeraGrid [24], which will provide a massive distributed computing and data resource connected
by ultra-high speed optical networks; and (6) the International Virtual Data Grid Laboratory
(iVDGL [25]), which will provide a worldwide set of resources for Data Grid tests by a variety
of disciplines. These projects have adopted the Globus [26] toolkit for their basic Grid infrastructure to speed the development of Data Grids. The Globus directors have played a leadership
role in establishing a broad national—and indeed international—consensus on the importance of
Data Grid concepts and on specifics of a Data Grid architecture. I will discuss these projects in
varying levels of detail in the rest of this section while leaving coordination and common architecture issues for the following section.
6.1 The TeraGrid
The TeraGrid Project [24] was recently funded by the National Science Foundation for $53M
over 3 years to construct a distributed supercomputing facility at four sites: the National Center
for Supercomputing Applications [27] in Illinois, the San Diego Supercomputing Center, Caltech’s Center for Advanced Computational Research and Argonne National Laboratory. The
project aims to build and deploy the world's largest, fastest, most comprehensive, distributed infrastructure for open scientific research. When completed, the TeraGrid will include 13.6 teraflops of Linux cluster computing power distributed at the four TeraGrid sites, facilities capable
of managing and storing more than 450 terabytes of data, high-resolution visualization environments, and toolkits for Grid computing. These components will be tightly integrated and connected through an optical network that will initially operate at 40 gigabits per second and later be
upgraded to 50-80 gigabits/second—an order of magnitude beyond today's fastest research network. TeraGrid aims to partner with other Grid projects and has active plans to help several discipline sciences deploy their applications, including dark matter calculations, weather forecasting, biomolecular electrostatics, and quantum molecular calculations [28].
6.2 Particle Physics Data Grid
The Particle Physics Data Grid (PPDG) is a collaboration of computer scientists and physicists
from six experiments who plan to develop, evaluate and deliver Grid-enabled tools for data-intensive collaboration in particle and nuclear physics. The project has been funded by the U.S.
Department of Energy since 1999 and was recently funded for over US$3.1M in 2001 (funding is
expected to continue at a similar level for 2002-2003) to establish a “collaboratory pilot” to pursue these goals. The new three-year program will exploit the strong driving force provided by
currently running high energy and nuclear physics experiments at SLAC, Fermilab and
Brookhaven together with recent advances in Grid middleware.
Novel mechanisms and policies will be vertically integrated with Grid middleware and experiment-specific applications and computing resources to form effective end-to-end capabilities. The project’s goals and plans are guided by the immediate, medium-term and longer-term needs and perspectives of the LHC experiments ATLAS [29] and CMS [30], which will run for at least a decade from late 2005, and by the research and development agenda of other Grid-oriented efforts. PPDG exploits the immediate needs of running experiments (BaBar [31], D0 [32], STAR [33] and the Jlab [34] experiments) to stress-test both concepts and software in return for significant medium-term benefits, and is actively involved in establishing the necessary coordination between potentially complementary Data Grid initiatives in the US, Europe and beyond.
The BaBar experiment faces the challenge of data volumes and analysis needs that are planned to grow by more than a factor of 20 by 2005. During 2001, the CNRS-funded computer center at CCIN2P3 in Lyon, France, will join SLAC in contributing data analysis facilities to the fabric of the collaboration. The STAR experiment at RHIC has already acquired its first data and has identified Grid
services as the most effective way to couple the facilities at Brookhaven with its second major
center for data analysis at LBNL. An important component of the D0 fabric is the SAM [35] distributed data management system at Fermilab, to be linked to applications at major US and international sites. The LHC collaborations have identified data-intensive collaboratories as a vital
component of their plan to analyze tens of petabytes of data in the second half of this decade. US
CMS is developing a prototype worldwide distributed data production system for detector and
physics studies.
6.3 GriPhyN Project
GriPhyN is a large collaboration of computer scientists and experimental physicists and astronomers who aim to provide the information technology (IT) advances required for petabyte-scale
data intensive sciences and implement production-scale Data Grids. Funded by the National Science Foundation at US$11.9M for 2000-2005, the project is driven by the requirements of four forefront experiments: the ATLAS [29] and CMS [30] experiments at the LHC, the Laser Interferometer Gravitational Wave Observatory (LIGO [36]) and the Sloan Digital Sky Survey (SDSS
[37]). These requirements, however, easily generalize to other sciences as well as 21st century
commerce, so the GriPhyN team is pursuing IT advances centered on the creation of Petascale
Virtual Data Grids (PVDG) that meet the data-intensive computational needs of a diverse community of thousands of scientists spread across the globe.
GriPhyN has adopted the concept of virtual data, described in Section 5.2, as a unifying theme for its investigations of Data Grid concepts and technologies: transparency with respect to location, as a means of improving access speed and/or reliability, and transparency with respect to materialization, as a means of facilitating the definition, sharing, and use of data derivation mechanisms. These two characteristics combine to enable the definition and delivery of a potentially unlimited virtual space of data products derived from other data, with requests satisfied via direct retrieval of materialized products and/or computation.
In order to realize these concepts, GriPhyN is conducting research into virtual data cataloging,
execution planning, execution management, and performance analysis issues (see Figure 5). The
results of this research, and other relevant technologies, are developed and integrated to form a
Virtual Data Toolkit (VDT). Successive VDT releases will be applied and evaluated in the context of the four partner experiments. VDT 1.0 was released in October 2001 and the next release
is expected early in 2002.
6.4 European Data Grid
The EU DataGrid project [8] is a European Union collaboration of computer scientists and researchers from the LHC experiments, molecular genomics and Earth observation whose goal is to research and deploy Data Grid technologies in support of European science. EU DataGrid received 9.8M Euros in funding for 2001–2004 to carry out its mission. Its work is broken into 12 work packages, of which the first nine involve technologies, networks and testbeds relating to the LHC and two others are related to biology applications and satellite remote sensing.
The DataGrid initiative is led by CERN together with five other main partners and fifteen associated partners. The project brings together the following leading European research agencies: the European Space Agency (ESA), France’s Centre National de la Recherche Scientifique (CNRS), Italy’s Istituto Nazionale di Fisica Nucleare (INFN), the Dutch National Institute for Nuclear Physics and High Energy Physics (NIKHEF) and the UK’s Particle Physics and Astronomy Research
Council (PPARC). The fifteen associated partners come from the Czech Republic, Finland,
France, Germany, Hungary, Italy, the Netherlands, Spain, Sweden and the United Kingdom.
Many other projects funded by EU nations work closely with DataGrid, tremendously increasing
its effectiveness. EU DataGrid collaborates closely with the U.S. GriPhyN and PPDG projects.
6.5 CrossGrid
The CrossGrid project will develop, implement and exploit new Grid components for interactive compute and data intensive applications, including simulation and visualization of surgical procedures, flooding simulations, team decision support systems, distributed data analysis in high-energy physics, air pollution and weather forecasting. The resulting methodology, generic application architecture, programming environment, and new Grid services will be validated and
tested thoroughly on the CrossGrid testbed, with an emphasis on a user friendly environment.
The work will be done in close collaboration with the Grid Forum and the DataGrid project to
profit from their results and experience, and to obtain full interoperability. This will result in the
further extension of the Grid across eleven European countries.
6.6 International Virtual Data Grid Laboratory and DataTAG
The International Virtual Data Grid Laboratory (iVDGL) will comprise heterogeneous computing and storage resources in the U.S., Europe, Asia, Australia and South America linked by
high-speed networks. It will be operated as a single system for the purposes of interdisciplinary
experimentation in Grid-enabled data-intensive scientific computing. Laboratory users will include international scientific collaborations such as the Laser Interferometer Gravitational-wave
Observatory (LIGO), the ATLAS and CMS detectors at the Large Hadron Collider (LHC) at
CERN, the Sloan Digital Sky Survey (SDSS), and the U.S. National Virtual Observatory (NVO);
application groups affiliated with the NSF Supercomputer centers and EU projects; outreach activities; and Grid technology research efforts.
The laboratory itself will be created by deploying a carefully crafted data grid technology base
across an international set of sites, each of which provides substantial computing and storage capability accessible via iVDGL software. The 40+ sites, of varying sizes, will include U.S. sites
put in place specifically for the laboratory; sites contributed by EU, Japanese, Australian, and
potentially other international collaborators; existing facilities that are owned and managed by
the scientific collaborations; and facilities placed at outreach institutions. These sites will be
connected by very high speed national and transoceanic networks. Several Grid Operations Centers (GOCs), located in different countries, will provide the essential management and coordination elements required to ensure overall functionality and to reduce operational overhead on resource centers. The U.S. portion of iVDGL was funded by NSF for $13.65M for the period
2001–2006.
The DataTAG project, recently funded by the European Commission for 4M Euros, will complement iVDGL by providing a very high-performance transatlantic research network operating initially at 2.5 Gbits/sec. Its goal is to create a large-scale intercontinental Grid testbed involving the EU DataGrid [8] project, several national projects in Europe, and related Grid projects in the USA. This will allow exploration of advanced networking technologies and interoperability issues between different Grid domains.
7 Common Infrastructure
Given the international nature of the experiments participating in these projects (some of them, like the LHC experiments, are participating in several projects), there is widespread recognition by scientists in these projects of the importance of developing common protocols and tools to enable inter-Grid operation and avoid costly duplication. Scientists from several of these projects are already working together to coordinate their architectures and middleware choices, as the common adoption of the Globus toolkit illustrates.
8 Summary
Data Grid technologies embody entirely new approaches to the analysis of large data collections,
in which the resources of an entire scientific community are brought to bear on the analysis and
discovery process, and data products are made available to all community members, regardless
of location. Large interdisciplinary efforts recently funded in the U.S. and EU have begun research and development of the basic technologies required to create working Data Grids. Over
the coming years, they will deploy, evaluate, and optimize Data Grid technologies on a production scale, and integrate them into production applications. The experience gained with these
new information infrastructures, providing transparent managed access to massive distributed
data collections, will be applicable to large-scale data-intensive problems in a wide spectrum of
scientific and engineering disciplines, and eventually in industry and commerce. Such systems
will be needed in the coming decades as a central element of our information-based society.
References
1  Ian Foster and Carl Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999.
2  Formerly known by the name “Computational Grid”, the term “Grid” reflects the fact that the resources to be shared may be quite heterogeneous and have little to do with computing per se.
3  I. Foster, C. Kesselman, S. Tuecke, “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, International Journal of High Performance Computing Applications, 15(3), 200-222, 2001, http://www.globus.org/anatomy.pdf.
4  The total required CPU resources factor in inefficiencies, additional reconstruction passes and simulations.
5  MONARC home page, http://monarc.web.cern.ch/MONARC/.
6  GriPhyN home page, http://www.griphyn.org/.
7  PPDG home page, http://www.ppdg.net/.
8  EU DataGrid home page, http://www.eu-datagrid.org/.
9  See the ICFA Network Task Force Requirements Report (1998), http://l3www.cern.ch/~newman/icfareq98.html.
10 The U.S. CMS example of the hardware, services and manpower requirements for a Tier1 center, including a work breakdown structure for the Tier1 facility at Fermilab, may be found in the November 2000 report by V. O’Dell et al., “US-CMS User Facility Subproject”, http://home.fnal.gov/~bauerdic/uscms/scp/baserev_nov_2000/rcv304.pdf.
11 VRVS home page, http://vrvs.caltech.edu.
12 Livny, M., High-Throughput Resource Management. In Foster, I. and Kesselman, C., eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999, 311-337.
13 Moore, R., Baru, C., Marciano, R., Rajasekar, A. and Wan, M., Data-Intensive Computing. In Foster, I. and Kesselman, C., eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999, 105-129.
14 Johnston, W.E., Gannon, D. and Nitzberg, B., Grids as Production Computing Environments: The Engineering Aspects of NASA’s Information Power Grid. In Proc. 8th IEEE Symposium on High Performance Distributed Computing, 1999, IEEE Press.
15 Stevens, R., Woodward, P., DeFanti, T. and Catlett, C., From the I-WAY to the National Technology Grid. Communications of the ACM, 40(11):50-61, 1997.
16 Information Power Grid home page, http://www.ipg.nasa.gov/.
17 Beiriger, J., Johnson, W., Bivens, H., Humphreys, S. and Rhea, R., Constructing the ASCI Grid. In Proc. 9th IEEE Symposium on High Performance Distributed Computing, 2000, IEEE Press.
18 Fischer, P., McKay, T.A., Sheldon, E., Connolly, A., Stebbins, A. and the SDSS collaboration, “Weak Lensing with SDSS Commissioning Data: The Galaxy-Mass Correlation Function to 1/h Mpc”, Astron. J., in press, 2000.
19 Luppino, G.A. and Kaiser, N., Ap. J. 475, 20, 1997.
20 Tyson, J.A., Kochanek, C. and Dell’Antonio, I.P., Ap. J. 498, L107, 1998.
21 PPDG home page, http://www.ppdg.net/.
22 Earth System Grid home page, http://www.earthsystemgrid.org/.
23 GriPhyN Project home page, http://www.griphyn.org/.
24 TeraGrid home page, http://www.teragrid.org/.
25 International Virtual Data Grid Laboratory home page, http://www.ivdgl.org/.
26 Globus home page, http://www.globus.org/.
27 NCSA home page, http://www.ncsa.edu/.
28 These applications are described more fully at http://www.teragrid.org/about_faq.html.
29 The ATLAS Experiment, A Toroidal LHC ApparatuS, http://atlasinfo.cern.ch/Atlas/Welcome.html.
30 The CMS Experiment, A Compact Muon Solenoid, http://cmsinfo.cern.ch/Welcome.html.
31 BaBar, http://www.slac.stanford.edu/BFROOT.
32 D0, http://www-d0.fnal.gov/.
33 STAR, http://www.star.bnl.gov/.
34 Jlab experiments, http://www.jlab.org/.
35 SAM, http://d0db.fnal.gov/sam/.
36 LIGO home page, http://www.ligo.caltech.edu/.
37 SDSS home page, http://www.sdss.org/.