Paul Avery
Nov. 10, 2001
Version 5

Data Grids: A New Computational Infrastructure for Data Intensive Science

Abstract

Twenty-first century scientific and engineering enterprises are increasingly characterized by their geographic dispersion and their reliance on large data archives. These characteristics bring with them unique challenges. First, the increasing size and complexity of modern data collections require significant investments in information technologies to store, retrieve and analyze them. Second, the increased distribution of people and resources in these projects has made resource sharing and collaboration across significant geographic and organizational boundaries critical to their success. In this paper I explore how computing infrastructures based on Data Grids offer data intensive enterprises a comprehensive, scalable framework for collaboration and resource sharing. A detailed example of a Data Grid framework is presented for a Large Hadron Collider experiment, where a hierarchical set of laboratory and university resources comprising petaflops of processing power and a multi-petabyte data archive must be efficiently utilized by a global collaboration. The experience gained with these new information systems, providing transparent managed access to massive distributed data collections, will be applicable to large-scale data-intensive problems in a wide spectrum of scientific and engineering disciplines, and eventually in industry and commerce. Such systems will be needed in the coming decades as a central element of our information-based society.

Keywords: Grids, data, virtual data, data intensive, petabyte, petascale, petaflop, virtual organization, GriPhyN, PPDG, CMS, LHC, CERN.

Contents

1 Introduction
2 Data Intensive Activities
3 Data Grids and Data Intensive Sciences
4 Data Grid Development for the Large Hadron Collider
  4.1 The CMS Data Grid
  4.2 The Tier Concept
  4.3 Elaboration of the CMS Data Grid Model
  4.4 Advantages of the CMS Data Grid Model
5 Data Grid Architecture
  5.1 Globus-Based Infrastructure
  5.2 Virtual Data
  5.3 Development of Data Grid Architecture
6 Examples of Major Data Grid Efforts Today
  6.1 The TeraGrid
  6.2 Particle Physics Data Grid
  6.3 GriPhyN Project
  6.4 European Data Grid
  6.5 CrossGrid
  6.6 International Virtual Data Grid Laboratory and DataTAG
7 Common Infrastructure
8 Summary
References
1 Introduction

Twenty-first century scientific and engineering enterprises are increasingly characterized by their geographic dispersion and their reliance on large data archives. These characteristics bring with them unique challenges. First, the increasing size and complexity of modern data collections require significant investments in information technologies to store, retrieve and analyze them. Second, the increased distribution of people and resources in these projects has made resource sharing and collaboration across significant geographic and organizational boundaries critical to their success.

Infrastructures known as "Grids" [1,2] are being developed to address the problem of resource sharing. An excellent introduction to Grids can be found in the article "The Anatomy of the Grid" [3], which provides the following description:

"The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form what we call a virtual organization (VO)."

The existence of very large distributed data collections adds a significant new dimension to enterprise-wide resource sharing, and has led to a substantial research and development effort on "Data Grid" infrastructures capable of supporting this more complex collaborative environment. This work has taken on more urgency for new scientific collaborations, which in some cases will reach global proportions and share data archives with sizes measured in dozens or even hundreds of petabytes within a decade. These collaborations have recognized the strategic importance of Data Grids for realizing the scientific potential of their experiments, and have begun working with computer scientists, members of other scientific and engineering fields and industry to research and develop this new technology and create production-scale computational environments. Figure 1 shows a U.S.-based Data Grid consisting of a number of heterogeneous resources.
My aim in this paper is to review Data Grid technologies and how they can benefit data intensive sciences. Developments in industry are not included here, but since most Data Grid work is presently carried out to address the urgent data needs of advanced scientific experiments, the omission is not a serious one. (The problems solved while dealing with these experiments will in any case be of enormous benefit to industry in a short time.) Furthermore, I will concentrate on those projects which are developing Data Grid infrastructures for a variety of disciplines, rather than "vertically integrated" projects that benefit a single experiment or discipline, and explain the specific challenges faced by those disciplines.

2 Data Intensive Activities

The number and diversity of data intensive projects is expanding rapidly. The following survey of projects, while incomplete, shows the scope of data intensive methods and the immense interest in applying them to scientific problems.

Physics and space sciences: High energy and nuclear physics experiments at accelerator laboratories at Fermilab, Brookhaven and SLAC already generate dozens to hundreds of terabytes of colliding beam data per year that is distributed to and analyzed by hundreds of physicists around the world to search for subtle new interactions. Upgrades to these experiments and new experiments planned for the Large Hadron Collider at CERN will increase data rates to petabytes per year. Gravitational wave searches at LIGO, VIRGO and GEO will accumulate yearly samples of approximately 100 terabytes of mostly environmental and calibration data that must be correlated and filtered to search for rare gravitational events. New multi-wavelength all-sky surveys utilizing telescopes instrumented with gigapixel CCD arrays will soon drive yearly data collection rates from terabytes to petabytes. Similarly, remote-sensing satellites operating at multiple wavelengths will generate several petabytes of spatio-temporal data that can be studied by researchers to accurately measure changes in our planet's support systems.

Biology and medicine: Biology and medicine are rapidly increasing their dependence on data intensive methods. New generation X-ray sources coupled with improved data collection methods are expected to generate copious data samples from biological specimens with ultra-short time resolutions. Many organism genomes are being sequenced by extremely fast "shotgun" methods requiring enormous computational power. The resulting flow of genome data is rising exponentially, with standard databases doubling in size every few months. Sophisticated searches of these databases will require much larger computational resources as well as new statistical methods and metadata to keep up with the flow of data. Proteomics, the study of protein structure and function, is expected to generate enormous amounts of data, easily dwarfing the data samples obtained from genome studies. When applied to protein-protein interactions, which are of extreme importance to drug designers, these studies will require additional orders of magnitude increases in computational capacity (to hundreds of petaflops) and storage sizes (to thousands of petabytes).
In medicine, a single three dimensional brain scan can generate a significant fraction of a terabyte of data, while systematic adoption of digitized radiology scans will produce dozens of petabytes of data that can be quickly accessed and searched for breast cancer and other diseases. Exploratory studies have shown the value of converting patient records to electronic form and attaching digital CAT scans, X-ray charts and other instrument data, but systematic use of such methods would generate databases many petabytes in size. Medical data pose additional ethical and technical challenges, stemming from exacting security restrictions on data access and patient identification.

Computer simulations: Advances in information technology in recent years have given scientists and engineers the ability to develop sophisticated simulation and modeling techniques for improved understanding of the behavior of complex systems. When coupled to the huge processing power and storage resources available in supercomputers or large computer clusters, these advanced simulation and modeling methods become tools of rare power, permitting detailed and rapid studies of physical processes while sharply reducing the need to conduct lengthy and costly experiments or to build expensive prototypes. The following examples provide a hint of the potential of modern simulation methods. High energy and nuclear physics experiments routinely generate simulated datasets whose size (in the multi-terabyte range) is comparable to, and sometimes exceeds, the raw data collected by the same experiment. Supercomputers generate enormous databases from long-term simulations of climate systems with different parameters that can be compared with one another and with remote satellite sensing data. Environmental modeling of bays and estuaries using fine-scale fluid dynamics calculations generates massive datasets that permit the calculation of pollutant dispersal scenarios under different assumptions that can be compared with measurements. These projects also have geographically distributed user communities who must access and manipulate these databases.

3 Data Grids and Data Intensive Sciences

To develop the argument that Data Grids offer a comprehensive solution to data intensive activities, I first summarize some general features of Grid technologies. These technologies comprise a mixture of protocols, services, and tools that are collectively called "middleware", reflecting the fact that they are accessed by "higher level" applications or application tools while they in turn invoke processing, storage, network and other services at "lower" software and hardware levels. Grid middleware includes security and policy mechanisms that work across multiple institutions; resource management tools that support access to remote information resources and simultaneous allocation ("co-allocation") of multiple resources; general information protocols and services that provide important status information about hardware and software resources, site configurations, and services; and data management tools that locate and transport datasets between storage systems and applications.

The diagram in Figure 2 outlines in a simple way the roles played by these various Grid technologies. The lowest level Fabric contains shared resources such as computer clusters, data storage systems, catalogs, networks, etc. that Grid tools must access and manipulate.
The Resource and Connectivity layers provide, respectively, access to individual resources and the communication and authentication tools needed to communicate with them. Coordinated use of multiple resources, possibly at different sites, is handled by Collective protocols, APIs, and services. Applications and application toolkits utilize these Grid services in myriad ways to provide "Grid-aware" services for members of a particular virtual organization. A much more detailed explication of Grid architecture can be found in reference [3].

While standard Grid infrastructures provide distributed scientific communities the ability to collaborate and share resources, additional capabilities are needed to cope with the specific challenges associated with scientists accessing and manipulating very large distributed data collections. These collections, ranging in size from terabytes (TB) to petabytes (PB), comprise raw (measured) data and many levels of processed or refined data, as well as comprehensive metadata describing, for example, under what conditions the data was generated or collected, how large it is, etc. New protocols and services must facilitate access to significant tertiary (e.g., tape) and secondary (disk) storage repositories to allow efficient and rapid access to primary data stores, while taking advantage of disk caches that buffer very large data flows between sites. They also must make efficient use of high performance networks that are critically important for the timely completion of these transfers. Thus, transporting 10 TB of data to a computational resource in a single day requires a 1 gigabit per second network operated at 100% utilization. Efficient use of these extremely high network bandwidths also requires special software interfaces and programs that in most cases have yet to be developed.

The computational and data management problems encountered in data intensive research include the following challenging aspects:

Computation-intensive as well as data-intensive: Analysis tasks are compute-intensive and data-intensive and can involve hundreds or even thousands of computer, data handling, and network resources. The central problem is coordinated management of computation and data, not just data curation and movement.

Need for large-scale coordination without centralized control: Rigorous performance goals require coordinated management of numerous resources, yet these resources are, for both technical and strategic reasons, highly distributed and not always amenable to tight centralized control.

Large dynamic range in user demands and resource capabilities: It must be possible to support and arbitrate among a complex task mix of experiment-wide, group-oriented, and (perhaps thousands of) individual activities, using I/O channels, local area networks, and wide area networks that span several distance scales.

Data and resource sharing: Large dynamic communities would like to benefit from the advantages of intra- and inter-community sharing of data products and the resources needed to produce and store them.

The "Data Grid" has been introduced as a unifying concept to describe the new technologies required to support such next-generation data-intensive applications, technologies that will be critical to future data-intensive computing in the many areas of science and commerce in which sophisticated software must harness large amounts of computing, communication and storage resources to extract information from data.
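As a rough illustration of the transfer arithmetic quoted earlier in this section (moving 10 TB to a computational site within a single day requires roughly a fully utilized 1 Gb/s link), the following minimal Python sketch computes the sustained bandwidth such a transfer implies. The figures are the ones from the text, not measurements of any particular network.

```python
def required_bandwidth_gbps(data_terabytes: float, hours: float) -> float:
    """Sustained bandwidth (gigabits per second) needed to move `data_terabytes`
    of data within `hours`, assuming 100% link utilization and decimal units."""
    bits = data_terabytes * 1e12 * 8      # terabytes -> bits
    seconds = hours * 3600.0
    return bits / seconds / 1e9           # bits per second -> gigabits per second

# The example from the text: 10 TB in one day needs roughly 1 Gb/s.
print(f"{required_bandwidth_gbps(10, 24):.2f} Gb/s")   # ~0.93 Gb/s
```

In practice, protocol overheads and imperfect utilization push the requirement somewhat higher, which is why the text speaks of a network operated at essentially full utilization.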
Data Grids are typically characterized by the following elements: (1) they have large extent (national and even global) and scale (many resources and users); (2) they layer sophisticated new services on top of existing local mechanisms and interfaces, facilitating coordinated sharing of remote resources; and (3) they provide a new dimension of transparency in how computational and data processing are integrated to provide data products to user applications. This transparency is vitally important for sharing heterogeneous distributed resources in a manageable way, a point to which I return later.

4 Data Grid Development for the Large Hadron Collider

I now turn to a particular experimental activity, high energy physics at the Large Hadron Collider (LHC) at CERN, to provide a concrete example of how a Data Grid computing framework might be implemented to meet the demanding computational and collaborative needs of a data intensive experiment. The extreme requirements of experiments at the LHC, due to commence operations in 2006, have been known to physicists for several years. Distributed architectures based on Data Grid technologies have been proposed as solutions, and a number of initiatives are now underway to develop implementations at increasing levels of scale and complexity. The particular architectural solution shown here is not necessarily the most effective or efficient one for other disciplines, but it does offer some valuable insights about the technical and even political merits of Data Grids when applied to large-scale problems.

For definiteness, and without sacrificing much generality, I focus on the CMS [30] experiment at the LHC. CMS faces computing challenges of unprecedented scale in terms of data volume, processing requirements, and the complexity and distributed nature of the analysis and simulation tasks among thousands of physicists worldwide. The detector will record events (an "event" is the collision of two energetic beam particles that produces dozens or hundreds of secondary particles) at a rate of approximately 100 Hz, accumulating 100 MB/sec of raw data, or several petabytes per year of raw and processed data in the early years of operation. The data storage rate is expected to grow in response to the pressures of higher luminosity, new physics triggers and better storage capabilities, leading to data collections of approximately 20-30 PB by 2010, rising to several hundred petabytes over the following decade. The computational resources required to reconstruct and simulate this data are similarly vast. Each raw event will be approximately 1 MB in size and require roughly 3000 SI95-sec (SI95 is an abbreviation for SpecInt95, a standard unit of processor non-floating-point speed) to reconstruct and 5000 SI95-sec to fully simulate. Estimates based on the initial 100 Hz data rate have yielded a total computing capability of approximately 1800K SI95 in the first year of operation, rising rapidly to keep up with the total accumulated data. Approximately one third of this capability will reside at CERN and the remaining two thirds will be provided by the member nations [4]. To put the numbers in context, 1800K SI95 is roughly equivalent to 60,000 of today's high-end PCs.

4.1 The CMS Data Grid

The challenge facing CMS is how to build an infrastructure that will provide these computational and storage resources and permit their effective and efficient use by a scientific community of several thousand physicists spread across the globe.
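As a rough cross-check of the rates quoted above (100 Hz of roughly 1 MB events, about 3000 SI95-sec to reconstruct each one), the following back-of-the-envelope sketch reproduces the scale of the numbers. The 10^7 seconds of live data taking per year is a common rule of thumb for collider experiments, assumed here rather than taken from the text.

```python
# Back-of-the-envelope estimate of CMS data and CPU needs from the quoted rates.
EVENT_RATE_HZ   = 100      # events recorded per second
EVENT_SIZE_MB   = 1.0      # approximate raw event size
RECO_SI95_SEC   = 3000     # CPU (SpecInt95-seconds) to reconstruct one event
LIVE_SEC_PER_YR = 1e7      # assumed live data-taking seconds per year (rule of thumb)

raw_data_pb_per_year = EVENT_RATE_HZ * EVENT_SIZE_MB * LIVE_SEC_PER_YR / 1e9
realtime_reco_si95   = EVENT_RATE_HZ * RECO_SI95_SEC

print(f"raw data per year        : ~{raw_data_pb_per_year:.0f} PB")        # ~1 PB
print(f"real-time reconstruction : ~{realtime_reco_si95 / 1e3:.0f}K SI95")  # ~300K SI95
# The 1800K SI95 first-year total quoted in the text is several times larger because
# it also covers simulation, additional reconstruction passes, analysis and
# inefficiencies (see reference [4]).
```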
Simple scaling arguments, as well as more sophisticated studies using the MONARC [5] simulation package, have shown that a distributed infrastructure based on regional centers provides the most effective technical solution, since large geographical clusters of users are close to the datasets and resources that they employ. A distributed configuration is also preferred from a political perspective, since it allows local control of resources and some degree of autonomy in pursuing physics objectives. As discussed in the previous section, the distributed computing model can be made even more productive by arranging CMS resources as a Grid, specifically a Data Grid in which large computational resources and massive data collections (including cached copies and data catalogs), linked by very high-speed networks and sophisticated middleware software, form a single computational resource accessible from any location.

The Data Grid framework is expected to play a key role in realizing CMS' scientific potential by transparently mobilizing large-scale computing resources for large-scale computing tasks and by providing a collaboration-wide computing fabric that permits full participation in the CMS research program by physicists at their home institutes. This latter point is particularly relevant for participants in remote or distant regions. As a result, a highly distributed, hierarchical computing infrastructure exploiting Grid technologies is a central element of the CMS worldwide computing model.

It should be noted that while LHC data taking will not begin until 2006, CMS already has large and highly distributed computing and software operations. These operations serve immediate and near-term needs such as test beam data analysis, detector performance and physics simulation studies that support detector design and optimization, software development and associated scalability studies, and "Data Challenges" involving high-throughput, high-volume stress tests of offline processing software and facilities. These early needs require Grid R&D, software development, engineering tests and staged deployment of significant computing resources (including prototypes) to take place throughout the 2001–2006 period.

The CMS software and computing resources will be coherently shared and managed as a Data Grid, following the ideas developed by the GriPhyN [6], Particle Physics Data Grid [7] (PPDG) and European DataGrid [8] projects. The distributed nature of the system makes it possible to take advantage of the funding and manpower resources available in each region, to balance the proximity of large datasets to appropriately large centralized processing resources against the proximity of smaller, frequently accessed data to the end users, and to exploit support and expertise in different time zones for efficient data analysis.

4.2 The Tier Concept

The computing resources of CMS will be arranged in a "hierarchy" of five levels, interconnected by high-speed regional, national and international networks:

Tier0: The central facility at CERN where the experimental data is taken, and where all the raw data is stored and initially reconstructed.

Tier1: A major national center supporting the full range of computing, data handling and support services required for effective analysis by a community of several hundred physicists. Much of the re-reconstruction as well as the analysis of the data will be carried out at these centers, which for the most part will be located at national laboratories.
Tier2: A smaller system supporting analysis and reconstruction on demand by a community of typically 30–50 physicists, sited at a university or research laboratory.

Tier3: A workgroup cluster specific to a university department or a single high energy physics group, of the sort traditionally used to support local needs for data analysis.

Tier4: An access device such as an individual user's desktop, laptop or even mobile device.

Figure 3 shows a simplified view of the architecture of the CMS Data Grid system and the data rates involved. The general concept of the distributed computing model is well illustrated here, though the network bandwidths shown are approximate and are the subject of ongoing study [9]. The partitioning of resources is expected to be roughly 1:1:1 among the CERN Tier0 center, the sum of the Tier1 centers, and the sum of the larger number of Tier2 centers. In practice this means that each Tier1 should have approximately 15-20 percent of the capability of the Tier0 and each Tier2 should have approximately 20 percent of the capability of a Tier1. It is expected that the worldwide CMS distributed computing system will have approximately 5-6 Tier1 laboratories.

4.3 Elaboration of the CMS Data Grid Model

The purpose of the Tier1 and Tier2 centers is to provide the enabling infrastructure to allow physicists in each world region to participate fully in the physics program of their experiment, including when they are at their home institutions. The enabling infrastructure consists of the software, computing and data handling hardware, and the support services needed to access, analyze, manage and understand the experimental data. The Tier1 national centers are foreseen to provide a full range of data handling as well as computational and user support services, and each will serve a community of approximately 200-500 physicists [10]. The Tier1 centers will be sited in the U.S., France, Germany, Italy, the UK, and Russia. Unlike the U.S. center, the European sites will serve multiple LHC experiments.

Several models for Tier2 operation have been discussed. Typically, a Tier2 center would be configured to serve a physics community of roughly 30–50 physicists, using a cost-effective cluster of rack-mounted computational devices with sufficient memory to run large applications, interconnected with commodity (e.g., Ethernet) networking, for simulation production, reconstruction and analysis. Tier2 sites will be capable of contributing to the distributed production of reconstructed data on a small to medium scale, and will require relatively modest local manpower for system operation. Data that is to be served with high availability and performance is stored in multi-terabyte RAID arrays. A small tape library is used for tape import/export, and may also be used to back up data and software, especially if there is no large archival tape system at the Tier2 site. Tier2 centers will, except in a few cases, lack the full range of data handling and support services that are required at a Tier1 center. Each center may help to serve one region or a functional subgroup within a large country, working in partnership with the national Tier1 center, or it may serve a small country whose national needs do not require a Tier1.
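The capacity partitioning described in Section 4.2 fixes the rough scale of these centers. The short sketch below works out the per-tier fractions implied by the 1:1:1 split; the choice of six Tier1 centers and a 20% Tier2-to-Tier1 ratio are illustrative values consistent with the text, not official CMS planning numbers.

```python
# Per-tier capacities implied by the roughly 1:1:1 Tier0 : sum(Tier1) : sum(Tier2)
# split of Section 4.2, with capacities expressed as fractions of the Tier0 capacity.
TIER0 = 1.0
N_TIER1 = 6                    # the text anticipates roughly 5-6 Tier1 centers
TIER1 = TIER0 / N_TIER1        # ~17% of Tier0, within the quoted 15-20% range
TIER2 = 0.20 * TIER1           # each Tier2 is ~20% of a Tier1

# If the Tier2 centers together also match the Tier0 capacity, the implied
# worldwide number of Tier2 centers is:
n_tier2 = TIER0 / TIER2
print(f"each Tier1 ~ {TIER1:.0%} of Tier0, each Tier2 ~ {TIER2:.1%} of Tier0, "
      f"implying ~{n_tier2:.0f} Tier2 centers worldwide")
```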
A strong design consideration is that Tier2 sites should have the flexibility to respond more quickly than the production-oriented Tier1 laboratories to software tests and changes in physics priorities, and that their hardware configurations should be chosen so as to be manageable by a small support team. A Tier3 site consists of a workgroup cluster specific to a university department or a single high energy physics group, of the sort traditionally used to support local needs for data analysis, typically serving 5–15 physicists. The scale of these systems is expected to be such that their use for organized production may be limited in general (though some universities may have substantial resources that could be harnessed for large productions), and restricted to low-I/O applications such as simulations and small analyses.

4.4 Advantages of the CMS Data Grid Model

The Tier2 centers in the distributed computing model bring with them a number of other advantages, apart from that of gathering and managing resources worldwide to meet the aggregate computing needs. First, by balancing on-demand local use and centrally coordinated "production" use, each experiment can ensure that individual scientists and small workgroups have the means at their disposal to develop new software and new lines of analysis efficiently. In many cases, siting frequently accessed data close to the users leads to more rapid turnaround, as a result of the higher throughput and shorter delays achievable over local area and/or shorter wide area network links. Second, situating Tier2 centers at universities and multi-purpose research laboratories combines the research and education missions, making students and young scientists part of the ongoing process of exploration and discovery, including many who will rarely be able to visit the central laboratory. Third, placing one or more appropriately scaled centers in each world region will broaden the scope of the analysis and increase the number of intellectual focal points for student-faculty interaction and mentoring. The distributed nature of the Tier2s, and their flexible balance between on-demand use by local and regional groups and organized use coordinated by the Tier1 center, will also lead to new modes of daily partnership between the universities and laboratories, with greater continuity and less reliance on people's travel schedules, as well as more effective and creative collaborative work among the members of small teams.

Finally, CMS plans to deploy next-generation remote collaborative systems [11] at each participating institution as a way of ensuring coherency among research efforts. These systems include videoconferencing, but also encompass a wider set of tools such as document sharing, whiteboards, remote test beam and instrument control, and remote visualization. Such systems and the protocols needed to support them are under active development by research groups and industry and are being deployed in a variety of settings. The group-oriented analysis activities surrounding periodic "data challenges" and other experiment-wide milestones will help CMS optimize these collaborative systems prior to the arrival of real data in 2006.
5 Data Grid Architecture

5.1 Globus-Based Infrastructure

The Globus Project [26], a joint research and development effort of Argonne National Laboratory, the Information Sciences Institute of the University of Southern California, and the University of Chicago, has over the past several years developed the most comprehensive and widely used Grid framework available today. The widespread adoption of Globus technologies is due in large measure to the fact that its members work closely with a variety of scientific and engineering projects while maintaining compatibility with other computer science toolkits used by these projects. Globus tools are being studied, developed, and enhanced at institutions worldwide to create new Grids and services, and to conduct computing research.

Globus components provide the capabilities to create Grids of computing resources and users; track the capabilities of resources within a Grid; specify the resource needs of users' computing tasks; mutually authenticate both users and resources; and deliver data to and from remotely executed computing tasks. Globus is distributed in a modular and open toolkit form, facilitating the incorporation of additional services into scientific environments and applications. For example, Globus has been integrated with technologies such as the Condor high-throughput computing environment [12,13] and the Portable Batch System (PBS) job scheduler [14]. Both of these integrations demonstrate the power and value of the open protocol toolkit approach. Projects such as the NSF's National Technology Grid [15], NASA's Information Power Grid [16] and DOE ASCI's Distributed Resource Management Project [17] have provided considerable experience with the creation of production infrastructures.

5.2 Virtual Data

Virtual data is of key importance for exploiting the full potential of Data Grids. The concept actually refers to two related ideas: transparency with respect to location, where data might be stored in a number of different locations, including caches, to improve access speed and/or reliability, and transparency with respect to materialization, where the data can be computed by an algorithm, to facilitate the definition, sharing, and use of data derivation mechanisms. Both characteristics enable the definition and delivery of a potentially unlimited virtual space of data products derived from other data. In this virtual space, requests can be satisfied via direct retrieval of materialized products and/or computation, with local and global resource management, policy, and security constraints determining the strategy used. The concept of virtual data recognizes that all except irreproducible raw experimental data need 'exist' physically only as the specification for how they may be derived. Grid infrastructures may instantiate zero, one, or many copies of derivable data depending on probable demand and the relative costs of computation, storage, and transport. On a much smaller scale, this dynamic processing, construction, and delivery of data is precisely the strategy used to generate much, if not most, of the web content delivered in response to queries today. Figure 4 provides a simple illustration of how virtual data might be computed or fetched from local, regional or large computing facilities.
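The essential decision, retrieve a materialized product if one exists and otherwise derive it from its recorded recipe, can be sketched in a few lines of Python. This is only a toy illustration of the idea: the class and function names are hypothetical and do not correspond to any actual Data Grid toolkit, and a real system would add cost models, policy, security and distributed replica catalogs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Derivation:
    """Recipe recording how a data product can be (re)computed from its inputs."""
    transformation: Callable[..., bytes]   # application that produces the product
    inputs: List[str]                      # names of the input data products

@dataclass
class VirtualDataCatalog:
    """Toy catalog tracking materialized products and recipes for derived ones."""
    materialized: Dict[str, bytes] = field(default_factory=dict)
    derivations: Dict[str, Derivation] = field(default_factory=dict)

    def request(self, name: str) -> bytes:
        # 1. If the product already exists somewhere (disk cache, archive), fetch it.
        if name in self.materialized:
            return self.materialized[name]
        # 2. Otherwise look up its derivation, satisfy the inputs recursively,
        #    run the transformation, and decide whether to keep a copy.
        recipe = self.derivations.get(name)
        if recipe is None:
            raise KeyError(f"{name}: neither materialized nor derivable")
        inputs = [self.request(dep) for dep in recipe.inputs]
        product = recipe.transformation(*inputs)
        self.materialized[name] = product   # policy choice: cache the result
        return product
```

In a production Data Grid the same request would additionally trigger replica location, resource co-allocation and scheduling, which is exactly the planning problem illustrated by the SDSS example that follows.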
Consider an astronomer using SDSS to investigate correlations in galaxy orientation due to lensing effects by intergalactic dark matter [18,19,20]. A large number of galaxies, some 10^7, must be analyzed to get good statistics, with careful filtering to avoid bias. For each galaxy, the astronomer must first obtain an image, a few pixels on a side; process it in a computationally intensive analysis; and store the results. Execution of this request involves virtual data catalog accesses to determine whether the required analyses have been previously constructed. If they have not, the catalog must be accessed again to locate the applications needed to perform the transformation and to determine whether the required raw data is located in network cache, remote disk systems, or deep archive. Appropriate computer, network, and storage resources must be located and applied to access and transfer raw data and images, produce the missing images, and construct the desired result. The execution of this single request may involve thousands of processors and the movement of terabytes of data among archives, disk caches, and computer systems nationwide.

Virtual data grid technologies are expected to be of immediate benefit to numerous other scientific and engineering application areas. For example, NSF and NIH fund scores of X-ray crystallography labs that together are generating many terabytes (soon petabytes) of molecular structure data each year. Only a small fraction of this data is being shared via existing publication mechanisms. Similar observations can be made concerning long-term seismic data generated by geologists, data synthesized from studies of the human genome database, brain imaging data, output from long-duration, high-resolution climate model simulations, and data produced by NASA's Earth Observing System.

5.3 Development of Data Grid Architecture

As described in the next section, a number of scientific collaborations are developing and deploying Data Grids, a fact that has led Grid researchers to look at general approaches to developing Data Grid architectures. A Data Grid architecture must have a highly modular structure, defining a variety of components, each with its own protocol and/or application programmer interfaces (APIs). These various components also should be designed for use in an integrated fashion, as part of a comprehensive Data Grid architecture. The modular structure facilitates integration with discipline-specific systems that already have substantial investments in data management systems, a consequence of their need to provide high-performance data services to their users.

6 Examples of Major Data Grid Efforts Today

I describe in this section some of the major projects that have been undertaken to develop Data Grids. Without exception, these efforts have been driven by the overwhelming and near-term needs of current and planned scientific experiments, and have led to fruitful collaborations between application scientists and computer scientists. High-energy physics has been at the forefront of this activity, a fact that can be attributed to its historical standing as both a highly data intensive and highly collaborative discipline. As noted earlier, however, the existing high-energy physics computing infrastructures are not scalable to the upgraded experiments at SLAC, Fermilab and Brookhaven, and the new experiments at the LHC, which will generate petabytes of data per year and be analyzed by global collaborations.
Participants in other planned experiments in nuclear physics, gravitational research, large digital sky surveys, and virtual astronomical observatories, fields which also face challenges associated with massive distributed data collections and a dispersed user community, have also decided to adopt Data Grid computing infrastructures. Scientists from these and other disciplines are exploiting new initiatives and partnering with computer scientists and each other to develop production-scale Data Grids. A tightly coordinated set of projects has been established that together are developing and applying Data Grid concepts to problems of tremendous scientific importance, in such areas as high energy physics, nuclear physics, astronomy, bioinformatics and climate science. These projects include (1) the Particle Physics Data Grid (PPDG [21]), which is focused on the application of Data Grid concepts to the needs of a number of U.S.-based high energy and nuclear physics experiments; (2) the Earth System Grid project [22], which is exploring applications in climate and specific technical problems relating to request management; (3) the GriPhyN [23] project, which plans to conduct extensive computer science research on Data Grid problems and develop general tools that will support the automatic generation and management of derived, or “virtual”, data for four leading experiments in high energy physics, gravitational wave searches and astronomy; (4) the European Data Grid (EDG) project, which aims to develop an operational Data Grid infrastructure supporting high energy physics, bioinformatics, and satellite sensing; (5) the TeraGrid [24], which will provide a massive distributed computing and data resource connected by ultra-high speed optical networks; and (6) the International Virtual Data Grid Laboratory (iVDGL [25]), which will provide a worldwide set of resources for Data Grid tests by a variety of disciplines. These projects have adopted the Globus [26] toolkit for their basic Grid infrastructure to speed the development of Data Grids. The Globus directors have played a leadership role in establishing a broad national—and indeed international—consensus on the importance of Data Grid concepts and on specifics of a Data Grid architecture. I will discuss these projects in varying levels of detail in the rest of this section while leaving coordination and common architecture issues for the following section. 6.1 The TeraGrid The TeraGrid Project [24] was recently funded by the National Science Foundation for $53M over 3 years to construct a distributed supercomputing facility at four sites: the National Center for Supercomputing Applications [27] in Illinois, the San Diego Supercomputing Center, Caltech’s Center for Advanced Computational Research and Argonne National Laboratory. The project aims to build and deploy the world's largest, fastest, most comprehensive, distributed infrastructure for open scientific research. When completed, the TeraGrid will include 13.6 teraflops of Linux cluster computing power distributed at the four TeraGrid sites, facilities capable of managing and storing more than 450 terabytes of data, high-resolution visualization environments, and toolkits for Grid computing. These components will be tightly integrated and connected through an optical network that will initially operate at 40 gigabits per second and later be upgraded to 50-80 gigabits/second—an order of magnitude beyond today's fastest research network. 
TeraGrid aims to partner with other Grid projects and has active plans to help several discipline sciences deploy their applications, including dark matter calculations, weather forecasting, biomolecular electrostatics, and quantum molecular calculations [28].

6.2 Particle Physics Data Grid

The Particle Physics Data Grid (PPDG) is a collaboration of computer scientists and physicists from six experiments who plan to develop, evaluate and deliver Grid-enabled tools for data-intensive collaboration in particle and nuclear physics. The project has been supported by the U.S. Department of Energy since 1999 and recently received over US$3.1M for 2001 (funding is expected to continue at a similar level for 2002-2003) to establish a "collaboratory pilot" to pursue these goals. The new three-year program will exploit the strong driving force provided by currently running high energy and nuclear physics experiments at SLAC, Fermilab and Brookhaven, together with recent advances in Grid middleware. Novel mechanisms and policies will be vertically integrated with Grid middleware and experiment-specific applications and computing resources to form effective end-to-end capabilities. PPDG's goals and plans are guided by the immediate, medium-term and longer-term needs and perspectives of the LHC experiments ATLAS [29] and CMS [30], which will run for at least a decade from late 2005, and by the research and development agenda of other Grid-oriented efforts. The project exploits the immediate needs of running experiments – BaBar [31], D0 [32], STAR [33] and experiments at Jlab [34] – to stress-test both concepts and software, in return for significant medium-term benefits. PPDG is actively involved in establishing the necessary coordination between potentially complementary data-grid initiatives in the US, Europe and beyond.

The BaBar experiment faces the challenge of data volumes and analysis needs planned to grow by more than a factor of 20 by 2005. During 2001, the CNRS-funded computer center at CCIN2P3 in Lyon, France, will join SLAC in contributing data analysis facilities to the fabric of the collaboration. The STAR experiment at RHIC has already acquired its first data and has identified Grid services as the most effective way to couple the facilities at Brookhaven with its second major center for data analysis at LBNL. An important component of the D0 fabric is the SAM [35] distributed data management system at Fermilab, to be linked to applications at major US and international sites. The LHC collaborations have identified data-intensive collaboratories as a vital component of their plan to analyze tens of petabytes of data in the second half of this decade. US CMS is developing a prototype worldwide distributed data production system for detector and physics studies.

6.3 GriPhyN Project

GriPhyN is a large collaboration of computer scientists and experimental physicists and astronomers who aim to provide the information technology (IT) advances required for petabyte-scale data intensive sciences and to implement production-scale Data Grids. Funded by the National Science Foundation at US$11.9M for 2000-2005, the project is driven by the requirements of four forefront experiments: the ATLAS [29] and CMS [30] experiments at the LHC, the Laser Interferometer Gravitational-wave Observatory (LIGO [36]) and the Sloan Digital Sky Survey (SDSS [37]).
These requirements, however, easily generalize to other sciences as well as 21st century commerce, so the GriPhyN team is pursuing IT advances centered on the creation of Petascale Virtual Data Grids (PVDGs) that meet the data-intensive computational needs of a diverse community of thousands of scientists spread across the globe. GriPhyN has adopted virtual data, described in Section 5.2 (transparency with respect to both location and materialization, enabling a potentially unlimited virtual space of data products that may be retrieved or recomputed as resource management, policy, and security constraints dictate), as the unifying theme for its investigations of Data Grid concepts and technologies.

In order to realize these concepts, GriPhyN is conducting research into virtual data cataloging, execution planning, execution management, and performance analysis issues (see Figure 5). The results of this research, together with other relevant technologies, are being integrated to form a Virtual Data Toolkit (VDT). Successive VDT releases will be applied and evaluated in the context of the four partner experiments. VDT 1.0 was released in October 2001 and the next release is expected early in 2002.

6.4 European Data Grid

The EU DataGrid project [8] is a European Union collaboration of computer scientists and researchers from the LHC experiments, molecular genomics and earth observation, whose goal is to research and deploy Data Grid technologies in support of European science. EU DataGrid received 9.8M Euros in funding for 2001–2004 to carry out its mission. The work is divided into 12 work packages, of which the first nine involve technologies, networks and testbeds relating to the LHC, and two others are related to biology applications and satellite remote sensing. The DataGrid initiative is led by CERN together with five other main partners and fifteen associated partners. The project brings together the following leading European research agencies: the European Space Agency (ESA), France's Centre National de la Recherche Scientifique (CNRS), Italy's Istituto Nazionale di Fisica Nucleare (INFN), the Dutch National Institute for Nuclear Physics and High Energy Physics (NIKHEF) and the UK's Particle Physics and Astronomy Research Council (PPARC). The fifteen associated partners come from the Czech Republic, Finland, France, Germany, Hungary, Italy, the Netherlands, Spain, Sweden and the United Kingdom. Many other projects funded by EU nations work closely with DataGrid, tremendously increasing its effectiveness.
EU DataGrid collaborates closely with the U.S. GriPhyN and PPDG projects.

6.5 CrossGrid

The CrossGrid project will develop, implement and exploit new Grid components for interactive compute- and data-intensive applications, including simulation and visualisation of surgical procedures, flooding simulations, team decision support systems, distributed data analysis in high-energy physics, air pollution and weather forecasting. The elaborated methodology, generic application architecture, programming environment, and new Grid services will be validated and tested thoroughly on the CrossGrid testbed, with an emphasis on a user-friendly environment. The work will be done in close collaboration with the Grid Forum and the DataGrid project to profit from their results and experience, and to obtain full interoperability. This will result in the further extension of the Grid across eleven European countries.

6.6 International Virtual Data Grid Laboratory and DataTAG

The International Virtual Data Grid Laboratory (iVDGL) will comprise heterogeneous computing and storage resources in the U.S., Europe, Asia, Australia and South America, linked by high-speed networks. It will be operated as a single system for the purposes of interdisciplinary experimentation in Grid-enabled data-intensive scientific computing. Laboratory users will include international scientific collaborations such as the Laser Interferometer Gravitational-wave Observatory (LIGO), the ATLAS and CMS detectors at the Large Hadron Collider (LHC) at CERN, the Sloan Digital Sky Survey (SDSS), and the U.S. National Virtual Observatory (NVO); application groups affiliated with the NSF supercomputer centers and EU projects; outreach activities; and Grid technology research efforts.

The laboratory itself will be created by deploying a carefully crafted data grid technology base across an international set of sites, each of which provides substantial computing and storage capability accessible via iVDGL software. The 40+ sites, of varying sizes, will include U.S. sites put in place specifically for the laboratory; sites contributed by EU, Japanese, Australian, and potentially other international collaborators; existing facilities that are owned and managed by the scientific collaborations; and facilities placed at outreach institutions. These sites will be connected by very high speed national and transoceanic networks. Several Grid Operations Centers (GOCs), located in different countries, will provide the essential management and coordination elements required to ensure overall functionality and to reduce operational overhead on resource centers. The U.S. portion of iVDGL was funded by NSF for $13.65M for the period 2001–2006.

The DataTAG project, recently funded by the European Commission for 4M Euros, will complement iVDGL by providing a very high-performance transatlantic research network operating initially at 2.5 Gbits/sec. A further aim of DataTAG is to create a large-scale intercontinental Grid testbed involving the EU DataGrid [8] project, several national projects in Europe, and related Grid projects in the USA. This will allow exploration of advanced networking technologies and of interoperability issues between different Grid domains.
7 Common Infrastructure

Given the international nature of the experiments participating in these projects (some of them, like the LHC experiments, participate in several projects at once), there is widespread recognition by the scientists involved of the importance of developing common protocols and tools to enable inter-Grid operation and to avoid costly duplication. Scientists from several of these projects are already working together to coordinate their architectures and build this common infrastructure.

8 Summary

Data Grid technologies embody entirely new approaches to the analysis of large data collections, in which the resources of an entire scientific community are brought to bear on the analysis and discovery process, and data products are made available to all community members, regardless of location. Large interdisciplinary efforts recently funded in the U.S. and EU have begun research and development of the basic technologies required to create working Data Grids. Over the coming years, they will deploy, evaluate, and optimize Data Grid technologies on a production scale, and integrate them into production applications. The experience gained with these new information infrastructures, providing transparent managed access to massive distributed data collections, will be applicable to large-scale data-intensive problems in a wide spectrum of scientific and engineering disciplines, and eventually in industry and commerce. Such systems will be needed in the coming decades as a central element of our information-based society.

References

[1] "The Grid: Blueprint for a New Computing Infrastructure", eds. Ian Foster and Carl Kesselman, Morgan Kaufmann, 1999.
[2] Formerly known by the name "Computational Grid", the term "Grid" reflects the fact that the resources to be shared may be quite heterogeneous and have little to do with computing per se.
[3] I. Foster, C. Kesselman, S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International Journal of High Performance Computing Applications, 15(3), 200-222, 2001, http://www.globus.org/anatomy.pdf.
[4] The total required CPU resources factor in inefficiencies, additional reconstruction passes and simulations.
[5] MONARC home page, http://monarc.web.cern.ch/MONARC/.
[6] GriPhyN home page, http://www.griphyn.org/.
[7] PPDG home page, http://www.ppdg.net/.
[8] EU DataGrid home page, http://www.eu-datagrid.org/.
[9] See the ICFA Network Task Force Requirements Report (1998), http://l3www.cern.ch/~newman/icfareq98.html.
[10] The U.S. CMS example of the hardware, services and manpower requirements for a Tier1 center, including a work breakdown structure for the Tier1 facility at Fermilab, may be found in the November 2000 report by V. O'Dell et al., "US-CMS User Facility Subproject", http://home.fnal.gov/~bauerdic/uscms/scp/baserev_nov_2000/rcv304.pdf.
[11] VRVS home page, http://vrvs.caltech.edu.
[12] Livny, M., High-Throughput Resource Management. In Foster, I. and Kesselman, C. (eds.), The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999, 311-337.
[13] Moore, R., Baru, C., Marciano, R., Rajasekar, A. and Wan, M., Data-Intensive Computing. In Foster, I. and Kesselman, C. (eds.), The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999, 105-129.
[14] Johnston, W.E., Gannon, D. and Nitzberg, B., Grids as Production Computing Environments: The Engineering Aspects of NASA's Information Power Grid. In Proc. 8th IEEE Symposium on High Performance Distributed Computing, 1999, IEEE Press.
[15] Stevens, R., Woodward, P., DeFanti, T. and Catlett, C., From the I-WAY to the National Technology Grid. Communications of the ACM, 40(11):50-61, 1997.
[16] Information Power Grid home page, http://www.ipg.nasa.gov/.
[17] Beiriger, J., Johnson, W., Bivens, H., Humphreys, S. and Rhea, R., Constructing the ASCI Grid. In Proc. 9th IEEE Symposium on High Performance Distributed Computing, 2000, IEEE Press.
[18] Fischer, P., McKay, T.A., Sheldon, E., Connolly, A., Stebbins, A. and the SDSS collaboration, Weak Lensing with SDSS Commissioning Data: The Galaxy-Mass Correlation Function to 1/h Mpc, Astron. J., in press, 2000.
[19] Luppino, G.A. and Kaiser, N., Ap. J. 475, 20, 1997.
[20] Tyson, J.A., Kochanek, C. and Dell'Antonio, I.P., Ap. J. 498, L107, 1998.
[21] PPDG home page, http://www.ppdg.net/.
[22] Earth System Grid home page, http://www.earthsystemgrid.org/.
[23] GriPhyN Project home page, http://www.griphyn.org/.
[24] TeraGrid home page, http://www.teragrid.org/.
[25] International Virtual Data Grid Laboratory home page, http://www.ivdgl.org/.
[26] Globus home page, http://www.globus.org/.
[27] NCSA home page, http://www.ncsa.edu/.
[28] These applications are described more fully at http://www.teragrid.org/about_faq.html.
[29] The ATLAS Experiment, A Toroidal LHC ApparatuS, http://atlasinfo.cern.ch/Atlas/Welcome.html.
[30] The CMS Experiment, A Compact Muon Solenoid, http://cmsinfo.cern.ch/Welcome.html.
[31] BaBar, http://www.slac.stanford.edu/BFROOT.
[32] D0, http://www-d0.fnal.gov/.
[33] STAR, http://www.star.bnl.gov/.
[34] Jlab experiments, http://www.jlab.org/.
[35] SAM, http://d0db.fnal.gov/sam/.
[36] LIGO home page, http://www.ligo.caltech.edu/.
[37] SDSS home page, http://www.sdss.org/.