Secure, Reliable, and Efficient Data Replica Management in Grid Networks

Kelly Clynes and Caitlin Minteer
Radford University

Abstract

This paper describes two services that are fundamental to any Data Grid: high-speed transport and replica management. The replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas. The paper also gives a brief overview of the Globus Toolkit, the justification for the Data Management topic, the Replica Location Service, and the four main components of the Globus Toolkit.

1 Introduction

The term grid computing refers to the emerging computational and networking infrastructure designed to provide pervasive, uniform, and reliable access to data, computational, and human resources distributed over wide-area environments. Grid computing developed in order to apply the resources of many computers in a network to a single scientific or technological problem at the same time. Data-intensive, high-performance computing applications in the grid require the efficient management and transfer of information in a wide-area, distributed computing environment, with file sizes as large as terabytes or petabytes. Massive data sets must be shared by large communities of hundreds or thousands of researchers distributed around the world. These researchers require efficient transfer of large data sets to perform analyses at their local sites or at other remote resources. In many cases, researchers create local copies, or replicas, of the experimental data sets to overcome wide-area data transfer latencies.

Grid computing uses software to divide and distribute pieces of a program to as many as several thousand computers. The Globus Toolkit, developed by the Globus Alliance, is an open-source toolkit for building computing grids.
It lets people share computing power, databases, and other tools securely online across corporate, institutional, and geographic boundaries without sacrificing local autonomy. The toolkit includes not only software services but also libraries for resource monitoring, discovery, security, and file management. It is packaged as a set of components that can be used either independently or together to develop applications. The Globus Toolkit has grown through an open-source strategy that encourages broader, more rapid adoption and leads to greater technical innovation, as the open-source community provides continual enhancements to the product. In addition to being a central part of science and engineering projects that total nearly a half-billion dollars internationally, the Globus Toolkit is a substrate on which leading IT companies are building significant commercial Grid products.

Every organization has different modes of operation. Collaboration between multiple organizations is hindered by incompatibility of resources such as data archives, computers, and networks. The Globus Toolkit was created to remove the obstacles that prevent seamless collaboration. Its core services, interfaces, and protocols allow users to access remote resources as if they were located in their own machine room, while at the same time preserving local control over who can use resources and when.

This paper discusses two fundamental data management components, Reliable File Transfer and the Replica Location Service. It also provides a brief overview of Data Management, GridFTP, and Replica Management.

2 Secure, Reliable Transfer

Data-intensive applications such as scientific applications require two fundamental data management components, upon which higher-level components can be built. The first is Reliable File Transfer (RFT), a web service that provides "job scheduler"-like functionality for data movement, built on a data transfer protocol for wide-area environments.
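RFT's "job scheduler"-like behavior, queueing transfer requests and retrying the ones that fail, can be illustrated with a minimal sketch. The class and method names below are hypothetical, chosen only for this illustration; they are not the actual RFT interface.

```python
from dataclasses import dataclass


@dataclass
class TransferJob:
    """One source-to-destination transfer request, tracked like a scheduler job."""
    source: str
    destination: str
    status: str = "Pending"
    attempts: int = 0


class ReliableFileTransfer:
    """Toy model of RFT-style behavior: queue transfer jobs, retry on failure."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.jobs = []

    def submit(self, source, destination):
        # Like submitting a batch job: the request is queued, not run immediately
        job = TransferJob(source, destination)
        self.jobs.append(job)
        return job

    def run(self, do_transfer):
        # do_transfer(job) returns True on success; failed jobs are retried
        for job in self.jobs:
            while job.attempts < self.max_retries:
                job.attempts += 1
                if do_transfer(job):
                    job.status = "Done"
                    break
            else:
                job.status = "Failed"
```

In the real toolkit, RFT persists job state and drives transfers over GridFTP; the sketch only captures the queue-and-retry idea.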
Ideally, this transfer protocol would be universally adopted to provide access to the widest variety of available storage systems. The second is the Replica Location Service (RLS), which includes services for registering and locating all physical locations of files and collections. Higher-level services that can be built upon these fundamental components include reliable creation of a copy of a large data collection at a new location; selection of the best replica for a data transfer operation based on performance estimates provided by external information services; and automatic creation of new replicas in response to application demands.

3 The Globus Architecture and Data Management

The Globus Toolkit has four main components: the Grid Security Infrastructure (GSI), the Globus Resource Management architecture, the Globus Information Management architecture, and the Data Management architecture. The Grid Security Infrastructure provides authentication and authorization services using public key certificates as well as Kerberos authentication. The Globus Resource Management architecture provides a language for specifying application requirements, mechanisms for immediate and advance reservation of one or more computational components, and several interfaces for submitting jobs to remote machines. The Globus Information Management architecture provides a distributed scheme for publishing and retrieving information about resources in the wide-area environment; a distributed collection of information servers is accessed by higher-level services that perform resource discovery, configuration, and scheduling. The last major component of Globus is the Data Management architecture. The Globus Data Management architecture, or Data Grid, provides two fundamental components: a universal data transfer protocol for grid computing environments called GridFTP, and a replica management infrastructure for managing multiple copies of shared data sets.
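The RLS mapping from logical file names to physical replica locations can be sketched as a simple registry. This is an illustration only; the real RLS is a distributed service built from local catalogs and global indexes, and the names below are hypothetical.

```python
from collections import defaultdict


class ReplicaLocationService:
    """Toy registry mapping logical file names to physical replica URLs."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, logical_name, physical_url):
        # Record one more physical location holding a copy of this file
        self._replicas[logical_name].add(physical_url)

    def unregister(self, logical_name, physical_url):
        # Forget a replica, e.g. after local storage is reclaimed
        self._replicas[logical_name].discard(physical_url)

    def lookup(self, logical_name):
        # Return all known physical locations for this logical file
        return sorted(self._replicas[logical_name])
```

A query for a logical name returns every registered physical copy, which is exactly the primitive that higher-level replica selection builds on.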
4 GridFTP: A Secure, Efficient Data Transport Mechanism

FTP is a widely implemented and well-understood IETF standard protocol. The FTP protocol was chosen for extension because it is the protocol most commonly used for data transfer on the Internet and the most likely candidate for meeting the Grid's needs; as a result, there is a large base of code and expertise from which to build. The FTP protocol provides a well-defined architecture for protocol extensions and supports dynamic discovery of the extensions supported by a particular implementation. Numerous groups have added extensions through the IETF, and some of these extensions are particularly useful in the Grid. In addition to client/server transfers, the FTP protocol also supports transfers directly between two servers, mediated by a third-party client (a "third-party transfer").

GridFTP is a universal grid data transfer and access protocol that provides secure, efficient data movement in Grid environments. The protocol, which extends the standard FTP protocol, provides a superset of the features offered by the various Grid storage systems currently in use. Using GridFTP as a common data access protocol would be mutually advantageous to grid storage providers and users: storage providers would gain a broader user base, because their data would be available to any client, while storage users would gain access to a broader range of storage systems and data.

The following diagrams illustrate GridFTP. The control channel (CC) is the path between client and server used to exchange all information needed to establish data channels. The data channel (DC) is the network pathway over which the files flow. The control channel interpreter (CCI) is the server-side implementation of the control channel functionality, and the client is the client-side implementation. The data protocol interpreter (DPI) handles the actual transfer of files.
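The separation between control channel and data channel is what makes third-party transfer possible: the client holds a control channel to each server, while the file flows directly between the two DPIs and never passes through the client. A toy sketch of that interaction follows; the names are hypothetical, and a real GridFTP client would issue FTP commands such as PASV over the control channel rather than Python calls.

```python
class GridFTPServer:
    """Toy server combining a control-channel interpreter and a DPI."""

    def __init__(self, name, registry):
        self.name = name
        self.files = {}
        self._registry = registry   # stands in for the network
        registry[name] = self       # make this server reachable by address

    # --- commands the client issues over the control channel ---
    def passive(self):
        # Like PASV: agree to accept an incoming data-channel connection
        return self.name

    def send_to(self, path, dest_addr):
        # Like a transfer command: the DPI pushes the bytes over the
        # data channel directly to the destination server's DPI
        self._registry[dest_addr].files[path] = self.files[path]


def third_party_transfer(src_server, dst_server, path):
    """The client mediates over two control channels; no data touches it."""
    addr = src_server and dst_server.passive()  # CC to destination server
    src_server.send_to(path, addr)              # CC to source server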
Figure 1: Simple Two-Party Transfer
Figure 2: Simple Third-Party Transfer
Figure 3: Striping

5 Replica Management

This section discusses the Globus Replica Management architecture, which is responsible for managing complete and partial copies of data sets. Replica management is an important issue for a number of scientific applications. As an example, consider a data set that contains petabytes of experimental results for a particle physics application. While the complete data set may exist in one or possibly several physical locations, it is likely that many universities, research laboratories, or individual researchers will have insufficient storage to hold a complete copy. Instead, they will store copies of the most relevant portions of the data set on local storage for faster access. Services provided by a replica management system include:

- creating new copies of a complete or partial data set;
- registering these new copies in a Replica Catalog;
- allowing users and applications to query the catalog to find all existing copies of a particular file or collection of files;
- selecting the "best" replica for access based on storage and network performance predictions provided by a Grid information service.

The Globus replica management architecture is a layered architecture. At the lowest level is a Replica Catalog that allows users to register files as logical collections and provides mappings between logical names for files and collections and the storage system locations of one or more replicas of these objects.
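Replica selection, the last service listed above, amounts to minimizing predicted transfer time over the known copies. A minimal sketch, assuming a bandwidth predictor of the kind a Grid information service might supply (here just a plain dictionary, purely for illustration):

```python
def select_best_replica(replica_urls, size_bytes, predicted_bandwidth):
    """Return the replica URL with the lowest estimated transfer time.

    predicted_bandwidth maps a replica URL to an estimated transfer
    rate in bytes per second, as an information service might report.
    """
    # Estimated time = file size / predicted rate; pick the minimum
    return min(replica_urls,
               key=lambda url: size_bytes / predicted_bandwidth[url])
```

In practice the predictions would come from live network and storage measurements, so the choice can differ from transfer to transfer even for the same logical file.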
A Replica Catalog API in C has also been implemented, along with a command-line tool; these functions and commands perform low-level manipulation operations on the replica catalog, including creating, deleting, and modifying catalog entries. The basic replica management services can be used by higher-level tools to select among replicas based on network or storage system performance, or to create new replicas automatically at desirable locations. Some of these higher-level services will be implemented in the next generation of the replica management infrastructure.

6 Summary

This paper has given a brief overview of the Globus Toolkit and the justification for the Data Management topic. The RLS and RFT components were explained, as well as the four main components of the Globus Toolkit and replica management.