Secure, Reliable, and Efficient Data Replica
Management in Grid Networks
Kelly Clynes and Caitlin Minteer
Radford University
Abstract
This paper describes two services that are fundamental to any Data Grid: high-speed
transport and replica management. The replica management service integrates a replica catalog
with GridFTP transfers to provide for the creation, registration, location, and management of
dataset replicas. The paper also gives a brief overview of the Globus Toolkit, covers the Replica
Location Service, motivates the Data Management topic, and discusses the four main components
of the Globus Toolkit.
1 Introduction
The term grid computing refers to the emerging computational and networking
infrastructure that is designed to provide pervasive, uniform and reliable access to data,
computational, and human resources distributed over wide area environments. Grid computing
has developed in order to apply the resources of many computers in a network to a single
scientific or technological problem at the same time. Data-intensive, high-performance computing
applications in the grid require the efficient management and transfer of information in a wide-area,
distributed computing environment, with files as large as terabytes or petabytes.
Massive data sets must be shared by a large community of hundreds or thousands of
researchers who are distributed around the world. These researchers require efficient transfer of
large data sets to perform analyses at their local sites or at other remote resources. In many
cases, the researchers create local copies or replicas of the experimental data sets to overcome
wide-area data transfer latencies.
Grid computing uses software to divide and distribute pieces of a program to as many as
several thousand computers. The Globus Toolkit, developed by the Globus Alliance, is an open-source
toolkit for building such computing grids. It lets people share
computing power, databases, and other tools securely online across corporate, institutional, and
geographic boundaries without sacrificing local independence. The toolkit not only includes
software services but also libraries for resource monitoring, discovery, security, and file
management. It is packaged as a set of components that can be used either independently or
together to develop applications. The Globus Toolkit has grown through an open-source strategy
that encourages a broader, more rapid adoption and leads to greater technical innovation, as the
open-source community provides continual enhancements to the product.
In addition to being a central part of science and engineering projects that total nearly a
half-billion dollars internationally, the Globus Toolkit is a substrate on which leading IT companies
are building significant commercial Grid products. Every organization has different modes of
operation. Collaboration between multiple organizations is affected by incompatibility of resources
such as data archives, computers, and networks. The Globus Toolkit was created to remove
obstacles that prevent seamless collaboration. Its core services, interfaces and protocols allow
users to access remote resources as if they were located within their own machine room while at
the same time preserving local control over who can use resources and when.
This paper discusses two fundamental data management components: Reliable File
Transfer and Replica Location Service. It also provides a brief overview of Data
Management, GridFTP, and Replica Management.
2 Secure, Reliable Transfer
Data-intensive applications such as scientific applications require two fundamental data
management components, upon which higher-level components can be built:

- Reliable File Transfer (RFT) is a web service that provides “job scheduler”-like
functionality for data movement. The underlying transfer protocol is designed for wide-area
environments; ideally, this protocol would be universally adopted to provide access to the
widest variety of available storage systems.
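The “job scheduler”-like behavior of a reliable transfer service — retrying a failed movement rather than giving up — can be sketched as a simple retry loop. This is an illustration only, not the actual RFT implementation; `transfer_once` is a hypothetical stand-in for a single GridFTP transfer attempt:

```python
import time

def reliable_transfer(transfer_once, max_retries=5, initial_backoff=1.0):
    """Attempt a transfer, retrying with exponential backoff on failure.

    `transfer_once` is any callable that performs one transfer attempt
    and raises an exception on failure (a stand-in for a GridFTP call).
    """
    backoff = initial_backoff
    for attempt in range(1, max_retries + 1):
        try:
            return transfer_once()
        except Exception:
            if attempt == max_retries:
                raise  # give up after the final attempt
            time.sleep(backoff)
            backoff *= 2  # exponential backoff between retries
```

A real service additionally persists transfer state so that a restarted server can resume in-progress jobs; the loop above captures only the retry behavior.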

- Replica Location Service (RLS) includes services for registering and locating all
physical locations for files and collections.

Higher-level services that can be built upon these fundamental components include reliable
creation of a copy of a large data collection at a new location; selection of the best replica
for a data transfer operation based on performance estimates provided by external
information services; and automatic creation of new replicas in response to application demands.
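The replica-selection service mentioned above can be sketched as choosing, among the known physical copies, the one with the lowest estimated transfer time. This is a toy cost model; in practice the bandwidth and latency figures would come from an external Grid information service, and the URLs below are hypothetical:

```python
def select_best_replica(replicas, file_size_bytes):
    """Pick the replica with the lowest estimated transfer time.

    `replicas` maps a physical location (URL) to a performance estimate:
    (bandwidth in bytes/sec, round-trip latency in seconds), as might be
    supplied by an external information service.
    """
    def estimated_time(entry):
        url, (bandwidth, latency) = entry
        return latency + file_size_bytes / bandwidth
    return min(replicas.items(), key=estimated_time)[0]
```

Note that the best choice depends on the file size: for a very small file the low-latency site wins, while for a large file the high-bandwidth site wins.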
3 The Globus Architecture and Data Management
The Globus Toolkit has four main components: the Grid Security Infrastructure (GSI),
the Resource Management architecture, the Information Management architecture, and the
Data Management architecture.
Grid Security Infrastructure (GSI) provides authentication and authorization services
using public key certificates as well as Kerberos authentication.
The Globus Resource Management architecture provides a language for specifying
application requirements and mechanisms for immediate and advance reservations of one or
more computational components. This architecture also provides several interfaces for submitting
jobs to remote machines.
The Globus Information Management architecture provides a distributed scheme for
publishing and retrieving information about resources in the wide area environment. A distributed
collection of information servers is accessed by higher-level services that perform resource
discovery, configuration and scheduling.
The last major component of Globus is the Data Management architecture. The Globus
Data Management architecture, or Data Grid, provides two fundamental components: a universal
data transfer protocol for grid computing environments called GridFTP and a Replica
Management infrastructure for managing multiple copies of shared data sets.
4 GridFTP: A Secure, Efficient Data Transport Mechanism
FTP is a widely implemented and well-understood IETF standard protocol. FTP was chosen
for extension because it is the protocol most commonly used for data transfer on the Internet
and the most likely candidate for meeting the Grid’s needs. As a
result, there is a large base of code and expertise from which to build. The FTP protocol provides
a well-defined architecture for protocol extensions and supports dynamic discovery of the
extensions supported by a particular implementation. Numerous groups have added extensions
through the IETF, and some of these extensions will be particularly useful in the Grid. In addition
to client/server transfers, the FTP protocol also supports transfers directly between two servers,
mediated by a third party client (i.e. “third party transfer”).
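In a third-party transfer, the mediation happens entirely over the control channels: the client asks one server to listen passively (PASV), parses the listening address from the reply, and hands that address to the other server (PORT) before issuing the STOR/RETR commands. The address handling can be sketched as follows; the parsing and formatting follow standard FTP (RFC 959), and the example reply is illustrative:

```python
import re

def parse_pasv_reply(reply):
    """Extract (host, port) from a PASV reply such as
    '227 Entering Passive Mode (192,168,1,5,78,52)'.
    The last two numbers encode the port as high*256 + low."""
    fields = re.search(r"\((\d+,\d+,\d+,\d+,\d+,\d+)\)", reply).group(1)
    h1, h2, h3, h4, p1, p2 = map(int, fields.split(","))
    return "%d.%d.%d.%d" % (h1, h2, h3, h4), p1 * 256 + p2

def make_port_command(host, port):
    """Build the PORT command telling the other server where to connect."""
    return "PORT %s,%d,%d" % (host.replace(".", ","), port // 256, port % 256)
```

The client would send PASV to the receiving server, feed the parsed address through `make_port_command` to the sending server, and then issue the transfer commands on both control channels.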
GridFTP is a universal grid data transfer and access protocol that provides
secure, efficient data movement in Grid environments. This protocol, which extends the standard
FTP protocol, provides a superset of the features offered by the various Grid storage systems
currently in use. Using GridFTP as a common data access protocol would be mutually
advantageous to grid storage providers and users. Storage providers would gain a broader user
base, because their data would be available to any client, while storage users would gain access
to a broader range of storage systems and data.
The following diagrams illustrate GridFTP. The Control Channel (CC) is the
path between client and server used to exchange all information needed to establish
data channels. The Data Channel (DC) is the network pathway over which the files flow. The
Control Channel Interpreter (CCI) is the server-side implementation of the control channel
functionality, and the client is the client-side implementation; the Data Protocol
Interpreter (DPI) handles the actual transfer of files.
Figure 1: Simple Two Party Transfer (the client holds a control channel to the server's CCI; the file moves between the two DPIs over a data channel).
Figure 2: Simple Third Party Transfer (the client holds control channels to the CCIs of two servers; the file moves directly between the servers' DPIs over a data channel).
Figure 3: Striping (the client holds control channels to two striped servers; multiple DPIs on each side move portions of the file in parallel over several data channels).
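In the striped transfer of Figure 3, the file is divided among several DPIs, each moving its own portion in parallel. The byte-range layout can be sketched as a simple contiguous partition; this is a simplified illustration — real GridFTP striping uses an extended block mode that tags each block with its offset:

```python
def stripe_ranges(file_size, num_stripes):
    """Split a file into contiguous (offset, length) ranges,
    one per stripe, so each DPI pair moves one range in parallel."""
    base, extra = divmod(file_size, num_stripes)
    ranges, offset = [], 0
    for i in range(num_stripes):
        # distribute the remainder across the first `extra` stripes
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges
```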
5 Replica Management
In this section, the Globus Replica Management architecture, which is responsible for
managing complete and partial copies of data sets, will be discussed. Replica management is an
important issue for a number of scientific applications. As an example, consider a data set that
contains petabytes of experimental results for a particle physics application. While the complete
data set may exist in one or possibly several physical locations, it is likely that many universities,
research laboratories or individual researchers will have insufficient storage to hold a complete
copy. Instead, they will store copies of the most relevant portions of the data set on local storage
for faster access. Services provided by a replica management system include:
- Creating new copies of a complete or partial data set

- Registering these new copies in a Replica Catalog

- Allowing users and applications to query the catalog to find all existing copies of a
particular file or collection of files

- Selecting the ``best'' replica for access based on storage and network
performance predictions provided by a Grid information service
The Globus replica management architecture is a layered architecture. At the lowest level is a
Replica Catalog that allows users to register files as logical collections and provides mappings
between logical names for files and collections and the storage system locations of one or more
replicas of these objects. A Replica Catalog API in C has been implemented, along with a
command-line tool; these functions and commands perform low-level manipulation operations on
the replica catalog, including creating, deleting, and modifying catalog entries. The basic replica
management services can be used by higher-level tools to select among replicas based on network
or storage system performance, or to automatically create new replicas at desirable locations.
Some of these higher-level services will be implemented in the next generation of the replica
management infrastructure.
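The lowest layer described above — mappings from logical names to the storage locations of one or more replicas — can be sketched as a small in-memory catalog. This is an illustration of the idea only; the actual Replica Catalog API is in C, and the method and file names below are invented:

```python
class ReplicaCatalog:
    """Minimal in-memory stand-in for a replica catalog: maps
    logical file names to the physical locations of their replicas."""

    def __init__(self):
        self._mappings = {}  # logical name -> set of physical locations

    def register(self, logical_name, physical_location):
        """Record that a replica of `logical_name` exists at `physical_location`."""
        self._mappings.setdefault(logical_name, set()).add(physical_location)

    def lookup(self, logical_name):
        """Return all known physical locations for a logical file."""
        return sorted(self._mappings.get(logical_name, set()))

    def unregister(self, logical_name, physical_location):
        """Remove a replica entry, e.g. after a copy is deleted."""
        locations = self._mappings.get(logical_name, set())
        locations.discard(physical_location)
        if not locations:
            self._mappings.pop(logical_name, None)
```

Higher-level tools would sit on top of such a mapping, combining `lookup` results with performance estimates to choose which replica to access.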
6 Summary
This paper has given a brief overview of the Globus Toolkit and the justification for the
Data Management topic. The RFT and RLS services were explained, along with the four main
components of the Globus Toolkit and replica management.