Persistent Archive Concept Paper Persistent Archive Research Group Global Grid Forum Draft 3.0

Reagan W. Moore (San Diego Supercomputer Center)
moore@sdsc.edu
June 30, 2002
The Persistent Archive Research Group of the Grid Forum promotes the development of
an architecture for the construction of persistent archives. The architecture addresses the
issue of management of technology evolution. During the lifetime of the persistent
archive, each component may be upgraded multiple times. The challenge is creating an
architecture that maintains the authenticity of the archived collection, while minimizing
the effort needed to incorporate new technology. Fortunately, persistent archives are
conceptually equivalent to virtual data grids. They both need mechanisms to support
access to heterogeneous types of storage systems and information repositories, while
supporting the re-creation of derived data products. Hence the persistent archive research
group is addressing issues related to how persistent archives can be built from virtual data
grids. Virtual data grids are distributed systems that tie together data management
systems and compute resources. Virtual data grids are differentiated from data grids by
the ability to access derived data products and to re-create the derived data products from
a process description.
A persistent archive manages data collections that are treated as derived data products. A
description of the processing steps used to create a collection can be archived. A request
for access to the collection can result in a query against an instantiated version of the
collection, or the execution of the processing steps to instantiate the collection. Hence a
persistent archive is a virtual data grid that manages access to derived data collections,
and manages the re-creation of derived data collections if they are not already
instantiated. Persistent archives also manage the migration of data from old storage
systems to new storage systems and manage transformative migrations, in which the
encoding standard used to describe a data entity is changed to a new standard.
Management of transformative migrations again is equivalent to management of derived
data products in a virtual data grid.
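The idea that a collection is a derived data product, which is either queried directly when it is instantiated or re-created from its archived process description, can be sketched as follows. This is only an illustrative model: the class name `DerivedCollection`, its fields, and the representation of processing steps as Python callables are assumptions made for the example, not part of any data grid described in this paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Record = dict
Step = Callable[[List[Record]], List[Record]]


@dataclass
class DerivedCollection:
    """A collection treated as a derived data product (hypothetical model)."""
    name: str
    steps: List[Step]              # archived description of the processing steps
    source_records: List[Record]   # archived digital entities
    _instance: Optional[List[Record]] = field(default=None, repr=False)

    def is_instantiated(self) -> bool:
        return self._instance is not None

    def instantiate(self) -> List[Record]:
        """Re-create the collection by executing the archived processing steps."""
        records = list(self.source_records)
        for step in self.steps:
            records = step(records)
        self._instance = records
        return records

    def query(self, **criteria) -> List[Record]:
        """Answer a request, instantiating the collection first if needed."""
        if not self.is_instantiated():
            self.instantiate()
        return [r for r in self._instance
                if all(r.get(k) == v for k, v in criteria.items())]
```

A request therefore never needs to know whether the collection already exists; the first query triggers the instantiation and later queries reuse the instantiated version.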
Archivists rely on persistent archives to support archival processes, including appraisal,
accessioning, arrangement, description, preservation, and access. A virtual data grid
provides support for data entities, which are either the data collection or the digital
entities organized by the data collection. The virtual data grid capabilities can be
separated into the functionalities that support archival practice:
Appraisal:
− Selection of materials to archive – Data grids provide a logical name space that
supports the organization of digital entities into collection/sub-collection hierarchies.
The logical name space is decoupled from the underlying storage systems, making it
possible to sort digital entities without moving them. It is possible to represent a
digital entity by a handle (such as a URL), register the pointers into the logical name
space, and organize the pointers independently of the data.
Accessioning:
− Controlled data import – Data grids put digital entities under management control,
such that automated processing can be done across an entire collection. Data grids
provide support for proxies (remote processes) that can be used to validate data
models, extract metadata, and authenticate the identification of the submitter.
Arrangement:
− Context management – Data grids use collections to define the context to associate
with data entities. The context includes provenance information describing the
processes under which the data entities were created, attributes used to support
information discovery to identify an individual data entity, and relationships that can
be used to determine whether associated attribute values are consistent with implied
knowledge about the collection, or represent anomalies and artifacts. The context
management also is used to control the level of granularity associated with the
organization of the data entities into collections/sub-collections. Containers, the
digital equivalent of a cardboard box, are used to physically aggregate data entities.
Description:
− Global name space – The ability to identify derived data products is based on
persistent logical identifiers that are independent of the local storage system file
names. For persistent archives this includes the ability to provide persistent logical
identifiers for the data entities stored within the data collections. The hierarchical
organization of the global name space into collections/sub-collections is extended by
supporting sub-collection specific metadata. Each sub-collection is described by an
extensible set of attributes that can be defined independently of other sub-collections.
− Derived product characterization – Both the derived data products and the processes
used to generate the derived data products can be characterized in virtual data grids.
For persistent archives, the derived data product can be a data collection or the
transformative migration of a data object. Infrastructure independent representations
are used to describe both the derived data product and the processes used to re-create
the derived data product.
Preservation:
− Instantiation – A virtual data grid provides the ability to execute a process
description. For a persistent archive, this is the ability to instantiate a data collection
from its infrastructure independent representation.
− Data access interoperability – The ability to migrate data products between different
types of storage systems is provided by data grids. For persistent archives, this
includes the ability to migrate a collection from old database technology onto new
database technology.
− Disaster recovery – Data grids manage replicas of digital entities, replicas of
collection attributes, and replicas of collections. The replicas can be located at
geographically remote sites, ensuring safety from local disasters.
− Persistency – Virtual data grids provide a consistent environment, which guarantees
that the administrative attributes used to identify derived data products always remain
consistent with migrations performed on the data entities. The consistent state is
extended into a persistent state through management of the information encoding
standards used to create platform independent representations. The ability to migrate
from an old representation of an information encoding standard to a new
representation leads to persistent management of derived data products.
− Authenticity – Data grids track all operations done on each data entity, to be able to
guarantee that the information and knowledge content of each digital entity remains
unchanged. Audit trails and digital signatures are used.
Access:
− Derived data product access – Virtual data grids provide direct access to the derived
data product when it exists. This implies the ability to store information about
derived data products within a management collection that can be queried. A similar
management collection, or finding aid, is used to characterize the multiple data
collections and contained data entities that are stored in a persistent archive.
− Data transport – Data grids provide transport mechanisms for accessing data in a
distributed environment that spans multiple administration domains. This includes
support for moving data and metadata in bulk, while authenticating the user. Data
grids also provide multiple roles for characterizing the allowed operations on the
stored data, independently of the underlying storage systems. Users can be assigned
the capabilities of a curator, with the ability to create new sub-collections; an
annotator, with the ability to add comments about the digital entities; a submitter,
with the ability to write data into a specified sub-collection; or a public user, with the
ability to read selected sub-collections.
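The role model described under data transport can be sketched as a small access check that is independent of any storage system. The role names come from the text; the exact set of operations granted to each role (for example, that a curator may also read and write) and the names `Operation`, `ROLE_OPERATIONS`, and `AccessControl` are assumptions made for illustration.

```python
from enum import Enum, auto
from typing import Dict, Tuple


class Operation(Enum):
    READ = auto()
    WRITE = auto()
    ANNOTATE = auto()
    CREATE_SUBCOLLECTION = auto()


# Assumed mapping from the roles named in the text to allowed operations.
ROLE_OPERATIONS = {
    "curator": {Operation.READ, Operation.WRITE, Operation.ANNOTATE,
                Operation.CREATE_SUBCOLLECTION},
    "annotator": {Operation.READ, Operation.ANNOTATE},
    "submitter": {Operation.READ, Operation.WRITE},
    "public": {Operation.READ},
}


class AccessControl:
    """Roles are granted per user and per sub-collection of the logical name space."""

    def __init__(self) -> None:
        self._grants: Dict[Tuple[str, str], str] = {}

    def grant(self, user: str, sub_collection: str, role: str) -> None:
        if role not in ROLE_OPERATIONS:
            raise ValueError(f"unknown role: {role}")
        self._grants[(user, sub_collection)] = role

    def is_allowed(self, user: str, sub_collection: str, operation: Operation) -> bool:
        role = self._grants.get((user, sub_collection))
        return role is not None and operation in ROLE_OPERATIONS[role]
```

Because the check refers only to logical sub-collections, the same roles continue to apply when the underlying storage system is replaced.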
Data grids provide the mechanisms needed to support distributed data access across
heterogeneous data resources. The heterogeneous data resources represent different
versions of storage systems and databases as they evolve over time. When a new
infrastructure component is added to a persistent archive, both the old version and new
version will be accessed simultaneously while the data and information content is
migrated onto the new technology. Through use of replication, the migration can be done
transparently to the users.
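A minimal sketch of migration through replication is given below, under the assumption that storage systems can be modeled as in-memory dictionaries. The names `ReplicaCatalog`, `read`, and `migrate` are hypothetical. The point illustrated is that the old and new resources both hold valid replicas during the migration, so a read by logical name succeeds before, during, and after the old resource is retired.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for storage systems: resource name -> {physical path: content}.
Storage = Dict[str, Dict[str, bytes]]


class ReplicaCatalog:
    """Maps a logical name to the replicas that currently hold its content."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[Tuple[str, str]]] = {}

    def register(self, logical_name: str, resource: str, path: str) -> None:
        self._replicas.setdefault(logical_name, []).append((resource, path))

    def replicas(self, logical_name: str) -> List[Tuple[str, str]]:
        return list(self._replicas.get(logical_name, []))

    def retire(self, logical_name: str, resource: str) -> None:
        """Drop every replica of logical_name held on the given resource."""
        self._replicas[logical_name] = [
            r for r in self.replicas(logical_name) if r[0] != resource
        ]


def read(catalog: ReplicaCatalog, storage: Storage, logical_name: str) -> bytes:
    """Return the content from the first replica that is still reachable."""
    for resource, path in catalog.replicas(logical_name):
        if path in storage.get(resource, {}):
            return storage[resource][path]
    raise LookupError(f"no reachable replica for {logical_name}")


def migrate(catalog: ReplicaCatalog, storage: Storage,
            logical_name: str, old: str, new: str) -> None:
    """Copy replicas from old to new and register them; the old replicas stay valid."""
    for resource, path in catalog.replicas(logical_name):
        if resource == old:
            storage.setdefault(new, {})[path] = storage[old][path]
            catalog.register(logical_name, new, path)
```

Retiring the old resource is kept as a separate step (`retire`) so that it can be delayed until the new replicas have been validated.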
Persistent Archive Functionality Requirements
The requirements for a persistent archive can be expressed in general as "transparencies"
that hide virtual data grid implementation details. Examples include digital entity name
transparency, data location transparency, platform implementation transparency,
encoding standard transparency, and authentication transparency or single sign-on
systems. The capabilities of a persistent archive can be characterized as the set of
“transparencies” needed to manage technology evolution. Implementations exist for at
least four key functionalities or transparencies that simplify the complexity of accessing
distributed heterogeneous systems:
− Name transparency – The ability to identify a desired digital entity without knowing
its name can be accomplished by queries on descriptive attributes, organized as a
collection. Persistent archives are inherently archives of collections of digital entities
that map from unique attribute values to a global, persistent, identifier.
− Location transparency – The ability to retrieve a digital entity without knowing where
it is stored can be accomplished through use of a logical name space that maps from
the global, persistent, identifier to a physical storage location and physical file name.
If the data grid owns the digital entities (stored under the data grid user ID), the
administrative attributes for storage location and file name can be self-consistently
updated every time the digital entity is moved.
− Platform implementation transparency – The ability to retrieve a digital entity from
arbitrary types of storage systems can be accomplished through use of a data grid that
understands the protocols needed to talk to the storage systems. Every time a new
type of storage system is added to the persistent archive, a new driver is added to the
data grid to map from the new storage access protocol to the data grid data transport
protocol.
− Encoding standard transparency – The ability to display a digital entity requires
understanding the associated data model and encoding standard for information. If
infrastructure independent standards are used for the data model and encoding
standard (non-proprietary, published formats), a persistent archive can use
transformative migrations to maintain the ability to display the digital entities. The
transformative migrations will need to be defined only between the infrastructure
independent standards.
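The first three transparencies can be illustrated with a small sketch: descriptive attributes map to a persistent identifier (name transparency), the identifier maps to a storage resource and physical path (location transparency), and a per-resource driver hides the access protocol (platform implementation transparency). The names `StorageDriver`, `InMemoryDriver`, and `DataGrid` are assumptions made for this example and do not correspond to the interface of any of the data grids discussed here.

```python
from typing import Dict, List, Protocol, Tuple


class StorageDriver(Protocol):
    """Protocol-specific access to one type of storage system."""
    def get(self, physical_path: str) -> bytes: ...


class InMemoryDriver:
    """Stand-in driver; a real one would speak an archive or file system protocol."""
    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def get(self, physical_path: str) -> bytes:
        return self._files[physical_path]


class DataGrid:
    def __init__(self) -> None:
        self._drivers: Dict[str, StorageDriver] = {}
        self._attributes: Dict[str, Dict[str, str]] = {}   # id -> descriptive attributes
        self._locations: Dict[str, Tuple[str, str]] = {}   # id -> (resource, path)

    def add_driver(self, resource: str, driver: StorageDriver) -> None:
        """Adding a new type of storage system only requires a new driver."""
        self._drivers[resource] = driver

    def register(self, persistent_id: str, attributes: Dict[str, str],
                 resource: str, physical_path: str) -> None:
        self._attributes[persistent_id] = dict(attributes)
        self._locations[persistent_id] = (resource, physical_path)

    def find(self, **attributes: str) -> List[str]:
        """Name transparency: discover identifiers by descriptive attributes."""
        return sorted(pid for pid, attrs in self._attributes.items()
                      if all(attrs.get(k) == v for k, v in attributes.items()))

    def move(self, persistent_id: str, resource: str, physical_path: str) -> None:
        """Update the administrative attributes when the entity is moved."""
        self._locations[persistent_id] = (resource, physical_path)

    def retrieve(self, persistent_id: str) -> bytes:
        """Location and platform transparency: the caller supplies only the identifier."""
        resource, path = self._locations[persistent_id]
        return self._drivers[resource].get(path)
```

Encoding standard transparency is not modeled here; it would add a transformation step between `retrieve` and display.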
The infrastructure that supports the above transparencies exists in multiple data grid
implementations. A research issue is whether additional features need to be added to
virtual data grids to support persistent archives. An example is the need for authenticity,
which implies a data grid in which only authorized actions can take place. Every
operation within the persistent archive must be tracked, and the corresponding metadata
updated to guarantee consistency of the metadata.
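One way to picture the tracking requirement is an audit trail in which each entry records the operation, a hash of the entity content, and the hash of the previous entry, so that later tampering with the trail can be detected. The sketch below is an assumption about how such a trail might look; it uses a hash chain built with Python's standard `hashlib` module rather than true digital signatures, and the class name `AuditTrail` is hypothetical.

```python
import hashlib
import json
from typing import Dict, List


class AuditTrail:
    """Append-only record of operations, chained by SHA-256 hashes."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    @staticmethod
    def _hash(entry: Dict[str, str]) -> str:
        # Hash every field except the entry's own hash, in a stable key order.
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

    def record(self, user: str, operation: str, entity_id: str, content: bytes) -> Dict[str, str]:
        previous = self._entries[-1]["entry_hash"] if self._entries else ""
        entry = {
            "user": user,
            "operation": operation,
            "entity_id": entity_id,
            "content_hash": hashlib.sha256(content).hexdigest(),
            "previous_hash": previous,
        }
        entry["entry_hash"] = self._hash(entry)
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Return True if no entry has been altered, removed, or reordered."""
        previous = ""
        for entry in self._entries:
            if entry["previous_hash"] != previous or entry["entry_hash"] != self._hash(entry):
                return False
            previous = entry["entry_hash"]
        return True
```

Comparing the stored `content_hash` of the latest entry with a freshly computed hash of the entity would additionally show whether the content itself has changed outside the archive's control.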
A second research issue is whether the levels of abstraction associated with virtual data
grids are consistent with operations in persistent archives. One can think of a data grid as
the set of abstractions that manage differences across storage repositories, information
repositories, knowledge repositories, and execution systems. Data grids also provide
abstraction mechanisms for interacting with the objects that are manipulated within the
grid, including data entities (logical namespace), processes (service characterizations or
application specifications), and interaction environments (portals). The data grid
approach can be defined as a set of services, and the associated APIs and protocols used
to implement the services. The data grid is augmented with portals that are used to
assemble integrated work environments to support specific applications or disciplines. A
major question is whether a persistent archive is better implemented as a virtual data grid,
incorporating the required functionality directly into the grid, or as a portal, with the
required authenticity and management control implemented as an application interface.
Data Grid Implementations
To better understand the current status of data grids, we present an analysis of the
capabilities that are already provided by production systems. The Global Grid Forum is
promoting the development of standards for the implementation of data grids. One of the
challenges is defining the minimal set of functionalities that are needed to implement a
data grid. Is it possible to define a common set of capabilities across all of the data grid,
data collection, digital library, and persistent archive projects that are already in
production? A second challenge is to understand the set of capabilities that may be
required by future applications. To answer both questions, a comparison has been made
between the Storage Resource Broker (SRB) data grid from the San Diego
Supercomputer Center, the European DataGrid replication environment (based upon
GDMP, a project in common between the European DataGrid and the Particle Physics
Data Grid, and augmented with an additional product of the European DataGrid for
storing and retrieving meta-data in relational databases called Spitfire and other
components), the Scientific Data Management (SDM) data grid from Pacific Northwest
Laboratory, the Globus toolkit, the Sequential Access using Metadata (SAM) data grid
from Fermi National Accelerator Laboratory, the Magda data management system from
Brookhaven National Laboratory, and the JASMine data grid from Jefferson National
Laboratory. These systems have evolved as the result of input by user communities for
the management of data across heterogeneous, distributed storage resources.
The EDG, SAM, Magda, and JASMine data grids support high energy physics data. The
SDM system provides a digital library interface to archived data for PNL and manages
data from multiple scientific disciplines. The Globus toolkit provides services that can be
composed to create a data grid. The SRB data handling system is used in projects for
multiple US federal agencies, including the NASA Information Power Grid (digital
library front end to archival storage), the DOE Accelerated Strategic Computing Initiative
(collection-based data management), the National Library of Medicine Visible Embryo
project (distributed data collection), the National Archives and Records Administration
(persistent archive), the NSF National Partnership for Advanced Computational
Infrastructure (distributed data collections for astronomy, earth systems science, and
neuroscience), the Joint Center for Structural Genomics (data grid), and the National
Institutes of Health Biomedical Informatics Research Network (data grid).
The systems we examine therefore include not only data grids, but also distributed data
collections, digital libraries and persistent archives. Since the core component of each
system is a data grid, we can expect common capabilities to exist across the multiple
implementations. The systems that provide the largest number of features tend to have
the most diverse set of user requirements.
The comparison is an attempt at understanding what data grid architectures provide to
meet existing application requirements. The capabilities are organized into functional
categories, such that a given capability is listed only once. The categories have been
chosen based on the need to manage a logical name space or replica catalog, the
management of attributes in the logical name space, the storage abstraction for accessing
remote storage systems, the types of data manipulation, and the data grid architecture.
Since the listed data grids have been in use for multiple years, the features that have been
developed represent a comprehensive cross-section of the features in actual use by
production systems. The table listing the capabilities is given in Appendix A. Terms
used in the comparison are explained in Appendix B.
Common Data Grid Capabilities:
What is most striking is that common data grid capabilities are emerging across all of the
data grids. Appendix C lists the common features organized by functional category.
Each data grid implements a logical name space that supports the construction of a
uniform naming convention across multiple storage systems. The logical name space is
managed independently of the physical file names used at a particular site, and a mapping
is maintained between the logical file name and the physical file name. Each data grid
has added attributes to the name space to support location transparency, file
manipulation, and file organization. Most of the grids provide support for hierarchical
logical folders within the namespace, and support for ownership of the files by a
community or collection ID.
The logical name space attributes typically include the replica storage location, the local
file name, and user-defined attributes. Mechanisms are provided to automate the
generation of attributes such as file size and creation time. The attributes are created
synchronously when the file is registered into the logical name space, but many of the
grids also support asynchronous registration of attributes.
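The registration behavior described above can be sketched as follows, assuming a simple in-memory catalog. System attributes (here the size and a registration time stamp) are generated automatically when the object is registered, while user-defined attributes can be supplied at registration or added later. The class `LogicalNameSpace` and its method names are hypothetical, and the content is passed as bytes only to keep the example self-contained.

```python
import time
from typing import Any, Dict


class LogicalNameSpace:
    """Catalog of logical names with replica, system, and user-defined attributes."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, Any]] = {}

    def register(self, logical_name: str, storage_location: str,
                 local_file_name: str, data: bytes, **user_attributes: str) -> None:
        """Synchronous registration: attributes are created together with the entry."""
        self._entries[logical_name] = {
            "storage_location": storage_location,   # replica attribute
            "local_file_name": local_file_name,     # replica attribute
            "size": len(data),                      # generated automatically
            "registered_at": time.time(),           # generated automatically
            "user": dict(user_attributes),          # user-defined attributes
        }

    def annotate(self, logical_name: str, **user_attributes: str) -> None:
        """Later (asynchronous) addition of user-defined attributes."""
        self._entries[logical_name]["user"].update(user_attributes)

    def attributes(self, logical_name: str) -> Dict[str, Any]:
        entry = self._entries[logical_name]
        return dict(entry, user=dict(entry["user"]))
```

The logical name is the only key a user needs; the local file name and storage location are ordinary attributes that can change without affecting it.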
Most of the grids support synchronous replica creation, and provide data access through
parallel I/O. The grids check transmission status and support data transport restart at the
application level. Writes to the system are done synchronously, with standard error
messages returned to the user. At the moment, the error messages are different across
each of the data grids. The grids have statically tuned the network parameters (window
size and buffer size) for transmission over wide area networks. Most of the grids provide
interfaces to the GridFTP transport protocol.
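Application-level transport restart can be illustrated by a transfer loop that resumes from the number of bytes already received instead of starting over. This is a simplified sketch under the assumption that both ends are in-memory buffers; the function name `resume_transfer` and the `max_blocks` parameter (used only to simulate an interruption) are hypothetical and do not describe the GridFTP protocol.

```python
from typing import Optional


def resume_transfer(source: bytes, received: bytearray,
                    block_size: int = 64 * 1024,
                    max_blocks: Optional[int] = None) -> bool:
    """Append the missing blocks of source to received.

    The transfer starts at the offset that has already been received, so a
    restart after an interruption does not resend completed blocks.
    Returns True when the transfer is complete, False if it stopped early.
    """
    blocks_sent = 0
    while len(received) < len(source):
        if max_blocks is not None and blocks_sent >= max_blocks:
            return False  # simulated interruption
        offset = len(received)
        received.extend(source[offset:offset + block_size])
        blocks_sent += 1
    return True


def transmission_ok(source: bytes, received: bytearray) -> bool:
    """Status check after the transfer: the received content matches the source."""
    return bytes(received) == source
```

In a real deployment the status check would compare sizes or checksums reported by the remote system rather than the content itself.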
The most common access APIs to the data grids are a C++ I/O library, a command line
interface and a Java interface. The grids are implemented as distributed client server
architectures. Most of the grids support federation of the servers, enabling third party
transfer. All of the grids provide access to storage systems located at remote sites
including at least one archival storage system. The grids also currently use a single
catalog server to manage the logical name space attributes. All of the data grids provide
some form of latency management, including caching of files on disk, streaming of data,
and replication of files.
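Of the latency management mechanisms listed, caching is the simplest to sketch: files retrieved from a slow archive are kept in a bounded cache so that repeated reads avoid the archive. The example below assumes a least-recently-used eviction policy and an in-memory cache, neither of which is specified by the grids compared here; the class name `CachingReader` is hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class CachingReader:
    """Serves reads from a bounded cache, falling back to a slower archive fetch."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int = 2) -> None:
        self._fetch = fetch                 # function that reads from the archive
        self._capacity = capacity
        self._cache: "OrderedDict[str, bytes]" = OrderedDict()
        self.archive_reads = 0              # counts how often the archive was used

    def read(self, logical_name: str) -> bytes:
        if logical_name in self._cache:
            self._cache.move_to_end(logical_name)   # mark as most recently used
            return self._cache[logical_name]
        data = self._fetch(logical_name)
        self.archive_reads += 1
        self._cache[logical_name] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)         # evict the least recently used file
        return data
```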
Persistent Archive Components
Given a consensus on the set of capabilities provided by a data grid, it is possible to
identify those capabilities that are relevant to the creation of a persistent archive. In
Appendix A, a column was included to identify the relevance of each capability towards
the implementation of a persistent archive. Capabilities were identified as either
necessary (marked by a “Yes”), or as a useful extension (marked by an “E”).
A description of how a particular data grid capability would be used by a persistent
archive is provided in Appendix D. The description is organized from the perspective of
the persistent archive community into the infrastructure components needed to support
data ingestion, manage the persistent archive, and support discovery and access.
Conclusion:
By comparing the implementations of seven different data grids, a common set of data
management capabilities has been identified for accessing data in distributed
environments. Based upon these common capabilities, a set of core capabilities has
been identified as the components of a data-grid based implementation of a persistent
archive.
The proposed set of core and extended grid capabilities needed to implement a persistent
archive is open to debate. While a common set is evident for data grids, a demonstration
is still needed that these services are sufficient for building a persistent archive. Prior
efforts at identifying persistent archive components have identified support for collection
creation through metadata extraction, creation of logical name spaces to manage data in
distributed environments, collection building from archived forms, and discovery and
access mechanisms as the central components of a persistent archive. This paper
provides a first appraisal of which of the core grid operations and extended grid
operations are needed to implement a persistent archive.
Acknowledgements:
The data grid and toolkit characterizations were provided by the following persons, and
this comparison was only possible through their support: Igor Terekhov (Fermi National
Accelerator Laboratory), Torre Wenaus (Brookhaven National Laboratory), Scott
Studham (Pacific Northwest Laboratory), Chip Watson (Jefferson Laboratory), Heinz
Stockinger and Peter Kunszt (CERN), Ann Chervenak (Information Sciences Institute,
University of Southern California), and Arcot Rajasekar (San Diego Supercomputer Center).
Appendix A.
A comparison of the capabilities provided by data grids
In the following table, areas where an implementation is planned for
marked with P . Areas where information has not been received are l
capabilities have been organized into eleven categories, with up to t
capabilities per category. A total of 152 capabilities are characteri
quarters of the capabilities (120) have been implemented in at least
About one-third of the capabilities (50) have been implemented in at
grids. These 50 capabilities comprise the current core features of d
An eighth column has been added to indicate whether a particular data
be part of persistent archive infrastructure. The decision is based
capability will make it feasible to manage a data collection while th
technology changes. Mechanisms that support an abstraction of storag
abstraction of information repositories are included as part of the p
capabilities. For the persistent archive column, an E indicates an
improve the ability to manage a persistent archive. A total of 80 ca
identified as essential for a persistent archive, with another 50 cap
provide an extended environment.
Capability
Logical name space
Pers.
Globus
EDG
Arch.
Toolkit
JASMine
Magda SAM
SDM
SRB
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Logical name space independen
Yes
physical name space
Yes
P
Yes
Yes
Yes
Yes
Yes
P
P
Yes
No
No
Yes
Yes
No
No
Yes
Yes
Yes
No
Yes
No
No
Yes
No
No
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
No
No
No
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
P
P
P
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Data referenced by catalog ow
Yes
Collection ID
P
P
No
Yes
Yes
Yes
Yes
Registration of files as obje
Yes
name space
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Registration of databases as
logical name space
No
No
No
Yes
No
Yes
Hierarchical logical folders Yes
Unix node operations on logic
space — directory manipulatio
Yes
rename, delete, create and r
and folders
Recursive operations on logic
space directories for both st E
retrieval
Deletion of entities from log
Yes
space
Delete by setting a deletion
E
Delete by removing attribute Yes
Soft links between objects in
folders so that a single file Yes
in multiple folders
Data referenced by catalog ow
No
user ID
E
Registration of database blob
No
in logical name space
No
Registration of persistent ha
Yes
objects in logical name space
No
Capability
No
Pers.
Globus
EDG
Arch.
Toolkit
No
No
Yes
No
No
No
JASMine
Magda SAM
No
Yes
No
Yes
SDM
SRB
Logical name space Attributes Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
System level attributes Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Replica attributes for storag
Yes
local file name
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Replica attributes for type o
system
No
No
No
Yes
Yes
Yes
Yes
E
Role based access control lis
Yes
logical name space
P
P
P
No
Yes
Yes
Group access control lists fo
Yes
logical name space
P
P
P
Yes
Yes
Yes
No
No
Yes
Yes
Yes
No
Yes
Yes
Yes
Yes
Yes
No
Yes
No
Yes
Access control lists for logi
space attributes, to control Yes
add, and change metadata
Extended set of user roles (a
Yes
collection curator, annotator
Access control lists for reso No
Yes
I/O access pattern
E
No
Yes
No
Audit trails for updates and/ Yes
No
P
No
P
No
No
Yes
No
No
Yes
Yes
Yes
No
Yes
No
Yes
No
Yes
Yes
Yes
Yes
No
Yes
Yes
Yes
Yes
Yes
User defined attributes for d No
Yes
No
Yes
Yes
Yes
Yes
User defined attributes for c No
P
No
No
Yes
Annotation attributes
E
User profiles to describe sto
No
history
Discipline specific attri Yes
Yes
Yes
Dublin Core attributes
Yes
Template based metadata extra
Yes
catalog attributes
No
Version attribute
User level attributes
Yes
Yes
No
Yes
Yes
Yes
No
No
No
P
Yes
Yes
Yes
No
No
No
No
P
No
P
No
No
No
No
External catalog accessible f
attributes
E
No
Physics tags
No
P
Yes
No
Yes
Yes
Yes
Yes
Yes
No
Yes
Yes
No
Yes
Yes
Yes
Capability
Pers.
Globus
EDG
Arch.
Toolkit
Attribute manipulation
Yes
Query interface to discover f
Yes
attributes
Automated attribute generatio
Yes
time stamp
User managed synchronous attr
Yes
update
Asynchronous annotation of ob
E
logical name space
Export of attributes as XML f
Yes
python file
Bulk asynchronous load of att Yes
Bulk asynchronous load of att
E
from XML or python file
Capability
Yes
Yes
JASMine
Yes
P
Magda SAM
SDM
SRB
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
No
Yes
Yes
Yes
Yes
Yes
No
No
Yes
Yes
Yes
Yes
No
No
Yes
Yes
Yes
Yes
No
No
No
No
Yes
No
Yes
SDM
SRB
Pers.
Globus
EDG
Arch.
Toolkit
JASMine
Magda SAM
Data manipulation
Synchronous creation of repli
associated metadata creation
Asynchronous creation of repl
Registration of user owned da
replica of an existing object
name space
Master instances of replicas
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
No
Yes
E
P
Yes
Yes
Yes
Yes
Yes
No
No
No
No
No
No
Yes
Yes
P
Yes
Yes
Load balancing across physica
E
No
Yes
No
Yes
Yes
Yes
Containers for aggregating sm Yes
Logical aggregation of object
No
on tape
Container locking on writes
Yes
Updates to containers by appe Yes
Replication of containers
No
Yes
No
No
No
No
No
No
Yes
No
No
Yes
No
Yes
No
No
No
No
No
No
N/A
No
Yes
No
No
No
No
No
Yes
Yes
No
No
No
No
No
Yes
Container replica invalidatio Yes
Staging of containers from ar
Yes
disk
Synchronization of containers Yes
No
No
No
No
No
Yes
No
No
No
No
No
Yes
No
No
No
No
No
Yes
Multiple disk caches for cont Yes
No
No
No
No
No
Yes
Multiple archives for storing Yes
Logical container names that
Yes
multiple physical containers
No
No
No
No
No
Yes
No
No
No
No
No
Yes
Capability
Data Access
Pers.
Globus
EDG
Arch.
Toolkit
JASMine
Magda SAM
SDM
SRB
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Parallel I/O support
E
Yes
Yes
Yes
Yes
Yes
No
Yes
Parallel I/O on get/put comma
E
Yes
Yes
Yes
Yes
Yes
No
Yes
Parallel I/O on partial file
Transmission status checking
level
E
Yes
Yes
No
No
No
No
No
Yes
Yes
Yes
Yes
Yes
Yes
No
Yes
E
Yes
Yes
Yes
No
No
Yes
No
Transmission restart after in
Yes
application level
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Storage completion at end of
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
E
No
No
No
No
Yes
Yes
Yes
Yes
Transmission block tagging to
transmission restart after in
Replication completion when w
of n physical resources
Standard error messages from
systems, network, and data ha Yes
system
Striping support
E
Yes
Yes
Yes
Yes
Yes
Yes
Yes
No
No
Yes
P
Thread safe client
Yes
Yes
No
Yes
Yes
Yes
E
Static network tuning
Yes
Dynamic network tuning (windo
E
buffer size)
GridFtp protocol support
E
Yes
Yes
Yes
No
P
No
Yes
No
Yes
No
No
P
No
No
No
Yes
P
Yes
P
No
P
TCP/IP custom control protoco No
No
No
No
No
P
No
Yes
Java parallel custom control
No
No
Yes
No
No
No
No
E
User-selectable transfer
Gridftp, (scp No
Push and Pull data movement
Capability
Multiple Access APIs
Yes
E
Pers.
Globus
EDG
Arch.
Toolkit
Yes
Yes
No
Yes
Yes
Yes
JASMine
Magda SAM
SRB
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Remote data access by I/O red
from Linux or Solaris system
E
Yes
No
No
No
No
Yes
C I/O library API
E
Yes
Yes
No
No
Yes
Yes
C++ I/O library API
Command line interface
Java interface
Yes
SDM
E
Yes
No
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
E
Yes
P
Yes
Yes
Yes
Yes
Web interface
Yes
P
No
Yes
Yes
Yes
No
Yes
Web Services Interface (WSDL)
Yes
P
P
Visual Basic interface
No
No
No
No
No
No
Yes
Yes
XML query interface
E
P
Yes
No
No
P
No
No
DLL/API
interface for Python No
P
P
Yes
No
Yes
Predicate assertion interface
E
No
No
No
No
Yes
SDLIP interface
E
No
No
No
No
P
Windows browser interface
E
No
No
No
Yes
Yes
No
Capability
Globus
Pers.
EDG
Toolkit
Arch.
JASMine
Magda SAM
Distributed client-server arc Yes
Yes
Yes
Yes
Yes
Federated client server
Yes
Yes
Yes
Yes
Yes
Distributed servers
Yes
Yes
Yes
Yes
Yes
Distributed storage systems
Yes
Yes
Yes
Yes
Yes
64-bit name space
Yes
Logical resources that repres
E
physical resources
Third party transfer from sto
Yes
to specified remote destinati
Yes
GSI authentication
PKI authentication
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
No
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
No
No
Yes
Yes
Yes
Yes
Yes
No
P
Yes
E
Yes
Yes
Yes
No
Yes
No
Yes
P
Yes
Via
P
gsiftp
No
P
Yes
No
No
Yes
No
No
No
Parti
Pers.
Globus
EDG
Arch.
Toolkit
JASMine
Magda SAM
Yes
Yes
Yes
Yes
Yes
Streaming
Yes
Yes
Yes
Yes
Yes
Caching
Yes
Yes
Yes
Yes
Yes
No
No
Yes
No
No
Containers for metadata
Yes
Remote I/O proxies for aggreg
commands, remote data filteri Yes
metadata extraction
Remote Proxies through
DataCutter
E
No
No
Prefetch (partial file cachin No
Remote Proxies through
GridFTP
Yes
Yes
Yes
Latency Management
Containers for data
SRB
No
Challenge response authentica No
Ticket based access control f
accesses and restricted time No
access
Capability
Yes
SDM
Yes
Yes
Yes
SDM
SRB
Yes
Yes
Yes
Yes
Yes
Yes
No
Yes
Yes
No
Yes
Yes
No
Yes
No
Yes
No
No
No
Yes
No
No
No
No
No
Yes
Yes
E
No
Staging
Yes
Yes
Status checking
Yes
Yes
Replication
Yes
Yes
Yes
No
No
No
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
No
Yes
Yes
Yes
Yes
No
Yes
Capability
Multiple system support
Pers.
Globus
EDG
Arch.
Toolkit
JASMine
Magda SAM
SDM
SRB
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Storage Resource Manager inte E
Database interface for readin
E
blobs
Database interface for querie
relational database registere E
object
Database interface for export
Yes
attributes as an XML file
Archive interface to at least Yes
Yes
No
Yes
P
Yes
No
Yes
No
No
No
No
No
Yes
Yes
No
No
No
No
Yes
Yes
No
No
No
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Archive interface to HPSS
E
Yes
Yes
No
Yes
No
Yes
Archive interface to DMF
E
Yes
Yes
No
No
No
Yes
Archive interface to ADSM
E
Yes
No
No
No
No
Yes
Archive interface to Enstore
No
No
Archive interface to UniTree
E
Yes
No
No
No
No
Yes
Archive interface to JASMine
Archive interface to HPSS for
sizes greater than 2 GB
Archive interface to Castor
Single catalog server, databa
technology used to replicate
Hierarchical distributed cata
No
No
No
Yes
No
No
No
E
Yes
No
No
No
No
Yes
No
Yes
No
Yes
Yes
Yes
Yes
Yes
Yes
No
P
P
No
No
Capability
No
Pers.
Globus
EDG
Arch.
Toolkit
JASMine
Yes
Yes
No
No
Magda SAM
Performance enhancements
Performance for import/export
greater than 20 files/sec
Performance for import/export
greater than 1100 files/sec
Bulk metadata load
Yes
P
No
Yes
Pre-spawned processes for dat
E
No
No
No
Database index optimization
E
Yes
Yes
Database communication tuning
E
Yes
Yes
No
Yes
No
No
SDM
SRB
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
No
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
Yes
No
No
Yes
Yes
Yes
Yes
Capability
Robustness: Fault tolerance a
handling
Automatic fail over to altern
when the first copy is unavai
Automatic retrials in metadat
access
Automatic retrials in archive
Data transfer resumption upon
restart (crash and reboot), t
the user jobs
Single-command resumption of
user jobs upon system restart
Configurable time-outs on use
every resource (to protect ag
abandoned/misbehaved user job
Handling of mass storage syst
software errors, including pr
compliance and other aberrati
Handling of underlying file t
(such as ftp) errors, includi
protocol compliance
File checksum
Globus
Pers.
EDG
Toolkit
Arch.
JASMine
Magda SAM
SDM
SRB
Yes
Yes
Yes
Yes
Yes
Yes
No
No
Yes
Yes
E
No
No
Yes
No
E
No
Yes
No
Yes
P
E
Yes
Yes
No
Yes
P
E
Yes
P
Yes
Yes
No
E
P
Yes
P
Yes
No
Yes
No
Yes
Yes
Yes
Yes
Yes
Yes
P
Yes
Yes
Yes
Yes
Yes
P
Yes
No
No
No
Yes
98
60
80
76
85
64
136
Number of capabilities (prese
130
planned) (total of 152)
Yes
No
Yes
Appendix B.
Definition of terms used in the capability comparison
Terms used in Appendix A to describe data grid capabilities are defin
• Registration corresponds to adding objects to the logical name spa
  logical name and storing a pointer to the file name used on the st
• Attributes represent information that is managed for each object t
  the logical name space
• Folders are equivalent to directories in a file system, but are us
  in the logical name space
• Soft links represent the cross registration of a single physical d
  folders in the logical name space
• Shadow links represent pointers to objects owned by individuals.
  register individual owned data into the logical name space, withou
  of the object on storage systems managed by the logical name space
• Replicas are copies of a file registered into the logical name spa
  on either the same storage system or on different file systems.
• A container is an aggregation of multiple data files into a single
• Curation control corresponds to the administration tasks associate
  managing a logical collection
• Metadata about the I/O access pattern is used to characterize inte
  digital entity, recording the types of partial file reads, writes,
• Template based metadata extraction applies a set of parsing rules
  identify relevant attributes, extracts the attributes, and loads t
  the logical collection.
• Load balancing for a logical name space consists of distributing d
  multiple storage systems
• Synchronous updates correspond to finishing both the data manipula
  associated metadata updates before the request is completed.
• Asynchronous updates correspond to completion of a request within
  system, after the return was given to a command.
• Storage completion at end of single write corresponds to synchrono
• Dynamic network tuning consists of adjusting the network transport
  parameters for each data transmission to change the number of mess
  before acknowledgements are required (window size) and the size of
  buffer that holds the copy of the messages until the acknowledgeme
• SDLIP is the Simple Digital Library Interoperability Protocol. It
  information for the digital library community
• Federated server architecture refers to the ability of distributed
  themselves without having to communicate through the initiating cl
• Third party transfer is the ability of two remote servers to move
  themselves, without having to move the data back to the initiating
• Bulk metadata load is the ability to import attribute values for m
  registered within the logical name space from a single input file.
• GSI authentication is the use of the Grid Security Infrastructure
  to the logical name space, and to authenticate servers to other se
  federated server architecture
• DataCutter is the data filtering service developed by Joel Saltz a
  University, which is executed directly on a remote storage system.
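Several of the terms above (registration, replicas, soft links, bulk metadata load) can be illustrated together with a small in-memory model. The following is a minimal sketch only, not the interface of any surveyed data grid: the class names, method names, and the tab-separated bulk-load format are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy: a storage system plus the file name used on it."""
    storage_system: str
    physical_name: str


@dataclass
class LogicalObject:
    """An entry in the logical name space with its replicas and attributes."""
    replicas: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)


class LogicalNameSpace:
    """Hypothetical in-memory logical name space (illustration only)."""

    def __init__(self):
        self._objects = {}  # logical name -> LogicalObject
        self._links = {}    # soft-link name -> target logical name

    def register(self, logical_name, storage_system, physical_name):
        """Registration: assign a logical name and store a pointer to the file.

        Registering the same logical name again records another replica.
        """
        obj = self._objects.setdefault(logical_name, LogicalObject())
        obj.replicas.append(Replica(storage_system, physical_name))
        return obj

    def soft_link(self, link_name, target_name):
        """Soft link: make one registered object visible under a second name."""
        if target_name not in self._objects:
            raise KeyError(target_name)
        self._links[link_name] = target_name

    def resolve(self, logical_name):
        """Follow a soft link, if any, and return the registered object."""
        return self._objects[self._links.get(logical_name, logical_name)]

    def bulk_load(self, lines):
        """Bulk metadata load from lines of 'name<TAB>attribute<TAB>value'.

        The line format is an assumption made for this sketch.
        """
        loaded = 0
        for line in lines:
            name, key, value = line.rstrip("\n").split("\t")
            self.resolve(name).attributes[key] = value
            loaded += 1
        return loaded
```

In this sketch the logical name stays fixed while the list of physical locations behind it can grow or change, which is the separation between the logical and physical name spaces that the comparison treats as a core capability.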
Appendix C. Capability Summary:
A consensus on the approach towards building data grids can be gather
which features are implemented by at least five of the seven surveyed
the eleven categories of capabilities covered by the comparison, the
represent a standard approach. The number of grids that provided a g
in parentheses, with the default value being all of the grids.
Logical name space
Logical name space independence from physical name space
Hierarchical logical folders (5)
Management of attributes used for each capability (registration, deletion)
Deletion of entities from logical name space
Soft links between objects in logical folders (6)
Support for collection owned data (5)
Registration of files into logical name space
Logical name space attributes
Replica storage location, local file name
Group access control lists (5)
Bulk asynchronous load of attributes (5)
User defined attributes (5)
Attribute manipulation
Automated size, time stamp
Synchronous attribute update
Asynchronous annotation (6)
Data Manipulation
Synchronous replica creation (6)
Data Access
Parallel I/O support (6)
Transmission status checking (6)
Transmission restart at application level
Synchronous storage write
Standard error messages
Thread safe client (5)
Static network tuning (5)
GridFTP support (5)
Access APIs
C++ I/O library API (5)
Command line interface
Java interface (6)
Web service interface (5)
Architecture
Distributed client server
Federated server (6)
Distributed storage system access
Third party transfer (5)
GSI authentication (5)
Latency Management
Streaming (6)
Caching
Replication (6)
Staging (5)
System Support
Storage Resource Manager interface (5)
Archive interface to at least one system
Single catalog server (6)
Performance for import/export of files greater than 20 files per sec (5)
Management of file transfer errors (5)
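The selection rule behind this summary (keep a feature when at least five of the seven surveyed grids provide it, and print the count in parentheses unless every grid provides it) can be sketched as follows. This is an illustration under stated assumptions: the grid labels `grid_1` to `grid_7` are placeholders, the placement of the "No" entries is arbitrary rather than taken from Appendix A, only a "Yes" entry is counted as support, and `Hypothetical feature` is invented to show a row that falls below the threshold.

```python
GRIDS = [f"grid_{i}" for i in range(1, 8)]  # placeholder labels for seven grids


def row(yes_count):
    """Build one table row in which the first `yes_count` grids say 'Yes'."""
    return {g: ("Yes" if i < yes_count else "No") for i, g in enumerate(GRIDS)}


def consensus_features(support, threshold=5):
    """Return (feature, count) pairs for features that meet the threshold."""
    result = []
    for feature, by_grid in support.items():
        count = sum(1 for value in by_grid.values() if value == "Yes")
        if count >= threshold:
            result.append((feature, count))
    return result


def format_entry(feature, count, total):
    """Show the count in parentheses unless every grid provides the feature."""
    return feature if count == total else f"{feature} ({count})"


# Illustrative input only; which grids say "No" is not taken from Appendix A.
support = {
    "Command line interface": row(7),
    "Streaming": row(6),
    "Hypothetical feature": row(3),
}

summary = [format_entry(f, c, len(GRIDS)) for f, c in consensus_features(support)]
```

With this input, `summary` holds the two entries that meet the threshold, formatted the same way as the list above.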
Appendix D: Persistent Archive Components
In Appendix A, a detailed list is provided to indicate whether a part
feature should be part of persistent archive infrastructure. The dec
whether the capability will make it feasible to manage a data collect
underlying technology changes. Mechanisms that support an abstractio
systems, and an abstraction of information repositories are included
persistent archive capabilities. A total of 80 capabilities are iden
persistent archive, with another 50 capabilities that would provide a
environment.
In this appendix, the capabilities are sorted into the components use
accessioning, arrangement, description, preservation, and access. Note that the
capabilities that are used across multiple archival processes are listed in the first process
where they are needed.
Appraisal Components:
− Selection of materials to archive – Support for logical name space to register the
materials, registration of handles (URLs) as well as files, organization of selected
material into a collection of retained material.
Persistent
Archive
Capability
Logical name space
Yes
Logical name space independence from physical name space
Yes
Hierarchical logical folders
Yes
Unix node operations on logical name space — directory man
delete, create and remove files and folders
Yes
Recursive operations on logical name space directories for
E
Deletion of entities from logical name space
Delete by setting a deletion attribute
Yes
E
Delete by removing attribute
Soft links between objects in logical folders so that a si
multiple folders
Data referenced by catalog owned by a Collection ID
Yes
Yes
Registration of files as objects in logical name space
Yes
Yes
Registration of databases as objects in logical name space
E
Registration of persistent handles as objects in logical n
Yes
Accessioning Components:
− Controlled data import – Support for validating data models, importing data over a
network, authenticating submitters, and loading data.
Capability                                                  Persistent Archive
Data Access                                                 Yes
Parallel I/O support                                        E
Parallel I/O on get/put commands                            E
Parallel I/O on partial file reads/writes                   E
Transmission status checking at the file level              Yes
Transmission block tagging to support transmission restart  E
Transmission restart after interruption at application lev  Yes
Storage completion at end of single write                   Yes
Standard error messages from storage systems, network, and  Yes
Striping support                                            E
Thread safe client                                          E
Static network tuning                                       Yes
Dynamic network tuning (window and buffer size)             E
GridFTP protocol support                                    E
Java parallel custom control protocol                       E
Push and Pull data movement                                 E
Capability                                                  Persistent Archive
Attribute manipulation                                      Yes
Automated attribute generation for size, time stamp         Yes
User managed synchronous attribute update                   Yes
Asynchronous annotation of objects in logical name space    E
Bulk asynchronous load of attributes                        Yes
Bulk asynchronous load of attributes from XML file          E
Persistent
Archive
Capability
Latency Management
Yes
Streaming
Yes
Caching
Yes
Containers for data
Yes
Containers for metadata
Remote I/O proxies for aggregating I/O commands, remote da
extraction
Remote Proxies through
DataCutter
Yes
Remote Proxies through
GridFTP
Yes
E
E
Staging
Yes
Status checking
Yes
Capability                                                  Persistent Archive
Multiple system support                                     Yes
Storage Resource Manager interface                          E
Database interface for reading/writing blobs                E
Database interface for queries to relational database regi  E
Database interface for exporting attributes as an XML file  Yes
Archive interface to at least one archive                   Yes
Archive interface to HPSS                                   E
Archive interface to DMF                                    E
Archive interface to ADSM                                   E
Archive interface to UniTree                                E
Archive interface to HPSS for storing file sizes greater t  E
Capability                                                  Persistent Archive
Distributed client-server architecture                      Yes
Federated client server                                     Yes
Distributed servers                                         Yes
Distributed storage systems                                 Yes
64-bit name space                                           Yes
Logical resources that represent multiple physical resourc  E
Third party transfer from storage system to specified remo  Yes
GSI authentication                                          Yes
PKI authentication                                          E
Capability                                                  Persistent Archive
Performance enhancements                                    Yes
Performance for import/export of files greater than 20 fil  Yes
Performance for import/export of files greater than 1100 f  Yes
Bulk metadata load                                          Yes
Pre-spawned processes for data transfers                    E
Database index optimization                                 E
Database communication tuning                               E
Capability                                                  Persistent Archive
Robustness: Fault tolerance and error handling              Yes
Automatic retrials in metadata catalogue access             E
Automatic retrials in archive access                        E
Data transfer resumption upon system restart (crash and re  E
  user jobs
Single-command resumption of very long user jobs upon syst  E
Configurable time-outs on user usages of every resource (t  E
  abandoned/misbehaved user jobs)
Handling of mass storage system software errors, including  Yes
  compliance and other aberrations
Handling of underlying file transfer tools (such as ftp) e  Yes
  compliance
File checksum                                               Yes
Arrangement Components:
− Context management – The logical name space can be used to create a logical
organization of the digital entities. Containers, the digital equivalent of a cardboard
box, are used to physically aggregate data entities.
Capability                                                  Persistent Archive
Data manipulation                                           Yes
Containers for aggregating small files                      Yes
Container locking on writes                                 Yes
Updates to containers by appending data                     Yes
Staging of containers from archive to disk                  Yes
Synchronization of containers to archives                   Yes
Multiple disk caches for containers                         Yes
Multiple archives for storing containers                    Yes
Logical container names that represent multiple physical c  Yes
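The container idea described above, physically aggregating many small data entities into one stored object and updating it by appending, can be sketched with a byte stream plus an index of offsets. This is a minimal illustration under assumptions: the `Container` class, its method names, and the in-memory `io.BytesIO` backing store are hypothetical and do not describe how any surveyed system lays out its containers.

```python
import io


class Container:
    """Aggregate many small files into one byte stream plus an offset index."""

    def __init__(self, stream=None):
        # An in-memory stream stands in for a file held on disk cache or archive.
        self._stream = stream if stream is not None else io.BytesIO()
        self._index = {}  # member name -> (offset, length)

    def append(self, name, data):
        """Update the container by appending data; earlier bytes are untouched."""
        self._stream.seek(0, io.SEEK_END)
        offset = self._stream.tell()
        self._stream.write(data)
        self._index[name] = (offset, len(data))
        return offset

    def read(self, name):
        """Return one member by seeking to its recorded offset."""
        offset, length = self._index[name]
        self._stream.seek(offset)
        return self._stream.read(length)

    def size(self):
        """Total number of bytes currently held in the container."""
        return self._stream.seek(0, io.SEEK_END)


box = Container()
box.append("note-1.txt", b"hello")
box.append("note-2.txt", b"world!")
```

Because the archive sees a single large object instead of many small ones, staging, synchronization, and replication in this model operate on the container as a whole, while the index still allows access to an individual member.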
Description Components:
− Global name space – Support for extensible sets of attributes that can be assigned to a
sub-collection within the collection/sub-collection hierarchy.
Capability                                                  Persistent Archive
Logical name space Attributes                               Yes
System level attributes                                     Yes
I/O access pattern                                          E
Version attribute                                           Yes
User level attributes                                       Yes
Annotation attributes                                       E
Discipline specific attributes                              Yes
Dublin Core attributes                                      Yes
Template based metadata extraction for catalog attributes   Yes
External catalog accessible for additional attributes       E
Preservation Components:
− Instantiation – Support for proxies for data model transformative migrations.
− Data access interoperability – Support for location transparency
− Disaster recovery – Support for data replication
− Persistency – Support for consistent metadata generation
− Authenticity – Support for audit trails, and control of access to materials
Capability                                                  Persistent Archive
Logical name space Attributes                               Yes
System level attributes                                     Yes
Replica attributes for storage location, local file name    Yes
Replica attributes for type of storage system               E
Role based access control lists for data in logical name s  Yes
Group access control lists for data in logical name space   Yes
Access control lists for logical name space attributes, to  Yes
  and change metadata
Extended set of user roles (administrator, collection cura  Yes
Audit trails for updates and/or accesses                    Yes
Capability                                                  Persistent Archive
Latency Management                                          Yes
Replication                                                 Yes

Capability                                                  Persistent Archive
Data Access                                                 Yes
Replication completion when write to k of n physical resou  E
Capability                                                  Persistent Archive
Data manipulation                                           Yes
Synchronous creation of replicas with associated metadata   Yes
Asynchronous creation of replicas                           E
Master instances of replicas                                Yes
Load balancing across physical resources                    E
Replication of containers                                   Yes
Container replica invalidation of updates                   Yes
Capability                                                  Persistent Archive
Multiple system support                                     Yes
Single catalog server, database technology used to replica  Yes
Access Components:
− Data transport – Support query based access to the collections, from multiple user
access environments. Support control for types of operations permitted on the
persistent archive based on user roles. Support both picking of single digital entities
and access to entire collections. Support display of digital entities.
Capability                                                  Persistent Archive
Attribute manipulation                                      Yes
Query interface to discover files by attributes             Yes
Export of attributes as XML file                            Yes
Capability                                                  Persistent Archive
Multiple Access APIs                                        Yes
Remote data access by I/O redirection from Linux or Solari  E
C I/O library API                                           E
C++ I/O library API                                         E
Command line interface                                      Yes
Java interface                                              E
Web interface                                               Yes
Web Services Interface (WSDL)                               Yes
XML query interface                                         E
Predicate assertion interface                               E
SDLIP interface                                             E
Windows browser interface                                   E
Capability                                                  Persistent Archive
Robustness: Fault tolerance and error handling              Yes
Automatic fail over to alternate replica when the first co  Yes
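Automatic fail over to an alternate replica, combined with the automatic retrials and file checksum capabilities listed among the accessioning components, might look like the sketch below. It is an illustration under assumptions, not the behavior of any surveyed system: the `fetch` callable, the `ReplicaUnavailable` exception, the retry count, and the choice of SHA-256 as the checksum are all hypothetical.

```python
import hashlib


class ReplicaUnavailable(Exception):
    """Raised by the (hypothetical) fetch callable when a copy cannot be read."""


def fetch_with_failover(replicas, fetch, expected_sha256=None, retries=2):
    """Try each replica in order, retrying a few times before failing over."""
    errors = []
    for replica in replicas:
        for attempt in range(1, retries + 1):
            try:
                data = fetch(replica)
            except ReplicaUnavailable as exc:
                errors.append((replica, attempt, str(exc)))
                continue  # automatic retrial against the same replica
            if expected_sha256 and hashlib.sha256(data).hexdigest() != expected_sha256:
                errors.append((replica, attempt, "checksum mismatch"))
                break  # a corrupt copy is not retried; move to the next replica
            return replica, data
    raise ReplicaUnavailable(f"all replicas failed: {errors}")


def demo_fetch(replica):
    """Stand-in storage access: the first copy is offline, the second works."""
    if replica == "disk-cache":
        raise ReplicaUnavailable("storage system offline")
    return b"payload"


source, payload = fetch_with_failover(["disk-cache", "tape-archive"], demo_fetch)
```

Here the request against the first copy is retried and then abandoned, and the data is served from the second copy without the caller having to name a physical location.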