Persistent Archive Concept Paper
Persistent Archive Research Group
Global Grid Forum
Draft 3.0

Reagan W. Moore (San Diego Supercomputer Center)
moore@sdsc.edu
June 30, 2002

The Persistent Archive Research Group of the Grid Forum promotes the development of an architecture for the construction of persistent archives. The architecture addresses the issue of management of technology evolution. During the lifetime of the persistent archive, each component may be upgraded multiple times. The challenge is creating an architecture that maintains the authenticity of the archived collection, while minimizing the effort needed to incorporate new technology. Fortunately, persistent archives are conceptually equivalent to virtual data grids. They both need mechanisms to support access to heterogeneous types of storage systems and information repositories, while supporting the re-creation of derived data products. Hence the persistent archive research group is addressing issues related to how persistent archives can be built from virtual data grids.

Virtual data grids are distributed systems that tie together data management systems and compute resources. Virtual data grids are differentiated from data grids by the ability to access derived data products and to re-create the derived data products from a process description. A persistent archive manages data collections that are treated as derived data products. A description of the processing steps used to create a collection can be archived. A request for access to the collection can result in a query against an instantiated version of the collection, or the execution of the processing steps to instantiate the collection. Hence a persistent archive is a virtual data grid that manages access to derived data collections, and manages the re-creation of derived data collections if they are not already instantiated.
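The access behavior described above (query the instantiated collection if it exists, otherwise re-create it from the archived process description) can be illustrated with a minimal Python sketch. The class DerivedCollection, its fields, and the two processing steps are hypothetical names chosen for illustration; they are not part of any data grid implementation discussed in this paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
Step = Callable[[List[Record]], List[Record]]


@dataclass
class DerivedCollection:
    """Hypothetical model of a collection treated as a derived data product."""
    name: str
    source: List[Record]                      # archived source records
    steps: List[Step]                         # archived process description
    instance: Optional[List[Record]] = None   # instantiated form, if any

    def instantiate(self) -> List[Record]:
        # Re-create the collection by executing the archived steps in order.
        records = [dict(r) for r in self.source]
        for step in self.steps:
            records = step(records)
        self.instance = records
        return records

    def query(self, predicate: Callable[[Record], bool]) -> List[Record]:
        # Query the instantiated version, instantiating it first if needed.
        if self.instance is None:
            self.instantiate()
        return [r for r in self.instance if predicate(r)]


def drop_drafts(records: List[Record]) -> List[Record]:
    return [r for r in records if r["status"] != "draft"]


def add_year(records: List[Record]) -> List[Record]:
    return [dict(r, year=r["date"][:4]) for r in records]


reports = DerivedCollection(
    name="reports",
    source=[
        {"id": "r1", "date": "2001-05-01", "status": "final"},
        {"id": "r2", "date": "2002-06-30", "status": "final"},
        {"id": "r3", "date": "2002-07-15", "status": "draft"},
    ],
    steps=[drop_drafts, add_year],
)

hits = reports.query(lambda r: r["year"] == "2002")
print([r["id"] for r in hits])  # prints ['r2']
```

In this sketch the caller never needs to know whether the collection was already instantiated; the first query triggers the re-creation and later queries reuse the stored instance.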
Persistent archives also manage the migration of data from old storage systems to new storage systems and manage transformative migrations, in which the encoding standard used to describe a data entity is changed to a new standard. Management of transformative migrations again is equivalent to management of derived data products in a virtual data grid.

Archivists rely on persistent archives to support archival processes, including appraisal, accessioning, arrangement, description, preservation, and access. A virtual data grid provides support for data entities, which are either the data collection or the digital entities organized by the data collection. The virtual data grid capabilities can be separated into the functionalities that support archival practice:

Appraisal:
− Selection of materials to archive – Data grids provide a logical name space that supports the organization of digital entities into collection/sub-collection hierarchies. The logical name space is decoupled from the underlying storage systems, making it possible to sort digital entities without moving them. It is possible to represent a digital entity by a handle (such as a URL), register the pointers into the logical name space, and organize the pointers independently of the data.

Accessioning:
− Controlled data import – Data grids put digital entities under management control, such that automated processing can be done across an entire collection. Data grids provide support for proxies (remote processes) that can be used to validate data models, extract metadata, and authenticate the identification of the submitter.

Arrangement:
− Context management – Data grids use collections to define the context to associate with data entities.
The context includes provenance information describing the processes under which the data entities were created, attributes used to support information discovery to identify an individual data entity, and relationships that can be used to determine whether associated attribute values are consistent with implied knowledge about the collection, or represent anomalies and artifacts. The context management also is used to control the level of granularity associated with the organization of the data entities into collections/sub-collections. Containers, the digital equivalent of a cardboard box, are used to physically aggregate data entities.

Description:
− Global name space – The ability to identify derived data products is based on persistent logical identifiers that are independent of the local storage system file names. For persistent archives this includes the ability to provide persistent logical identifiers for the data entities stored within the data collections. The hierarchical organization of the global name space into collections/sub-collections is extended by supporting sub-collection specific metadata. Each sub-collection is described by an extensible set of attributes that can be defined independently of other sub-collections.
− Derived product characterization – Both the derived data products and the processes used to generate the derived data products can be characterized in virtual data grids. For persistent archives, the derived data product can be a data collection or the transformative migration of a data object. Infrastructure independent representations are used to describe both the derived data product and the processes used to re-create the derived data product.

Preservation:
− Instantiation – A virtual data grid provides the ability to execute a process description. For a persistent archive, this is the ability to instantiate a data collection from its infrastructure independent representation.
− Data access interoperability – The ability to migrate data products between different types of storage systems is provided by data grids. For persistent archives, this includes the ability to migrate a collection from old database technology onto new database technology.
− Disaster recovery – Data grids manage replicas of digital entities, replicas of collection attributes, and replicas of collections. The replicas can be located at geographically remote sites, ensuring safety from local disasters.
− Persistency – Virtual data grids provide a consistent environment, which guarantees that the administrative attributes used to identify derived data products always remain consistent with migrations performed on the data entities. The consistent state is extended into a persistent state through management of the information encoding standards used to create platform independent representations. The ability to migrate from an old representation of an information encoding standard to a new representation leads to persistent management of derived data products.
− Authenticity – Data grids track all operations done on each data entity, to be able to guarantee that the information and knowledge content of each digital entity remains unchanged. Audit trails and digital signatures are used.

Access:
− Derived data product access – Virtual data grids provide direct access to the derived data product when it exists. This implies the ability to store information about derived data products within a management collection that can be queried. A similar management collection, or finding aid, is used to characterize the multiple data collections and contained data entities that are stored in a persistent archive.
− Data transport – Data grids provide transport mechanisms for accessing data in a distributed environment that spans multiple administration domains. This includes support for moving data and metadata in bulk, while authenticating the user.
Data grids also provide multiple roles for characterizing the allowed operations on the stored data, independently of the underlying storage systems. Users can be assigned the capabilities of a curator, with the ability to create new sub-collections, or annotator, with the ability to add comments about the digital entities, or submitter, with the ability to write data into a specified sub-collection, or public user, with the ability to read selected sub-collections.

Data grids provide the mechanisms needed to support distributed data access across heterogeneous data resources. The heterogeneous data resources represent different versions of storage systems and databases as they evolve over time. When a new infrastructure component is added to a persistent archive, both the old version and new version will be accessed simultaneously while the data and information content is migrated onto the new technology. Through use of replication, the migration can be done transparently to the users.

Persistent Archive Functionality Requirements

The requirements for a persistent archive can be expressed in general as "transparencies" that hide virtual data grid implementation details. Examples include digital entity name transparency, data location transparency, platform implementation transparency, encoding standard transparency, and authentication transparency or single sign-on systems. The capabilities of a persistent archive can be characterized as the set of "transparencies" needed to manage technology evolution. Implementations exist for at least four key functionalities or transparencies that simplify the complexity of accessing distributed heterogeneous systems:
− Name transparency – The ability to identify a desired digital entity without knowing its name can be accomplished by queries on descriptive attributes, organized as a collection.
Persistent archives are inherently archives of collections of digital entities that map from unique attribute values to a global, persistent identifier.
− Location transparency – The ability to retrieve a digital entity without knowing where it is stored can be accomplished through use of a logical name space that maps from the global, persistent identifier to a physical storage location and physical file name. If the data grid owns the digital entities (stored under the data grid user ID), the administrative attributes for storage location and file name can be self-consistently updated every time the digital entity is moved.
− Platform implementation transparency – The ability to retrieve a digital entity from arbitrary types of storage systems can be accomplished through use of a data grid that understands the protocols needed to talk to the storage systems. Every time a new type of storage system is added to the persistent archive, a new driver is added to the data grid to map from the new storage access protocol to the data grid data transport protocol.
− Encoding standard transparency – The ability to display a digital entity requires understanding the associated data model and encoding standard for information. If infrastructure independent standards are used for the data model and encoding standard (non-proprietary, published formats), a persistent archive can use transformative migrations to maintain the ability to display the digital entities. The transformative migrations will need to be defined only between the infrastructure independent standards.

The infrastructure that supports the above transparencies exists in multiple data grid implementations. A research issue is whether additional features need to be added to virtual data grids to support persistent archives. An example is the need for authenticity, which implies a data grid in which only authorized actions can take place.
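As an illustration of the location and platform implementation transparencies listed above, the following Python sketch shows a catalog that maps a persistent identifier to a storage resource and physical file name, with one driver per storage type. All names here (StorageDriver, InMemoryDriver, DataGrid, the resource names and paths) are assumptions made for the example and do not correspond to the API of any surveyed system; the in-memory driver merely stands in for a real storage access protocol.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict


class StorageDriver(ABC):
    """Hypothetical driver interface: one implementation per storage type."""

    @abstractmethod
    def read(self, physical_name: str) -> bytes: ...

    @abstractmethod
    def write(self, physical_name: str, data: bytes) -> None: ...


class InMemoryDriver(StorageDriver):
    """Stand-in for a real storage system protocol (assumption)."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def read(self, physical_name: str) -> bytes:
        return self.files[physical_name]

    def write(self, physical_name: str, data: bytes) -> None:
        self.files[physical_name] = data


@dataclass
class Location:
    resource: str        # which storage system holds the entity
    physical_name: str   # file name local to that storage system


class DataGrid:
    def __init__(self) -> None:
        self.drivers: Dict[str, StorageDriver] = {}
        self.catalog: Dict[str, Location] = {}

    def add_resource(self, resource: str, driver: StorageDriver) -> None:
        # Adding a new type of storage system means adding a new driver.
        self.drivers[resource] = driver

    def put(self, persistent_id: str, resource: str,
            physical_name: str, data: bytes) -> None:
        self.drivers[resource].write(physical_name, data)
        self.catalog[persistent_id] = Location(resource, physical_name)

    def get(self, persistent_id: str) -> bytes:
        # The caller supplies only the persistent identifier.
        loc = self.catalog[persistent_id]
        return self.drivers[loc.resource].read(loc.physical_name)

    def migrate(self, persistent_id: str, new_resource: str,
                new_physical_name: str) -> None:
        # Copy the entity, then update the administrative attributes.
        data = self.get(persistent_id)
        self.drivers[new_resource].write(new_physical_name, data)
        self.catalog[persistent_id] = Location(new_resource, new_physical_name)


grid = DataGrid()
grid.add_resource("old-archive", InMemoryDriver())
grid.add_resource("new-archive", InMemoryDriver())
grid.put("id-0001", "old-archive", "/vol1/a.dat", b"record body")
grid.migrate("id-0001", "new-archive", "/pool/7f/a.dat")
print(grid.catalog["id-0001"].resource, grid.get("id-0001"))
```

After the migration the persistent identifier is unchanged, while the catalog entry points at the new storage system; this is the property that lets old and new technology coexist during a migration.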
Every operation within the persistent archive must be tracked, and the corresponding metadata updated to guarantee consistency of the metadata. A second research issue is whether the levels of abstraction associated with virtual data grids are consistent with operations in persistent archives. One can think of a data grid as the set of abstractions that manage differences across storage repositories, information repositories, knowledge repositories, and execution systems. Data grids also provide abstraction mechanisms for interacting with the objects that are manipulated within the grid, including data entities (logical namespace), processes (service characterizations or application specifications), and interaction environments (portals). The data grid approach can be defined as a set of services, and the associated APIs and protocols used to implement the services. The data grid is augmented with portals that are used to assemble integrated work environments to support specific applications or disciplines. A major question is whether a persistent archive is better implemented as a virtual data grid, incorporating the required functionality directly into the grid, or as a portal, with the required authenticity and management control implemented as an application interface.

Data Grid Implementations

To better understand the current status of data grids, we present an analysis of the capabilities that are already provided by production systems. The Global Grid Forum is promoting the development of standards for the implementation of data grids. One of the challenges is defining the minimal set of functionalities that are needed to implement a data grid. Is it possible to define a common set of capabilities across all of the data grid, data collection, digital library, and persistent archive projects that are already in production? A second challenge is to understand the set of capabilities that may be required by future applications.
To answer both questions, a comparison has been made between the Storage Resource Broker (SRB) data grid from the San Diego Supercomputer Center, the European DataGrid (EDG) replication environment (based upon GDMP, a project in common between the European DataGrid and the Parti Data Grid, and augmented with an additional product of the European DataGrid for storing and retrieving meta-data in relational databases called Spitf components), the Scientific Data Management (SDM) data grid from Pacific Northwest Laboratory, the Globus toolkit, the Sequential Access using Metadata (SAM) data grid from Fermi National Accelerator Laboratory, the Magda data management system from Brookhaven National Laboratory, and the JASMine data grid from Jefferson National Laboratory. These systems have evolved as the result of input by user communities for the management of data across heterogeneous, distributed storage resources. The EDG, SAM, Magda, and JASMine data grids support high energy physics data. The SDM system provides a digital library interface to archived data for PNL and manages data from multiple scientific disciplines. The Globus toolkit provides services that can be composed to create a data grid. The SRB data handling system is used in projects for multiple US federal agencies, including the NASA Information Power Grid (digital library front end to archival storage), the DOE Accelerated Strategic Computing Initiative (collection-based data management), the National Library of Medicine Visible Embryo project (distributed data collection), the National Archives and Records Administration (persistent archive), the NSF National Partnership for Advanced Computational Infrastructure (distributed data collections for astronomy, earth systems science, and neuroscience), the Joint Center for Structural Genomics (data grid), and the National Institutes of Health Biomedical Informatics Research Network (data grid).
The systems we examine therefore include not only data grids, but also distributed data collections, digital libraries and persistent archives. Since the core component of each system is a data grid, we can expect common capabilities to exist across the multiple implementations. The systems that provide the largest number of features tend to have the most diverse set of user requirements. The comparison is an attempt at understanding what data grid architectures provide to meet existing application requirements. The capabilities are organized into functional categories, such that a given capability is listed only once. The categories have been chosen based on the need to manage a logical name space or replica catalog, the management of attributes in the logical name space, the storage abstraction for accessing remote storage systems, the types of data manipulation, and the data grid architecture. Since the listed data grids have been in use for multiple years, the features that have been developed represent a comprehensive cross-section of the features in actual use by production systems. The table listing the capabilities is given in Appendix A. Terms used in the comparison are explained in Appendix B.

Common Data Grid Capabilities:

What is most striking is that common data grid capabilities are emerging across all of the data grids. Appendix C lists the common features organized by functional category. Each data grid implements a logical name space that supports the construction of a uniform naming convention across multiple storage systems. The logical name space is managed independently of the physical file names used at a particular site, and a mapping is maintained between the logical file name and the physical file name. Each data grid has added attributes to the name space to support location transparency, file manipulation, and file organization.
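To make the logical name space concrete, the following Python sketch registers files under logical names, keeps the mapping to site-specific physical names, and supports discovery by attribute and by logical folder. The class and method names (LogicalNameSpace, register, find, list_folder), the example paths, and the attribute names are all hypothetical and chosen only for illustration; the surveyed grids each expose their own interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    logical_name: str                  # uniform name, independent of any site
    physical_name: str                 # site-specific name (illustrative format)
    attributes: Dict[str, str] = field(default_factory=dict)


class LogicalNameSpace:
    """Hypothetical catalog mapping logical names to physical names."""

    def __init__(self) -> None:
        self.entries: Dict[str, Entry] = {}

    def register(self, logical_name: str, physical_name: str,
                 **attributes: str) -> None:
        # Registration stores a pointer to the physical file plus attributes.
        self.entries[logical_name] = Entry(
            logical_name, physical_name, dict(attributes))

    def list_folder(self, folder: str) -> List[str]:
        # Logical folders are derived from the hierarchical logical names.
        prefix = folder.rstrip("/") + "/"
        return sorted(n for n in self.entries if n.startswith(prefix))

    def find(self, **criteria: str) -> List[str]:
        # Discovery by attribute values rather than by file name.
        return sorted(
            e.logical_name
            for e in self.entries.values()
            if all(e.attributes.get(k) == v for k, v in criteria.items())
        )


ns = LogicalNameSpace()
ns.register("/survey/sky/img001", "siteA:/tape/0093/x1",
            discipline="astronomy", format="FITS")
ns.register("/survey/sky/img002", "siteB:/disk2/u/x2",
            discipline="astronomy", format="FITS")
ns.register("/survey/brain/scan01", "siteA:/tape/0101/y1",
            discipline="neuroscience", format="DICOM")

print(ns.find(discipline="astronomy"))
print(ns.list_folder("/survey/brain"))
```

Because the folder structure lives only in the logical names, entities stored at different sites can be organized together without moving the underlying files.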
Most of the grids provide support for hierarchical logical folders within the namespace, and support for ownership of the files by a community or collection ID. The logical name space attributes typically include the replica storage location, the local file name, and user-defined attributes. Mechanisms are provided to automate the generation of attributes such as file size and creation time. The attributes are created synchronously when the file is registered into the logical name space, but many of the grids also support asynchronous registration of attributes. Most of the grids support synchronous replica creation, and provide data access through parallel I/O. The grids check transmission status and support data transport restart at the application level. Writes to the system are done synchronously, with standard error messages returned to the user. At the moment, the error messages are different across each of the data grids. The grids have statically tuned the network parameters (window size and buffer size) for transmission over wide area networks. Most of the grids provide interfaces to the GridFTP transport protocol. The most common access APIs to the data grids are a C++ I/O library, a command line interface and a Java interface. The grids are implemented as distributed client server architectures. Most of the grids support federation of the servers, enabling third party transfer. All of the grids provide access to storage systems located at remote sites including at least one archival storage system. The grids also currently use a single catalog server to manage the logical name space attributes. All of the data grids provide some form of latency management, including caching of files on disk, streaming of data, and replication of files.

Persistent Archive Components

Given a consensus on the set of capabilities provided by a data grid, it is possible to identify those capabilities that are relevant to the creation of a persistent archive.
In Appendix A, a column was included to identify the relevance of each capability towards the implementation of a persistent archive. Capabilities were identified as either necessary (marked by a "Yes"), or as a useful extension (marked by an "E"). A description of how a particular data grid capability would be used by a persistent archive is provided in Appendix D. The description is organized from the perspective of the persistent archive community into the infrastructure components needed to support data ingestion, manage the persistent archive, and support discovery and access.

Conclusion:

By comparing the implementations of seven different data grids, a common set of data management capabilities has been identified for accessing data in distributed environments. Based upon these common capabilities, a set of core capabilities has been identified as the components of a data-grid based implementation of a persistent archive. The proposed set of core and extended grid capabilities needed to implement a persistent archive is open to debate. While a common set is evident for data grids, a demonstration is still needed that these services are sufficient for building a persistent archive. Prior efforts at identifying persistent archive components have identified support for collection creation through metadata extraction, creation of logical name spaces to manage data in distributed environments, collection building from archived forms, and discovery and access mechanisms as the central components of a persistent archive. This paper provides a first appraisal of which of the core grid operations and extended grid operations are needed to implement a persistent archive.

Acknowledgements:

The data grid and toolkit characterizations were provided by the following persons. This comparison was only possible through their support.
Igor Terekhov (Fermi National Accelerator Laboratory), Torre Wenaus (Brookhaven National Laboratory), Scott Studham (Pacific Northwest Laboratory), Chip Watson (Jefferson Laboratory), Heinz Stockinger and Peter Kunszt (CERN), Ann Chervenak (Information Sciences Institute, University of Southern California), Arcot Rajasekar (San Diego Supercomputer Center).

Appendix A. A comparison of the capabilities provided by data grids

In the following table, areas where an implementation is planned for marked with P.
Areas where information has not been received are l
capabilities have been organized into eleven categories, with up to t
capabilities per category. A total of 152 capabilities are characteri
quarters of the capabilities (120) have been implemented in at least
About one-third of the capabilities (50) have been implemented in at
grids. These 50 capabilities comprise the current core features of d
An eighth column has been added to indicate whether a particular data
be part of persistent archive infrastructure. The decision is based
capability will make it feasible to manage a data collection while th
technology changes. Mechanisms that support an abstraction of storag
abstraction of information repositories are included as part of the p
capabilities. For the persistent archive column, an E indicates an
improve the ability to manage a persistent archive. A total of 80 ca
identified as essential for a persistent archive, with another 50 cap
provide an extended environment.

Capability Logical name space Pers. Globus EDG Arch.
Toolkit JASMine Magda SAM SDM SRB Yes Yes Yes Yes Yes Yes Yes Yes Logical name space independen Yes physical name space Yes P Yes Yes Yes Yes Yes P P Yes No No Yes Yes No No Yes Yes Yes No Yes No No Yes No No No Yes Yes Yes Yes Yes Yes Yes Yes No No No No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes P P P Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Data referenced by catalog ow Yes Collection ID P P No Yes Yes Yes Yes Registration of files as obje Yes name space Yes Yes Yes Yes Yes Yes Yes Registration of databases as logical name space No No No Yes No Yes Hierarchical logical folders Yes Unix node operations on logic space — directory manipulatio Yes rename, delete, create and r and folders Recursive operations on logic space directories for both st E retrieval Deletion of entities from log Yes space Delete by setting a deletion E Delete by removing attribute Yes Soft links between objects in folders so that a single file Yes in multiple folders Data referenced by catalog ow No user ID E 8 Registration of database blob No in logical name space No Registration of persistent ha Yes objects in logical name space No Capability No Pers. Globus EDG Arch. 
Toolkit No No Yes No No No JASMine Magda SAM No Yes No Yes SDM SRB Logical name space Attributes Yes Yes Yes Yes Yes Yes Yes Yes System level attributes Yes Yes Yes Yes Yes Yes Yes Yes Replica attributes for storag Yes local file name Yes Yes Yes Yes Yes Yes Yes Replica attributes for type o system No No No Yes Yes Yes Yes E Role based access control lis Yes logical name space P P P No Yes Yes Group access control lists fo Yes logical name space P P P Yes Yes Yes No No Yes Yes Yes No Yes Yes Yes Yes Yes No Yes No Yes Access control lists for logi space attributes, to control Yes add, and change metadata Extended set of user roles (a Yes collection curator, annotator Access control lists for reso No Yes I/O access pattern E No Yes No Audit trails for updates and/ Yes No P No P No No Yes No No Yes Yes Yes No Yes No Yes No Yes Yes Yes Yes No Yes Yes Yes Yes Yes User defined attributes for d No Yes No Yes Yes Yes Yes User defined attributes for c No P No No Yes Annotation attributes E User profiles to describe sto No history Discipline specific attri Yes Yes Yes Dublin Core attributes Yes Template based metadata extra Yes catalog attributes No Version attribute User level attributes Yes Yes No Yes Yes Yes No No No P Yes Yes Yes No No No No P No P No No No No External catalog accessible f attributes E No Physics tags No P 9 Yes No Yes Yes Yes Yes Yes No Yes Yes No Yes Yes Yes Capability Pers. Globus EDG Arch. 
Toolkit Attribute manipulation Yes Query interface to discover f Yes attributes Automated attribute generatio Yes time stamp User managed synchronous attr Yes update Asynchronous annotation of ob E logical name pace Export of attributes as XML f Yes python file Bulk asynchronous load of att Yes Bulk asynchronous load of att E from XML or python file Capability Yes Yes JASMine Yes P Magda SAM SDM SRB Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes No No Yes Yes Yes Yes No No Yes Yes Yes Yes No No No No Yes No Yes SDM SRB Pers. Globus EDG Arch. Toolkit JASMine Magda SAM Data manipulation Synchronous creation of repli associated metadata creation Asynchronous creation of repl Registration of user owned da replica of an existing object name space Master instances of replicas Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes E P Yes Yes Yes Yes Yes No No No No No No Yes Yes P Yes Yes Load balancing across physica E No Yes No Yes Yes Yes Containers for aggregating sm Yes Logical aggregation of object No on tape Container locking on writes Yes Updates to containers by appe Yes Replication of containers No Yes No No No No No No Yes No No Yes No Yes No No No No No No N/A No Yes No No No No No Yes Yes No No No No No Yes Container replica invalidatio Yes Staging of containers from ar Yes disk Synchronization of containers Yes No No No No No Yes No No No No No Yes No No No No No Yes Multiple disk caches for cont Yes No No No No No Yes Multiple archives for storing Yes Logical container names that Yes multiple physical containers No No No No No Yes No No No No No Yes 10 Capability Data Access Pers. Globus EDG Arch. 
Toolkit JASMine Magda SAM SDM SRB Yes Yes Yes Yes Yes Yes Yes Yes Parallel I/O support E Yes Yes Yes Yes Yes No Yes Parallel I/O on get/put comma E Yes Yes Yes Yes Yes No Yes Parallel I/O on partial file Transmission status checking level E Yes Yes No No No No No Yes Yes Yes Yes Yes Yes No Yes E Yes Yes Yes No No Yes No Transmission restart after in Yes application level Yes Yes Yes Yes Yes Yes Yes Storage completion at end of Yes Yes Yes Yes Yes Yes Yes Yes E No No No No Yes Yes Yes Yes Transmission block tagging to transmission restart after in Replication completion when w of n physical resources Standard error messages from systems, network, and data ha Yes system Striping support E Yes Yes Yes Yes Yes Yes Yes No No Yes P Thread safe client Yes Yes No Yes Yes Yes E Static network tuning Yes Dynamic network tuning (windo E buffer size) GridFtp protocol support E Yes Yes Yes No P No Yes No Yes No No P No No No Yes P Yes P No P TCP/IP custom control protoco No No No No No P No Yes Java parallel custom control No No Yes No No No No E User-selectable transfer Gridftp, (scp No Push and Pull data movement Capability Multiple Access APIs Yes E Pers. Globus EDG Arch. Toolkit Yes Yes No Yes Yes Yes JASMine Magda SAM SRB Yes Yes Yes Yes Yes Yes Yes Remote data access by I/O red from Linux or Solaris system E Yes No No No No Yes C I/O library API E Yes Yes No No Yes Yes C++ I/O library API Command line interface Java interface Yes SDM E Yes No No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes E Yes P Yes Yes Yes Yes Web interface Yes P No Yes Yes Yes No Yes Web Sevices Interface (WSDL) Yes P P Visual Basic interface No No No No No No Yes Yes XML query interface E P Yes No No P No No DLL/API interface for Python No P P Yes No Yes Predicate assertion interface E No No No No Yes SDLIP interface E No No No No P Windows browser interface E No No No Yes Yes 11 No Capability Globus Pers. EDG Toolkit Arch. 
[Appendix A capability comparison table, continued. Columns: Capability, Pers. Arch., Globus Toolkit, EDG, JASMine, Magda, SAM, SDM, SRB. The per-system cell values (Yes, No, P, E) cannot be realigned with their rows; the capabilities compared in each category are listed below.]

Distributed client-server architecture
− Federated client server
− Distributed servers
− Distributed storage systems
− 64-bit name space
− Logical resources that represent multiple physical resources
− Third party transfer from storage system to specified remote destination
− GSI authentication
− PKI authentication
− Challenge response authentication
− Ticket based access control f… accesses and restricted time access

Latency Management
− Streaming
− Caching
− Containers for data
− Containers for metadata
− Remote I/O proxies for aggregating I/O commands, remote data filtering, metadata extraction
− Remote Proxies through DataCutter
− Remote Proxies through GridFTP
− Prefetch (partial file caching)
− Staging
− Status checking
− Replication

Multiple system support
− Storage Resource Manager interface
− Database interface for reading/writing blobs
− Database interface for queries to relational database registered … object
− Database interface for exporting attributes as an XML file
− Archive interface to at least one archive
− Archive interface to HPSS
− Archive interface to DMF
− Archive interface to ADSM
− Archive interface to Enstore
− Archive interface to UniTree
− Archive interface to JASMine
− Archive interface to HPSS for storing file sizes greater than 2 GB
− Archive interface to Castor
− Single catalog server, database technology used to replicate …
− Hierarchical distributed cata…

Performance enhancements
− Performance for import/export greater than 20 files/sec
− Performance for import/export greater than 1100 files/sec
− Bulk metadata load
− Pre-spawned processes for data transfers
− Database index optimization
− Database communication tuning

Robustness: Fault tolerance and error handling
− Automatic fail over to alternate replica when the first copy is unavailable
− Automatic retrials in metadata catalogue access
− Automatic retrials in archive access
− Data transfer resumption upon system restart (crash and reboot), t… the user jobs
− Single-command resumption of very long user jobs upon system restart
− Configurable time-outs on user usages of every resource (to protect against abandoned/misbehaved user jobs)
− Handling of mass storage system software errors, including pr… compliance and other aberrations
− Handling of underlying file transfer tools (such as ftp) errors, including protocol compliance
− File checksum

Number of capabilities (prese… planned) (total of 152): Pers. Arch. 130; Globus Toolkit 98; EDG 60; JASMine 80; Magda 76; SAM 85; SDM 64; SRB 136

Appendix B. Definition of terms used in the capability comparison

Terms used in Appendix A to describe data grid capabilities are defined …

• Registration corresponds to adding objects to the logical name space … logical name and storing a pointer to the file name used on the st…
• Attributes represent information that is managed for each object t… the logical name space
• Folders are equivalent to directories in a file system, but are used … in the logical name space
• Soft links represent the cross registration of a single physical d… folders in the logical name space
• Shadow links represent pointers to objects owned by individuals. … register individual owned data into the logical name space, without … of the object on storage systems managed by the logical name space
• Replicas are copies of a file registered into the logical name space … on either the same storage system or on different file systems.
• A container is an aggregation of multiple data files into a single …
• Curation control corresponds to the administration tasks associated … managing a logical collection
• Metadata about the I/O access pattern is used to characterize inte… digital entity, recording the types of partial file reads, writes, …
• Template based metadata extraction applies a set of parsing rules … identify relevant attributes, extracts the attributes, and loads t… the logical collection.
• Load balancing for a logical name space consists of distributing d… multiple storage systems
• Synchronous updates correspond to finishing both the data manipulation … associated metadata updates before the request is completed.
• Asynchronous updates correspond to completion of a request within … system, after the return was given to a command.
• Storage completion at end of single write corresponds to synchronous …
• Dynamic network tuning consists of adjusting the network transport … parameters for each data transmission to change the number of messages … before acknowledgements are required (window size) and the size of … buffer that holds the copy of the messages until the acknowledgement …
• SDLIP is the Simple Digital Library Interoperability Protocol. It … information for the digital library community
• Federated server architecture refers to the ability of distributed … themselves without having to communicate through the initiating client …
• Third party transfer is the ability of two remote servers to move … themselves, without having to move the data back to the initiating …
• Bulk metadata load is the ability to import attribute values for m… registered within the logical name space from a single input file.
• GSI authentication is the use of the Grid Security Infrastructure … to the logical name space, and to authenticate servers to other servers … federated server architecture
• DataCutter is the data filtering service developed by Joel Saltz a… University, which is executed directly on a remote storage system.

Appendix C. Capability Summary:

A consensus on the approach towards building data grids can be gathered … which features are implemented by at least five of the seven surveyed … the eleven categories of capabilities covered by the comparison, the … represent a standard approach. The number of grids that provided a g… in parentheses, with the default value being all of the grids.
Logical name space
− Logical name space independence from physical name space
− Hierarchical logical folders (5)
− Management of attributes used for each capability (registration, deletion)
− Deletion of entities from logical name space
− Soft links between objects in logical folders (6)
− Support for collection owned data (5)
− Registration of files into logical name space

Logical name space attributes
− Replica storage location, local file name
− Group access control lists (5)
− Bulk asynchronous load of attributes (5)
− User defined attributes (5)

Attribute manipulation
− Automated size, time stamp
− Synchronous attribute update
− Asynchronous annotation (6)

Data Manipulation
− Synchronous replica creation (6)

Data Access
− Parallel I/O support (6)
− Transmission status checking (6)
− Transmission restart at application level
− Synchronous storage write
− Standard error messages
− Thread safe client (5)
− Static network tuning (5)
− GridFTP support (5)

Access APIs
− C++ I/O library API (5)
− Command line interface
− Java interface (6)
− Web service interface (5)

Architecture
− Distributed client server
− Federated server (6)
− Distributed storage system access
− Third party transfer (5)
− GSI authentication (5)

Latency Management
− Streaming (6)
− Caching
− Replication (6)
− Staging (5)

System Support
− Storage Resource Manager interface (5)
− Archive interface to at least one system
− Single catalog server (6)
− Performance for import/export of files greater than 20 files per sec (5)
− Management of file transfer errors (5)

Appendix D: Persistent Archive Components

In Appendix A, a detailed list is provided to indicate whether a particular feature should be part of persistent archive infrastructure. The decision … whether the capability will make it feasible to manage a data collection … underlying technology changes. Mechanisms that support an abstraction … systems, and an abstraction of information repositories are included … persistent archive capabilities.
A total of 80 capabilities are identified … persistent archive, with another 50 capabilities that would provide a… environment.

In this appendix, the capabilities are sorted into the components used for appraisal, accessioning, arrangement, description, preservation, and access. Capabilities that are used across multiple archival processes are listed under the first process where they are needed.

Appraisal Components:
− Selection of materials to archive – Support for a logical name space to register the materials, registration of handles (URLs) as well as files, and organization of selected material into a collection of retained material.

Capability | Persistent Archive
Logical name space | Yes
Logical name space independence from physical name space | Yes
Hierarchical logical folders | Yes
Unix node operations on logical name space: directory man… delete, create and remove files and folders | Yes
Recursive operations on logical name space directories for … | E
Deletion of entities from logical name space | Yes
Delete by setting a deletion attribute | E
Delete by removing attribute | Yes
Soft links between objects in logical folders so that a si… multiple folders | Yes
Data referenced by catalog owned by a Collection ID | Yes
Registration of files as objects in logical name space | Yes
Registration of databases as objects in logical name space | E
Registration of persistent handles as objects in logical n… | Yes

Accessioning Components:
− Controlled data import – Support for validating data models, importing data over a network, authenticating submitters, and loading data.
Capability | Persistent Archive
Data Access | Yes
Parallel I/O support | E
Parallel I/O on get/put commands | E
Parallel I/O on partial file reads/writes | E
Transmission status checking at the file level | Yes
Transmission block tagging to support transmission restart | E
Transmission restart after interruption at application level | Yes
Storage completion at end of single write | Yes
Standard error messages from storage systems, network, and … | Yes
Striping support | E
Thread safe client | E
Static network tuning | Yes
Dynamic network tuning (window and buffer size) | E
GridFTP protocol support | E
Java parallel custom control protocol | E
Push and Pull data movement | E

Capability | Persistent Archive
Attribute manipulation | Yes
Automated attribute generation for size, time stamp | Yes
User managed synchronous attribute update | Yes
Asynchronous annotation of objects in logical name space | E
Bulk asynchronous load of attributes | Yes
Bulk asynchronous load of attributes from XML file | E

Capability | Persistent Archive
Latency Management | Yes
Streaming | Yes
Caching | Yes
Containers for data | Yes
Containers for metadata | Yes
Remote I/O proxies for aggregating I/O commands, remote data filtering, metadata extraction | Yes
Remote Proxies through DataCutter | E
Remote Proxies through GridFTP | E
Staging | Yes
Status checking | Yes

Capability | Persistent Archive
Multiple system support | Yes
Storage Resource Manager interface | E
Database interface for reading/writing blobs | E
Database interface for queries to relational database registered … object | E
Database interface for exporting attributes as an XML file | Yes
Archive interface to at least one archive | Yes
Archive interface to HPSS | E
Archive interface to DMF | E
Archive interface to ADSM | E
Archive interface to UniTree | E
Archive interface to HPSS for storing file sizes greater than 2 GB | E

Capability | Persistent Archive
Distributed client-server architecture | Yes
Federated client server | Yes
Distributed servers | Yes
Distributed storage systems | Yes
64-bit name space | Yes
Logical resources that represent multiple physical resources | E
Third party transfer from storage system to specified remote destination | Yes
GSI authentication | Yes
PKI authentication | E

Capability | Persistent Archive
Performance enhancements | Yes
Performance for import/export of files greater than 20 files/sec | Yes
Performance for import/export of files greater than 1100 files/sec | Yes
Bulk metadata load | Yes
Pre-spawned processes for data transfers | E
Database index optimization | E
Database communication tuning | E

Capability | Persistent Archive
Robustness: Fault tolerance and error handling | Yes
Automatic retrials in metadata catalogue access | E
Automatic retrials in archive access | E
Data transfer resumption upon system restart (crash and reboot), t… the user jobs | E
Single-command resumption of very long user jobs upon system restart | E
Configurable time-outs on user usages of every resource (to protect against abandoned/misbehaved user jobs) | E
Handling of mass storage system software errors, including pr… compliance and other aberrations | Yes
Handling of underlying file transfer tools (such as ftp) errors, including protocol compliance | Yes
File checksum | Yes

Arrangement Components:
− Context management – The logical name space can be used to create a logical organization of the digital entities. Containers, the digital equivalent of a cardboard box, are used to physically aggregate data entities.

Capability | Persistent Archive
Data manipulation | Yes
Containers for aggregating small files | Yes
Container locking on writes | Yes
Updates to containers by appending data | Yes
Staging of containers from archive to disk | Yes
Synchronization of containers to archives | Yes
Multiple disk caches for containers | Yes
Multiple archives for storing containers | Yes
Logical container names that represent multiple physical c… | Yes

Description Components:
− Global name space – Support for extensible sets of attributes that can be assigned to a sub-collection within the collection/sub-collection hierarchy.
Capability | Persistent Archive
Logical name space Attributes | Yes
System level attributes | Yes
I/O access pattern | E
Version attribute | Yes
User level attributes | Yes
Annotation attributes | E
Discipline specific attributes | Yes
Dublin Core attributes | Yes
Template based metadata extraction for catalog attributes | Yes
External catalog accessible for additional attributes | E

Preservation Components:
− Instantiation – Support for proxies for data model transformative migrations.
− Data access interoperability – Support for location transparency
− Disaster recovery – Support for data replication
− Persistency – Support for consistent metadata generation
− Authenticity – Support for audit trails, and control of access to materials

Capability | Persistent Archive
Logical name space Attributes | Yes
System level attributes | Yes
Replica attributes for storage location, local file name | Yes
Replica attributes for type of storage system | E
Role based access control lists for data in logical name space | Yes
Group access control lists for data in logical name space | Yes
Access control lists for logical name space attributes, to… and change metadata | Yes
Extended set of user roles (administrator, collection cura… | Yes
Audit trails for updates and/or accesses | Yes

Capability | Persistent Archive
Latency Management | Yes
Replication | Yes

Capability | Persistent Archive
Data Access | Yes
Replication completion when write to k of n physical resou… | E

Capability | Persistent Archive
Data manipulation | Yes
Synchronous creation of replicas with associated metadata | Yes
Asynchronous creation of replicas | E
Master instances of replicas | Yes
Load balancing across physical resources | E
Replication of containers | Yes
Container replica invalidation of updates | Yes

Capability | Persistent Archive
Multiple system support | Yes
Single catalog server, database technology used to replica… | Yes

Access Components:
− Data transport – Support query based access to the collections, from multiple user access environments.
Support control for types of operations permitted on the persistent archive based on user roles. Support both picking of single digital entities and access to entire collections. Support display of digital entities.

Capability | Persistent Archive
Attribute manipulation | Yes
Query interface to discover files by attributes | Yes
Export of attributes as XML file | Yes

Capability | Persistent Archive
Multiple Access APIs | Yes
Remote data access by I/O redirection from Linux or Solaris … | E
C I/O library API | E
C++ I/O library API | E
Command line interface | Yes
Java interface | E
Web interface | Yes
Web Services Interface (WSDL) | Yes
XML query interface | E
Predicate assertion interface | E
SDLIP interface | E
Windows browser interface | E

Capability | Persistent Archive
Robustness: Fault tolerance and error handling | Yes
Automatic fail over to alternate replica when the first copy is unavailable | Yes
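
The fail-over capability in the final table (reading an alternate replica when the first copy is unavailable) can be illustrated with a minimal sketch. It is not taken from any of the surveyed systems: the `Replica` and `LogicalNameSpace` classes, the resource names, and the paths are all hypothetical, and the storage access is simulated in memory. The sketch only assumes what the appendices describe, namely that a logical name is registered with pointers to several physical copies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


class ReplicaUnavailable(Exception):
    """Raised when a physical copy cannot be read."""


@dataclass
class Replica:
    """Hypothetical record of one physical copy of a logical object."""
    resource: str        # name of the storage system holding the copy
    physical_path: str   # file name used on that storage system
    available: bool = True

    def read(self) -> bytes:
        # Stand-in for a real storage access; fails if the copy is offline.
        if not self.available:
            raise ReplicaUnavailable(f"{self.resource} is unavailable")
        return f"{self.resource}:{self.physical_path}".encode()


@dataclass
class LogicalNameSpace:
    """Hypothetical catalog mapping logical names to registered replicas."""
    catalog: Dict[str, List[Replica]] = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        self.catalog.setdefault(logical_name, []).append(replica)

    def get(self, logical_name: str) -> Tuple[str, bytes]:
        """Return (resource, data) from the first replica that can be read."""
        failures = []
        for replica in self.catalog.get(logical_name, []):
            try:
                return replica.resource, replica.read()
            except ReplicaUnavailable as exc:
                failures.append(str(exc))  # note the failure, try the next copy
        raise ReplicaUnavailable(
            f"no readable replica of {logical_name!r}: {failures}"
        )


ns = LogicalNameSpace()
ns.register("/archive/report.xml",
            Replica("tape-archive", "/hpss/a1/report.xml", available=False))
ns.register("/archive/report.xml",
            Replica("disk-cache", "/cache/report.xml"))
resource, data = ns.get("/archive/report.xml")
print(resource)  # the second copy is used because the first is offline
```

The caller asks only for the logical name; which physical copy answered is reported but never has to be chosen by the user, which is the location transparency the preservation components call for.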