RDA Repository Platforms for Research Data Interest Group

Use Case: iRODS Data Grids
Author(s): Reagan Moore

1. Scientific Motivation and Outcomes

There is a collection life cycle associated with research data. The stages include:

- Collection building, typically done by a project team. Each team member has detailed knowledge of the context associated with the collection.
- Collection sharing, typically between researchers at multiple institutions. A context is needed for the collection that describes data formats, metadata, and access controls.
- Publication in a digital library. The context for the collection must be broadened to include provenance information, discovery mechanisms (persistent DOIs), and usage mechanisms.
- Analysis in a processing pipeline. The context needs to include data parsing mechanisms, workflow registration, and workflow provenance.
- Preservation in an archive. The context needs to include sufficient information for a future researcher to interpret the meaning and utility of the data.

Each stage corresponds to a broader user community and requires additional information for effective use of the collection. The broader context represents a set of assertions about the collection made by the collection developers. The assertions are enforced by management policies that govern the organization of the collection.

These collection life cycle stages are addressed by the iRODS data grid through evolution of the management policies: new policies are implemented as the usage pattern for the collection evolves. A standard outcome is the separation of discipline-specific requirements from generic data management requirements. Each discipline using a data grid can apply its preferred access mechanism, define its preferred metadata schema, select its preferred data formats, apply its preferred access controls, and enforce its preferred management policies.

2. Functional Description

The iRODS data grid is a peer-to-peer server framework that organizes distributed data into a shareable collection. Middleware is installed at each location where data will be stored. The middleware traps operations at policy enforcement points, selects a policy from a local rule base, and controls the operation based upon the rule. This enables a highly flexible system that can be used for research collaborations, digital libraries, processing pipelines, distribution networks, and archives.

The iRODS data grid manages state information (provenance, description, system, audit) in a database, maps from operations to the protocol of each storage device, encapsulates operations in micro-services, and supports federation with existing data management systems. The system is pluggable, enabling the dynamic incorporation of new storage drivers, micro-services, databases, transport protocols, and authentication mechanisms.
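To make the policy enforcement model concrete, the sketch below models in plain Python how a server-side enforcement point might consult a rule base and run administrator-defined rules after a file is stored. This is an illustrative model only, not iRODS source code and not the iRODS rule language; every name in it (RuleBase, post_put, replicate_and_log, the example path and resource) is hypothetical.

    # Conceptual sketch only -- not iRODS source code or the iRODS rule language.
    # It models the idea that an operation is trapped at a policy enforcement
    # point, matching rules are selected from a local rule base, and each rule
    # chains micro-service-like steps. All names below are hypothetical.
    from typing import Callable, Dict, List

    Rule = Callable[[dict], None]

    class RuleBase:
        """Maps a policy enforcement point name to an ordered list of rules."""

        def __init__(self) -> None:
            self._rules: Dict[str, List[Rule]] = {}

        def register(self, pep_name: str, rule: Rule) -> None:
            self._rules.setdefault(pep_name, []).append(rule)

        def enforce(self, pep_name: str, context: dict) -> None:
            # The server traps the operation here and applies each matching rule.
            for rule in self._rules.get(pep_name, []):
                rule(context)

    def replicate_and_log(context: dict) -> None:
        # One rule chaining two micro-service-like steps: replicate, then audit.
        print(f"replicate {context['logical_path']} to {context['backup_resource']}")
        print(f"audit: put by {context['user']} recorded")

    rule_base = RuleBase()
    rule_base.register("post_put", replicate_and_log)

    # After a client stores a file, the server fires the post-put enforcement point.
    rule_base.enforce("post_put", {
        "logical_path": "/zone/home/alice/survey/run042.dat",
        "backup_resource": "tapeResource",
        "user": "alice",
    })

Under this model, changing the collection's management policy amounts to registering different rules at the same enforcement points, which is how the life cycle stages in Section 1 can be accommodated without changing the clients.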
3. Achieved Results

The iRODS data grid is widely used to build institutional repositories, regional data grids, national digital libraries, national data grids, national archives, and international collaborations. A partial list is given below, organized by discipline:

- Archive: Carolina Digital Repository, Taiwan National Archive
- Atmospheric science: NASA Langley Atmospheric Sciences Center
- Biology: Phylogenetics at CC IN2P3
- Climate: NOAA National Climatic Data Center
- Cognitive Science: Temporal Dynamics of Learning Center
- Computer Science: GENI experimental network
- Cosmic Ray: AMS experiment on the International Space Station
- Digital Library: French National Library, Texas Digital Libraries, SILS
- Earth Science: NASA Center for Climate Simulations
- Ecology: CEED Caveat Emptor Ecological Data
- Engineering: CIBER-U
- Genomics: Wellcome Trust Sanger Institute, RENCI
- High Energy Physics: BaBar / Stanford Linear Accelerator
- Hydrology: Institute for the Environment, UNC-CH; Hydroshare
- Medicine: Lineberger Cancer Institute
- Neuroscience: International Neuroinformatics Coordinating Facility
- Neutrino Physics: T2K and dChooz neutrino experiments
- Optical Astronomy: National Optical Astronomy Observatory
- Plant genetics: the iPlant Collaborative
- Radio Astronomy: Cyber Square Kilometer Array, TREND, BAOradio
- Social Science: Odum, TerraPop

The systems range in size from a few thousand files and gigabytes of data to a hundred million files and petabytes of data. The scale ranges from a single-institution repository to an international collaboration.

4. Requirements

Describe the requirements, their motivation from your use case, and how you rate their importance (1 = very important to 5 = not at all important). The descriptions do not have to be comprehensive.

- Single sign-on: Store data at any location without an account. Motivation: collaborations across institutions. Importance: 1.
- Collection virtualization: Manage properties of the collection independently of the storage system. Motivation: migration onto new technology. Importance: 1.
- Logical naming: Support file names and organization in collections independently of storage resource naming. Motivation: distributed systems. Importance: 1.
- Access controls: Authenticate every access and authorize every operation. Motivation: proprietary and confidential data. Importance: 1.
- Storage drivers: Map from the access protocol to the storage protocol. Motivation: heterogeneous storage systems. Importance: 1.
- Metadata: Provenance, descriptive, and administrative information. Motivation: support metadata that are unique to a file or a collection. Importance: 1.
- Policy enforcement points: Control all operations with administrator-defined rules. Motivation: widely varying policies across applications. Importance: 1.
- Micro-services: Encapsulate operations in basic functions that can be chained into a workflow. Motivation: data manipulation, report generation, integrity checks, messaging, retention, disposition, … Importance: 1.
- Audit trails: Maintain a log of all events and operations. Motivation: validation of assertions over time. Importance: 1.
- Federation: Enable interoperation with other existing data management systems. Motivation: legacy systems. Importance: 1.
- Workflows: Register workflows as executable objects; track the provenance of each workflow execution. Motivation: research analysis tasks. Importance: 1.
- Policy sets: Policies for preservation, sharing, proprietary data, assessment, or report generation. Motivation: discipline requirements. Importance: 1.
- Integrity mechanisms: Replication, checksums. Motivation: central requirement of any data management system. Importance: 1.
- Clients: WebDAV, web browser, Cyberduck, FUSE, Java I/O, Python, shell commands (the Python client is illustrated in the sketch after this list). Motivation: different clients are required by different groups. Importance: 1.
- Parallel transport: Use of multiple I/O streams for data transfer. Motivation: move a terabyte file. Importance: 1.
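As an illustration of the Clients, Logical naming, and Metadata requirements, the following sketch uses the python-irodsclient package to store a file under a logical path and attach descriptive metadata. The host, zone, account, password, paths, and attribute names are placeholder assumptions; adapt them to an actual deployment.

    # Sketch using the python-irodsclient package (pip install python-irodsclient).
    # Host, zone, user, password, paths, and metadata values are placeholders.
    from irods.session import iRODSSession

    with iRODSSession(host="irods.example.org", port=1247,
                      user="alice", password="secret", zone="exampleZone") as session:
        # Logical naming: the object lives at a zone-wide logical path,
        # independent of which storage resource actually holds the bytes.
        logical_path = "/exampleZone/home/alice/survey/run042.dat"
        session.data_objects.put("run042.dat", logical_path)

        # Metadata: attach attribute-value pairs describing this data object.
        obj = session.data_objects.get(logical_path)
        obj.metadata.add("instrument", "lidar-3")
        obj.metadata.add("campaign", "2014-06")

The same operations are available through the other listed clients, since every access path passes through the same server-side policy enforcement points.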