
RDA Repository Platforms for Research Data Interest Group
Use Case: iRODS Data Grids
Author(s): Reagan Moore
1. Scientific Motivation and Outcomes
There is a collection life cycle associated with research data. The stages include:
• Collection building, typically done by a project team. Each team member has detailed knowledge of the context associated with the collection.
• Collection sharing, typically between researchers at multiple institutions. A context is needed for the collection to describe data formats, metadata, and access controls.
• Publication in a digital library. The context for the collection must be broadened to include provenance information, discovery mechanisms (persistent DOIs), and usage mechanisms.
• Analysis in a processing pipeline. The context needs to include data parsing mechanisms, workflow registration, and workflow provenance.
• Preservation in an archive. The context needs to include sufficient information for a future researcher to interpret the meaning and utility of the data.
Each stage corresponds to a broader user community, and requires additional information for
effective use of the collection. The broader context represents a set of assertions about the
collection that are made by the collection developers. The assertions are enforced by
management policies that govern the organization of the collection.
These collection life cycle stages are addressed by the iRODS data grid through evolution of the
management policies. New management policies are implemented as the usage pattern for the
collection evolves.
A standard outcome is the separation of discipline-specific requirements from generic data
management requirements. Each discipline using a data grid can apply its preferred access
mechanism, define its preferred metadata schema, select its preferred data formats, apply its
preferred access controls, and enforce its preferred management policies.
2. Functional Description
The iRODS Data Grid is a peer-to-peer server framework that organizes distributed data into a
shareable collection. Middleware is installed at each location where data will be stored. The
middleware traps operations at policy enforcement points, selects a policy from a local rule
base, and controls the operation based upon the rule. This enables a highly flexible system that
can be used for research collaborations, digital libraries, processing pipelines, distribution
networks, and archives.
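Policies are normally expressed as rules held in the server-side rule base. The sketch below shows what one such policy might look like under the iRODS Python rule engine plugin, which lets a policy enforcement point be implemented as a Python function. The function name follows that plugin's convention for the post-put enforcement point; the `session_vars` helper and the exact microservice argument list are assumptions that should be checked against the deployed iRODS release.

```python
# Minimal sketch of a server-side policy, assuming the iRODS Python rule
# engine plugin (rules typically live in /etc/irods/core.py on the server).
import session_vars  # helper shipped with the Python rule engine plugin (assumed available)

def acPostProcForPut(rule_args, callback, rei):
    """Policy enforcement point fired after a successful put operation."""
    # Recover the logical path of the data object that was just written.
    obj_path = session_vars.get_map(rei)['data_object']['object_path']

    # Record the event in the server log (a simple audit-style action).
    callback.writeLine('serverLog', 'policy: new data object {0}'.format(obj_path))

    # Tag the object so later policies (integrity checks, retention,
    # reporting) can locate material that has not yet been reviewed.
    callback.msiModAVUMetadata('-d', obj_path, 'add', 'ingest_status', 'unreviewed', '')
```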
The iRODS data grid manages state information (provenance, description, system, audit) in a
database, maps from operations to the protocol of each storage device, encapsulates
operations in micro-services, and supports federation with existing data management systems.
The system is pluggable, enabling the dynamic incorporation of new storage drivers, new micro-services, new databases, new transport protocols, and new authentication mechanisms.
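As an illustration of how a client sees such a collection, the sketch below uses the python-irodsclient package (one of several available client interfaces) to create a collection, register a file into it, and attach descriptive metadata that the data grid records in its catalog database. The host name, zone, credentials, and paths are placeholders, not values taken from any deployment described here.

```python
# Minimal client-side sketch using python-irodsclient (pip install python-irodsclient).
# Connection details and paths below are illustrative placeholders.
from irods.session import iRODSSession

with iRODSSession(host='irods.example.org', port=1247,
                  user='alice', password='secret', zone='exampleZone') as session:
    coll_path = '/exampleZone/home/alice/project_x'
    session.collections.create(coll_path)   # logical collection, independent of storage location

    # Upload a local file; the grid decides which storage resource holds the bytes.
    session.data_objects.put('results.csv', coll_path + '/results.csv')

    # Attach descriptive metadata as attribute/value/unit triples in the catalog.
    obj = session.data_objects.get(coll_path + '/results.csv')
    obj.metadata.add('experiment', 'run-42')
    obj.metadata.add('instrument', 'sensor-7')
```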
3. Achieved Results
The iRODS data grid is used widely to build institutional repositories, regional data grids,
national digital libraries, national data grids, national archives, and international collaborations.
A partial list is given below, organized by discipline:
• Archive: Carolina Digital Repository, Taiwan National Archive
• Atmospheric science: NASA Langley Atmospheric Sciences Center
• Biology: Phylogenetics at CC IN2P3
• Climate: NOAA National Climatic Data Center
• Cognitive science: Temporal Dynamics of Learning Center
• Computer science: GENI experimental network
• Cosmic ray physics: AMS experiment on the International Space Station
• Digital libraries: French National Library, Texas Digital Libraries, SILS
• Earth science: NASA Center for Climate Simulations
• Ecology: CEED Caveat Emptor Ecological Data
• Engineering: CIBER-U
• Genomics: Wellcome Trust Sanger Institute, RENCI
• High energy physics: BaBar / Stanford Linear Accelerator
• Hydrology: Institute for the Environment, UNC-CH; Hydroshare
• Medicine: Lineberger Cancer Institute
• Neuroscience: International Neuroinformatics Coordinating Facility
• Neutrino physics: T2K and dChooz neutrino experiments
• Optical astronomy: National Optical Astronomy Observatory
• Plant genetics: the iPlant Collaborative
• Radio astronomy: Cyber Square Kilometer Array, TREND, BAOradio
• Social science: Odum, TerraPop
The systems range in size from a few thousand files and gigabytes of data to a hundred million
files and petabytes of data. The scale ranges from a single institution repository to an
international collaboration.
4. Requirements
Describe the requirements, their motivation from your use case and how you rate their
importance. The descriptions don't have to be comprehensive.
| Requirement | Description | Motivation from Use Case | Importance (1 - very important to 5 - not at all important) |
|---|---|---|---|
| Single sign-on | Store data at any location without an account | Collaborations across institutions | 1 |
| Collection virtualization | Manage properties of the collection independently of the storage system | Migration onto new technology | 1 |
| Logical naming | Support file names and organization in collections independently of storage resource naming | Distributed systems | 1 |
| Access controls | Authenticate every access and authorize every operation | Proprietary and confidential data | 1 |
| Storage drivers | Map from access protocol to storage protocol | Heterogeneous storage systems | 1 |
| Metadata | Provenance, description, administrative information | Support metadata that are unique to a file or a collection | 1 |
| Policy enforcement points | Control all operations with administrator-defined rules | Widely varying policies across applications | 1 |
| Micro-services | Encapsulate operations in basic functions that can be chained into a workflow | Data manipulation, report generation, integrity checks, messaging, retention, disposition, … | 1 |
| Audit trails | Maintain a log of all events / operations | Validation of assertions over time | 1 |
| Federation | Enable interoperation with other existing data management systems | Legacy systems | 1 |
| Workflows | Register workflows as executable objects, track provenance of each workflow execution | Research analysis tasks | 1 |
| Policy sets | Policies for preservation, sharing, proprietary data, assessment, or report generation | Discipline requirements | 1 |
| Integrity mechanisms | Replication, checksums | Central requirement of any data management system | 1 |
| Clients | WebDAV, web browser, Cyberduck, FUSE, Java I/O, Python, shell commands | Different clients required by different groups | 1 |
| Parallel transport | Use of multiple I/O streams for data transfer | Move a terabyte file | 1 |
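To make a few of these requirements concrete, the sketch below issues a catalog query through python-irodsclient that discovers data objects by user-defined metadata and reports their registered checksums, touching the logical naming, metadata, and integrity rows above. As before, the package choice, connection details, and the metadata attribute are assumptions made for illustration; any of the listed client interfaces could issue an equivalent query.

```python
# Sketch of a metadata-driven catalog query with python-irodsclient.
# Model and column names follow that package; verify against the installed version.
from irods.session import iRODSSession
from irods.models import Collection, DataObject, DataObjectMeta
from irods.column import Criterion

with iRODSSession(host='irods.example.org', port=1247,
                  user='alice', password='secret', zone='exampleZone') as session:
    # Find every data object tagged experiment = run-42, wherever it is stored,
    # and report the checksum recorded in the catalog for integrity checking.
    query = (session.query(Collection.name, DataObject.name, DataObject.checksum)
             .filter(Criterion('=', DataObjectMeta.name, 'experiment'))
             .filter(Criterion('=', DataObjectMeta.value, 'run-42')))

    for row in query:
        print('{0}/{1}  checksum={2}'.format(
            row[Collection.name], row[DataObject.name], row[DataObject.checksum]))
```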