Federation Use Case
Reagan Moore
DataNet Federation Consortium
The integrated Rule-Oriented Data System (iRODS) is used to build research repositories,
data grids for sharing data, digital libraries for publishing data, archives for data
preservation, and processing pipelines for analyzing data. The system is used by
more than 25 science and engineering domains in academic, national, and
international collaborations. The communities using the software include:
• Atmospheric science: NASA Langley Atmospheric Sciences Center
• Biology: Phylogenetics at CC IN2P3
• Climate: NOAA National Climatic Data Center
• Cognitive Science: Temporal Dynamics of Learning Center
• Computer Science: GENI experimental network
• Cosmic Ray: AMS experiment on the International Space Station
• Earth Science: NASA Center for Climate Simulations
• Ecology: CEED Caveat Emptor Ecological Data
• Engineering: CIBER-U
• High Energy Physics: BaBar / Stanford Linear Accelerator
• Hydrology: Institute for the Environment, UNC-CH; Hydroshare
• Genomics: Broad Institute, Wellcome Trust Sanger Institute, NGS
• Neuroscience: International Neuroinformatics Coordinating Facility
• Neutrino Physics: T2K and dChooz neutrino experiments
• Optical Astronomy: National Optical Astronomy Observatory
• Plant genetics: the iPlant Collaborative
• Radio Astronomy: Cyber Square Kilometer Array, TREND, BAOradio
• Social Science: Odum, TerraPop
The collective requirements from these communities can be organized into the
categories below. All of these capabilities are provided by the iRODS data grid.
Distribution
• Organize data from multiple institutions into a shareable collection.
• Manage access controls on the distributed data
• Support access to data in multiple types of storage systems (file systems, tape
  archives, object stores, web sites, databases, etc.)
• Manage distributed state information
Virtualization
• Manage the name space for users across institutions. This is equivalent to a
  single-sign-on environment and implements a global name space for users.
• Manage the name space for files across storage systems
• Manage the name space for collections (organize data into
  folders/directories independently of storage location and storage type)
• Manage the name space for storage systems (enable formation of groups of
  storage systems)
• Manage the name space for metadata (use schema indirection, storing the
  name of the attribute, the attribute value, and an attribute comment field)
• Manage the name space for policies (enable policies as computer actionable
  rules)
• Manage the name space for basic operations – micro-services (enable
  composition of workflows by chaining micro-services)
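The name-space indirection listed above can be illustrated with a minimal sketch. It is not the iRODS catalog schema: the `Catalog` and `Replica` classes, the resource names, and the paths are all hypothetical, chosen only to show one logical file name resolving to several physical copies, and metadata stored as (attribute, value, comment) triples.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    """One physical copy of a file (hypothetical structure)."""
    resource: str       # logical name of a storage system
    physical_path: str  # location inside that storage system


@dataclass
class Catalog:
    """Toy catalog that maps logical names to physical state."""
    replicas: dict = field(default_factory=dict)  # logical path -> [Replica]
    metadata: dict = field(default_factory=dict)  # logical path -> [(attr, value, comment)]

    def register(self, logical_path, resource, physical_path):
        self.replicas.setdefault(logical_path, []).append(
            Replica(resource, physical_path))

    def add_metadata(self, logical_path, attribute, value, comment=""):
        # Schema indirection: the attribute name is stored as data,
        # so a new attribute needs no change to the catalog layout.
        self.metadata.setdefault(logical_path, []).append(
            (attribute, value, comment))

    def resolve(self, logical_path):
        """Return every physical copy behind one logical name."""
        return list(self.replicas.get(logical_path, []))


catalog = Catalog()
catalog.register("/zoneA/home/alice/run1.nc", "diskResc", "/vault/a1/run1.nc")
catalog.register("/zoneA/home/alice/run1.nc", "tapeResc", "/archive/0007/run1.nc")
catalog.add_metadata("/zoneA/home/alice/run1.nc", "instrument", "lidar",
                     "hypothetical value")

for replica in catalog.resolve("/zoneA/home/alice/run1.nc"):
    print(replica.resource, replica.physical_path)
```

Users only ever see the logical path; where the bytes live, and on what kind of storage, stays a property of the catalog.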
Interoperability
• Enable interaction with existing systems.
  o Authentication – InCommon, GSI, Kerberos, Shibboleth, LDAP, PAM
  o Data access – DataONE, Data Conservancy, Hydroshare, NCDC, CUAHSI
  o Data manipulation – NetCDF, HDF, THREDDS, ERDDAP, FITS, DICOM
  o Workflows – Kepler, Taverna, NCSA Cyberintegrator, Docker
  o Networks – TCP/IP, RBUDP, Parallel I/O, HTTPS
  o Clients – Web browsers, web services, workflows, FUSE, Cyberduck,
    WebDAV, MediaWiki, Fedora/ModeShape, Dataverse/ModeShape
  o Messaging – AMQP, STOMP
  o Vocabulary – HIVE
  o Rule engine – iRODS rule language, Python rules, C rules
Data types
• Manipulate NetCDF files (parse, subset, modify)
• Manipulate HDF files (parse, subset)
• Manipulate FITS files (extract metadata)
• Manipulate DICOM files
• Manipulate image files (extract EXIF metadata)
Metadata
• Pattern-based parsing of metadata from text
• Bulk load from pipe-delimited files (structured metadata input)
• Bulk load from XML files
• Index metadata in Elasticsearch or Solr
• Index text in Elasticsearch or Solr
• Support metadata on collections, files, users, storage systems, policies, micro-services
• Ontologies – HIVE for reserved vocabularies
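As a rough illustration of the first two items in this list, the sketch below extracts metadata from free text with a pattern and loads structured metadata from pipe-delimited lines. Both input layouts are assumptions made for the example: the `Key: value` header pattern and the `path|attribute|value|comment` record layout are hypothetical and are not claimed to match the formats that iRODS tools accept.

```python
import re

# Hypothetical pattern: "Key: value" header lines inside a text file.
HEADER_PATTERN = re.compile(r"^(?P<attribute>[A-Za-z][\w ]*):\s*(?P<value>.+)$")


def parse_text_metadata(text):
    """Pattern-based parsing: pull (attribute, value) pairs out of free text."""
    pairs = []
    for line in text.splitlines():
        match = HEADER_PATTERN.match(line.strip())
        if match:
            pairs.append((match.group("attribute").strip(),
                          match.group("value").strip()))
    return pairs


def parse_pipe_delimited(lines):
    """Structured input: one 'path|attribute|value|comment' record per line."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = [part.strip() for part in line.split("|")]
        if len(fields) == 3:
            fields.append("")  # the comment field is optional in this sketch
        if len(fields) != 4:
            raise ValueError(
                f"line {number}: expected 3 or 4 fields, got {len(fields)}")
        records.append(tuple(fields))
    return records


sample_text = "Title: Ozone profile\nsome body text\nStation: Mauna Loa"
print(parse_text_metadata(sample_text))

bulk = ["/zoneA/data/o3.nc|station|Mauna Loa|site name",
        "/zoneA/data/o3.nc|year|2014"]
print(parse_pipe_delimited(bulk))
```

Rejecting malformed records with the offending line number keeps a bulk load from silently attaching partial metadata.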
Management policies
• Enforce policies at policy-enforcement points
• Enforce periodic policies
• Enforce interactive policies
• Automate administrative functions
• Validate assessment criteria
• Example policies include:
  o Integrity – file replication, checksum creation
  o Authenticity – load provenance metadata
  o Chain of custody – track storage location, manage access controls
  o Original arrangement – track source
  o Ingest – manage staging area, synchronization, bulk load
  o Retention – manage time to live
  o Disposition – manage end of life, archiving, e-mail
  o Description – automated metadata generation
  o Grouping – management of containers of files
  o Processing – automated workflow execution
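To make the idea of enforcing policies at policy-enforcement points concrete, here is a minimal sketch in which rules are registered against a named point and run in order when that point fires. It does not use the iRODS rule engine or its rule language: the registry, the `post_put` point name, and the context dictionary are hypothetical stand-ins. The two rules correspond loosely to the integrity and authenticity examples listed above.

```python
import hashlib
from collections import defaultdict

# Hypothetical registry: enforcement-point name -> list of rule functions.
RULES = defaultdict(list)


def rule(enforcement_point):
    """Decorator that attaches a rule to a named policy-enforcement point."""
    def decorator(func):
        RULES[enforcement_point].append(func)
        return func
    return decorator


def trigger(enforcement_point, context):
    """Run every rule registered for this point, in registration order."""
    for func in RULES[enforcement_point]:
        func(context)
    return context


@rule("post_put")
def integrity_checksum(context):
    # Integrity policy: record a checksum as soon as a file is ingested.
    context["checksum"] = hashlib.sha256(context["data"]).hexdigest()


@rule("post_put")
def authenticity_provenance(context):
    # Authenticity policy: attach provenance metadata supplied at ingest.
    context.setdefault("metadata", []).append(("source", context["source"], ""))


result = trigger("post_put", {"data": b"example bytes", "source": "instrument-7"})
print(result["checksum"][:12], result["metadata"])
```

Because each rule is a small function working on a shared context, several can be chained at one enforcement point, which is the same composition idea the text describes for micro-services.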
Federation
• Tightly coupled federation – shared name spaces between the independent
  repositories
• Loosely coupled federation – interaction with remote system for query, list,
  get, put
• Asynchronous federation – interaction through a message bus to decouple
  interactions between the systems
• Policy-based federation – policies to control access from external systems
• Tickets – controlled interactions with anonymous users
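The asynchronous case can be sketched with an in-process queue standing in for the message bus. This is only an illustration of the decoupling: a real deployment would use an external broker, and the message layout, the `remote_repository` consumer, and the paths are hypothetical.

```python
import queue
import threading

# Stand-in for a message bus; a real deployment would use an external broker.
bus = queue.Queue()
remote_catalog = {}  # hypothetical state held by the remote repository


def remote_repository():
    """Consume requests at the remote system's own pace."""
    while True:
        message = bus.get()
        if message is None:  # sentinel: shut down the consumer
            bus.task_done()
            break
        if message["op"] == "put":
            remote_catalog[message["path"]] = message["data"]
        bus.task_done()


worker = threading.Thread(target=remote_repository)
worker.start()

# The local repository publishes and returns immediately; it never
# calls the remote system directly.
bus.put({"op": "put", "path": "/zoneB/shared/a.txt", "data": b"alpha"})
bus.put({"op": "put", "path": "/zoneB/shared/b.txt", "data": b"beta"})
bus.put(None)

bus.join()     # wait until every message has been processed
worker.join()
print(sorted(remote_catalog))
```

The sender depends only on the message format, so either repository can be offline, upgraded, or replaced without the other needing to change.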
Scalability
• Individual digital library (1000 files, one user)
• Collaboration environment (100 million files, 20,000 users, multiple PBs)
• Preservation environment (100 million files)
• Processing pipeline (initiate computation workflows on HPC systems)
• Parallel I/O – integration with Software Defined Networks
Architecture
• Peer-to-peer servers
• Pluggable – dynamic addition of storage drivers, database drivers,
  authentication systems, network protocols, policies, micro-services, rule
  engines
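A pluggable design of this kind is often built around a registry that maps a driver name to an implementation of a common interface. The sketch below shows that pattern for storage drivers only. It is an assumption about how such a mechanism could look, not a description of the iRODS plugin interface; `StorageDriver`, `register_driver`, and `MemoryDriver` are hypothetical names.

```python
from abc import ABC, abstractmethod

STORAGE_DRIVERS = {}  # hypothetical registry: driver name -> driver class


class StorageDriver(ABC):
    """Common interface that every storage driver must implement."""

    @abstractmethod
    def write(self, path, data):
        ...

    @abstractmethod
    def read(self, path):
        ...


def register_driver(name):
    """Class decorator that makes a driver available under a name."""
    def decorator(cls):
        STORAGE_DRIVERS[name] = cls
        return cls
    return decorator


@register_driver("memory")
class MemoryDriver(StorageDriver):
    """In-memory stand-in for a file system, tape archive, or object store."""

    def __init__(self):
        self._blobs = {}

    def write(self, path, data):
        self._blobs[path] = data

    def read(self, path):
        return self._blobs[path]


def open_storage(name):
    """Instantiate a driver by name, failing clearly if none is registered."""
    try:
        return STORAGE_DRIVERS[name]()
    except KeyError:
        raise ValueError(f"no storage driver registered as {name!r}") from None


store = open_storage("memory")
store.write("/vault/x", b"payload")
print(store.read("/vault/x"))
```

Adding a new storage type then means registering one more class; callers that use `open_storage` are unaffected.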