Federation Use Case
Reagan Moore
DataNet Federation Consortium

The integrated Rule-Oriented Data System (iRODS) is used to build research repositories, data grids for sharing data, digital libraries for publishing data, archives for data preservation, and processing pipelines for analyzing data. The system is used by more than 25 science and engineering domains in academic, national, and international collaborations. The communities using the software include:

Domain                 Project
Atmospheric science    NASA Langley Atmospheric Sciences Center
Biology                Phylogenetics at CC IN2P3
Climate                NOAA National Climatic Data Center
Cognitive Science      Temporal Dynamics of Learning Center
Computer Science       GENI experimental network
Cosmic Ray             AMS experiment on the International Space Station
Earth Science          NASA Center for Climate Simulations
Ecology                CEED Caveat Emptor Ecological Data
Engineering            CIBER-U
High Energy Physics    BaBar / Stanford Linear Accelerator
Hydrology              Institute for the Environment, UNC-CH; Hydroshare
Genomics               Broad Institute, Wellcome Trust Sanger Institute, NGS
Neuroscience           International Neuroinformatics Coordinating Facility
Neutrino Physics       T2K and dChooz neutrino experiments
Optical Astronomy      National Optical Astronomy Observatory
Plant genetics         the iPlant Collaborative
Radio Astronomy        Cyber Square Kilometer Array, TREND, BAOradio
Social Science         Odum, TerraPop

The collective requirements from these communities can be organized into categories. All of these capabilities are provided by the iRODS data grid.

Distribution
• Organize data from multiple institutions into a shareable collection.
• Manage access controls on the distributed data.
• Support access to data in multiple types of storage systems (file systems, tape archives, object stores, web sites, databases, etc.).
• Manage distributed state information.

Virtualization
• Manage the name space for users across institutions. This is equivalent to a single-sign-on environment and implements a global name space for users.
• Manage the name space for files across storage systems.
• Manage the name space for collections (organize data into folders/directories independently of storage location and storage type).
• Manage the name space for storage systems (enable formation of groups of storage systems).
• Manage the name space for metadata (use schema indirection, storing the name of the attribute, the attribute value, and an attribute comment field).
• Manage the name space for policies (enable policies as computer-actionable rules).
• Manage the name space for basic operations, called micro-services (enable composition of workflows by chaining micro-services).

Interoperability
• Enable interaction with existing systems.
  o Authentication – InCommon, GSI, Kerberos, Shibboleth, LDAP, PAM
  o Data access – DataONE, Data Conservancy, Hydroshare, NCDC, CUAHSI
  o Data manipulation – NetCDF, HDF, THREDDS, ERDDAP, FITS, DICOM
  o Workflows – Kepler, Taverna, NCSA Cyberintegrator, Docker
  o Networks – TCP/IP, RBUDP, Parallel I/O, HTTPS
  o Clients – Web browsers, web services, workflows, FUSE, Cyberduck, WebDAV, MediaWiki, Fedora/Modeshape, Dataverse/Modeshape
  o Messaging – AMQP, STOMP
  o Vocabulary – HIVE
  o Rule engine – iRODS rule language, Python rules, C rules

Data types
• Manipulate netCDF files (parse, subset, modify)
• Manipulate HDF files (parse, subset)
• Manipulate FITS files (extract metadata)
• Manipulate DICOM files
• Manipulate image files (extract EXIF metadata)

Metadata
• Pattern-based parsing of metadata from text
• Bulk load from pipe-delimited files (structured metadata input)
• Bulk load from XML files
• Index metadata in Elasticsearch or Solr
• Index text in Elasticsearch or Solr
• Support metadata on collections, files, users, storage systems, policies, and micro-services
• Ontologies – HIVE for reserved vocabularies

Management policies
• Enforce policies at policy-enforcement points
• Enforce periodic policies
• Enforce interactive policies
• Automate administrative functions
• Validate assessment criteria
• Example policies include:
  o Integrity – file replication, checksum creation
  o Authenticity – load provenance metadata
  o Chain of custody – track storage location, manage access controls
  o Original arrangement – track source
  o Ingest – manage staging area, synchronization, bulk load
  o Retention – manage time to live
  o Disposition – manage end of life, archiving, e-mail
  o Description – automated metadata generation
  o Grouping – management of containers of files
  o Processing – automated workflow execution

Federation
• Tightly coupled federation – shared name spaces between the independent repositories
• Loosely coupled federation – interaction with a remote system for query, list, get, put
• Asynchronous federation – interaction through a message bus to decouple interactions between the systems
• Policy-based federation – policies to control access from external systems
• Tickets – controlled interactions with anonymous users

Scalability
• Individual digital library (1,000 files, one user)
• Collaboration environment (100 million files, 20,000 users, multiple PBs)
• Preservation environment (100 million files)
• Processing pipeline (initiate computation workflows on HPC systems)
• Parallel I/O – integration with Software Defined Networks

Architecture
• Peer-to-peer servers
• Pluggable – dynamic addition of storage drivers, database drivers, authentication systems, network protocols, policies, micro-services, rule engines
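
Two of the requirements above, a logical name space for files that is independent of storage location and metadata stored as attribute name, attribute value, and comment, can be illustrated with a short Python sketch. This is a toy in-memory model written for illustration only; it is not the iRODS catalog or any iRODS API. The class names Replica and Catalog, the four-field record layout (logical_path|attribute|value|comment), the storage-system names, and the example paths are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    storage: str         # hypothetical storage-system name, e.g. "tape-archive"
    physical_path: str   # where the bytes live inside that storage system


@dataclass
class Catalog:
    """Toy in-memory catalog; a stand-in, not the iRODS catalog schema."""
    replicas: dict = field(default_factory=lambda: defaultdict(list))
    metadata: dict = field(default_factory=lambda: defaultdict(list))

    def register(self, logical_path, storage, physical_path):
        """Map a logical name to one more physical copy."""
        self.replicas[logical_path].append(Replica(storage, physical_path))

    def bulk_load(self, lines):
        """Load 'logical_path|attribute|value|comment' records (assumed layout)."""
        loaded = 0
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            parts = line.split("|")
            if len(parts) != 4:
                raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
            logical_path, attribute, value, comment = (p.strip() for p in parts)
            # Schema indirection: the attribute name is data, not a fixed column.
            self.metadata[logical_path].append((attribute, value, comment))
            loaded += 1
        return loaded

    def find(self, attribute, value):
        """Return the logical paths that carry the given attribute/value pair."""
        return sorted(
            path
            for path, triples in self.metadata.items()
            if any(a == attribute and v == value for a, v, _ in triples)
        )


catalog = Catalog()
# One logical file, two physical copies on different (hypothetical) storage systems.
catalog.register("/zone/home/project/run1.nc", "disk-unc", "/vault/a/run1.nc")
catalog.register("/zone/home/project/run1.nc", "tape-archive", "/hsm/0042/run1.nc")

loaded = catalog.bulk_load([
    "# logical_path|attribute|value|comment",
    "/zone/home/project/run1.nc|format|netCDF|set at ingest",
    "/zone/home/project/run2.nc|format|netCDF|set at ingest",
    "/zone/home/project/notes.txt|format|text|",
])
print(loaded)
print(catalog.find("format", "netCDF"))
print([r.storage for r in catalog.replicas["/zone/home/project/run1.nc"]])
```

Because queries go through the logical path and the metadata triples, a file can gain or lose physical copies without changing the name users see. Keeping the attribute name as a stored value, rather than as a fixed column, is what allows each community to define its own metadata vocabulary.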