EUDAT AAI for a Collaborative Data Infrastructure - Challenges and Approaches
Johannes Reetz, EUDAT
VAMP workshop, Helsinki, 30 Sep 2013

The CDI concept: Collaborative Data Infrastructure
• Data Generators, Users: user-focused functionality, data capture & transfer, VREs
• Community Support Services: data discovery & navigation, workflow creation, annotation, interpretability
• Common Data Services: persistent storage, identification, authenticity, workflow execution, mining
• Diagram labels: Data Curation, Trust

Initially six research communities on board
• EPOS: European Plate Observatory System
• CLARIN: Common Language Resources and Technology Infrastructure
• ENES: Service for Climate Modelling in Europe
• LifeWatch: Biodiversity Data and Observatories
• VPH: The Virtual Physiological Human
• INCF: International Neuroinformatics
• All share common challenges:
  – Reference models and architectures
  – Persistent data identifiers
  – Metadata management
  – Distributed data sources
  – Data interoperability

Communities and Data Centers
• Identifying basic requirements
• Identify commonalities, common data services

What community users see …
• Community portal, single credential type
• Community Layer: community-specific authentication, authorization & single sign-on
• Community data

What community users see …
• EUDAT portal for non-affiliated users, many credential types
• Various community portals, different credential types
• Common metadata exploration
• Common data stage-in and stage-out services
• Data services for long-tail data, also from citizen scientists
• Common replication services with access to distributed storage
• Unified Authentication, Authorization & Single Sign-On
• Data stores: community data, data, community data

Other very useful requirements, from the analysis of the FIM doc (v0.7, L. Florio et al. 2013)
1. User friendliness (high)
2. Browser & non-browser federated access (high)
3. Multiple technologies with translators, including dynamic issue of credentials (medium) (high)
4. Bridging communities (medium) (high)
5. Implementations based on open standards and sustainable with compatible licenses (high)
6. Different Levels of Assurance with provenance (high)
7. Authorisation under community and/or facility control (high)
8. Attributes must be able to cross national borders (high)
9. Well-defined, semantically harmonised attributes (medium) (high)
10. Flexible and scalable IdP attribute release policy (medium)
EUDAT supports these requirements, but emphasizes #3, #4 and #9

EUDAT Sites
• Community centres: repositories
• General data centres: (replica) storages

Safe Replication Service
• Robust, safe and highly available data replication service for small- and medium-sized repositories
  – To guard against data loss in long-term archiving and preservation
  – To optimize access for users from different regions
  – To bring data strategically closer to systems for powerful compute-intensive analysis
• PIDs and policy rules
  – PIDs are used to keep track of location and can provide attributes
• Diagram label: EUDAT CDI Domain of registered data

Use Case: CLARIN – Safe Replication
• EPIC PID registry

Safe Replication "islands"
• INCF
• EPOS / Orpheus
• diXa
• ENES / CMIP5, IPCC-AR5
• CLARIN / Replix
• CLARIN / CUNI
• VPH / VIP
• CLARIN / CUNI
• NeuGrid
• EPOS / PP WG7
• Legend: community centres (repositories), general data centres (replica storages)

Data Staging Service
• Support researchers in transferring large data collections from EUDAT storage to HPC facilities
• Reliable, efficient, and easy-to-use tools to manage data transfers
• Provide the means to ingest PRACE computational results into the repository via the EUDAT infrastructure
• Diagram label: HPC

EUDAT Services (1)
Safe Replication Service
• Replicating Data Objects (DOs) from a Repository to Replica Storages
• Repository and Replica Storage belong to separate administrative zones
• Registration of the original DO and its replica
PID / object identifier Service
• Creates DO handles
• Manages/maintains DO handles
• Resolves DO handles
Data Staging Service
• Replication of data from the domain of registered data (Stage-Out)
• Replication of data objects into the domain of registered data (Stage-In)
• Replication of non-registered Data Objects between scratch storages

Service specific actors/actions (1)
Safe Replication Service
• Repository Data Manager replicates
• Replica Storage Manager registers DOs
• 1) (Community) users access data via the repository
• 2) Users access data via the replica storage
PID (Handle) Service
• Repository Data Manager: creates/manages the primary object handle
• Replica Storage Manager: creates/manages secondary object handles
• Users and others resolve the location of the physical storage via the handles (PIDs)
Data Staging
• Users access and fetch data from either the repository or the replica storage
• Users ingest new data into the repository

Simple Store for "long-tail" data and citizen scientists
• Allow registered users to upload "long-tail" data into the EUDAT store
• Enable sharing of objects and collections with other researchers
• Utilise other EUDAT services to provide reliability and data retention
• PIDs are assigned to uploaded DOs
• Diagram labels: Simple upload, Simple metadata, PID registration

Joint Metadata Service
• Find and define collections of scientific data, generated either by various communities or via EUDAT services (e.g. faceted search)
• Access those data collections through the references given in the metadata to the relevant data stores
• Definition of the data sets as objects for entitlement

EUDAT Services (2)
Simple Store Service
• Repository for registered data with metadata for sharing
• Digital objects are registered (handles are assigned)
• Fragmented user group: many communities and "citizen scientists" are contributing and retrieving data
EUDATbox Service
• Temporary shareable storage space for data, not necessarily registered
• User deposits data, not necessarily with metadata
• Not a homogeneous user group: many communities, "citizen scientists"
(Joint) Metadata Service
• Metadata from various repositories are harvested and collected
• Metadata exploration, faceted search: result sets define data sets for entitlement

Service specific actors/actions (2)
Simple Store (Repository)
• Users deposit data and metadata
• Users search for and access data
• Repository Storage Manager (needs to create the handle service)
EUDATbox
• Users deposit data
• Users share data by inviting other users
• Users access data
(Joint) Metadata Service
• Manager harvests metadata from (many) repositories
• Also via the replica site

• Zoned credential conversion service
• Unique user IDs, project-wise mapped to
• Attribute-based access control information
• Diagram labels: IdP A *, OpenID, Attribute Providers (AtP 1, AtP 2, AtP 3), AuthZ
* Either community-managed, or attributes provided by the user's home IdP are reused

EUDAT AAI-TF approach
ConSec: Contrail Security code

The figure shows the high-level view: SAML is used for authentication (possibly translated from OpenID, not shown); OAuth (version 2) is used for delegation (internally, within the federation); and XACML is used for access control policies. Control (in the workflow sense) flows roughly from left to right and from top to bottom.
Internally, an X.509 certificate with authorisation attributes is generated; this certificate is also managed internally and thus not usually exposed to (or accessible by) the user. Its purpose is threefold: (a) to ensure that non-HTTP services such as GridFTP and iRODS can be accessed (i.e., outside the OAuth delegation workflow), (b) to allow fine-grained authorisation, and (c) to allow command-line access to services for expert users. In OAuth, the authorisation server remains the central hub where access is delegated. However, since EUDAT needs finer-grained access control, the generated X.509 certificate also carries authorisation attributes (see below), which are checked against pre-defined access policies.

The system deployed and used by EUDAT was built by the Contrail project, so we are reusing the Contrail Security (ConSec) code and tools developed within that project. This decision was based on an evaluation of options, in which ConSec promised most of the features required by the EUDAT communities. EUDAT is currently running a ConSec authentication infrastructure for integration at FZJ. EUDAT is currently not running an authorisation infrastructure.
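
To make the idea of checking certificate attributes against pre-defined access policies more concrete, here is a minimal sketch in Python. It is not ConSec code and does not use XACML syntax: the `Policy` structure, the attribute names (`community`, `role`) and the resource paths are all hypothetical, chosen only for illustration. The attributes are assumed to have already been extracted from the internally generated certificate.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """One pre-defined access rule (hypothetical structure, not XACML)."""
    resource_prefix: str   # rule applies to resources under this prefix
    action: str            # e.g. "read" or "write"
    required: dict         # attribute name -> set of accepted values


def is_permitted(attributes, resource, action, policies):
    """Permit if any matching policy is satisfied; deny by default."""
    for policy in policies:
        if action != policy.action:
            continue
        if not resource.startswith(policy.resource_prefix):
            continue
        # Every required attribute must be present with an accepted value.
        if all(attributes.get(name) in accepted
               for name, accepted in policy.required.items()):
            return True
    return False


# Hypothetical policies: community members may read, data managers may write.
POLICIES = [
    Policy("/eudat/clarin/", "read", {"community": {"clarin"}}),
    Policy("/eudat/clarin/", "write",
           {"community": {"clarin"}, "role": {"data-manager"}}),
]

# Attributes as they might be carried in the generated certificate (assumed).
manager = {"community": "clarin", "role": "data-manager"}
outsider = {"community": "enes", "role": "data-manager"}

print(is_permitted(manager, "/eudat/clarin/corpus1", "write", POLICIES))
print(is_permitted(outsider, "/eudat/clarin/corpus1", "read", POLICIES))
```

The deny-by-default choice mirrors the usual practice for attribute-based access control: a request is granted only when some policy explicitly matches, so a missing attribute never widens access.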