RDA Repository Platforms for Research Data Interest Group

Use Case: Institutional Life-cycle Research Data Management
Author(s): Eric Maris

This is a use case description of the “Repository Platforms for Research Data” IG. While points 1, 2 and 3 aim at a general description/overview of the use case, point 4 is meant to list the requirements.

Please save the file using the name scheme: UseCaseName_UseCase_RepoPlat.docx

1. Scientific Motivation and Outcomes

For my institute (the Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands), we need a facility for
1. Preserving our research data.
2. Documenting the scientific process (as a means to increase the reproducibility of our scientific results).
3. Sharing the data of published studies with the scientific community.

We will realize these goals through a set of protocols that are to be used in combination with a digital repository. The best possible outcome would be a set of protocols that is fully adopted by all members of the institute, and a digital repository with an easy-to-use interface that provides all the functionality specified in the protocols.

2. Functional Description

To realize the three goals described under 1., we have defined three collection types:
1. Data acquisition collections (DACs)
2. Scientific integrity collections (SICs)
3. Data sharing collections (DSCs)

Each of these collection types has its own set of metadata, and their construction is described in the protocols.

3. Achieved Results

We have finished the protocols for the users, a test version of the system (with limited functionality) is running, and we are building a web client. We have not yet tested with naïve users.

4. Requirements

Note
I found it difficult to describe the IT requirements of our system in the form of a table. After doing so, I felt that many of the underlying ideas were lost. I therefore opted for a free text description.
Collections, Roles and System Architecture

The IT requirements are imposed by our protocols. Before listing these requirements, we introduce a few simple concepts that determine the structure of the repository.

First, the repository distinguishes between three types of collections of files:
1. Data Acquisition Collections (DACs)
2. Scientific Integrity Collections (SICs)
3. Data Sharing Collections (DSCs)
Collections are defined by their metadata, the access rights with respect to the metadata fields, and the post-processing of the metadata (see further).

Second, the repository allows for role-based access to the data (i.e., the files in the collections) and metadata. There are roles at the level of the organization (o), the organizational units (ou), and the collections (coll).
At the organization level, there are two roles:
1. anonymous_user
2. o_user
At the level of an organizational unit, there are also two roles:
1. research_administrator
2. ou_reviewer
At the level of a collection within an organizational unit, there are three roles:
1. collection manager
2. collection reviewer
3. collection viewer

Third, the architecture of the system will be of the client-server type. We will use this architecture so that the collections can be accessed through different clients. One of these clients will be a web client, which will be used to read and edit the collection metadata as well as the user profiles.

Authentication

Because a web client is not suitable for file transfer, we distinguish between authentication for the web client and authentication for a file transfer client.

Authentication for the web client

For the web client, the user authenticates against an Identity Provider (IdP). Depending on the IdP against which the user authenticates, the user has or can obtain different rights. For access to DACs and SICs, the user must authenticate using a trusted federated authentication service (SURFconext, eduGAIN).
For access to DSCs, it is sufficient if the user authenticates against one of the popular IdPs (Google, Facebook, Twitter, …).

Authentication for a file transfer client

Authentication for a file transfer client requires that users first authenticate for the web client. Via the web client, the user can obtain a one-time password with which to authenticate for the file transfer client. This authentication scheme will be implemented for WebDAV clients.

Fallback to userID-password

If the preferred authentication scheme, described in the previous paragraphs, does not make it possible to meet all functional requirements, it must be possible to fall back on traditional authentication using a userID-password pair.

Collection Definition

A collection is defined by (1) its metadata, (2) a particular role-dependent metadata access, and (3) post-processing of metadata.

The different metadata types

Free text
Alphanumeric with a maximum number of characters.

Numerical
A number with a given unit.

Controlled vocabulary
An element of a controlled vocabulary, possibly hierarchical (see, e.g., MeSH).

Role-dependent metadata access

The write access to a given metadata field is role-dependent. For example, some metadata fields can only be edited by research administrators. Also, some metadata fields can only be written by the system, and therefore no role allows for editing these fields (e.g., the system-internal collection ID).

Post-processing metadata

Some metadata fields require post-processing. For example, this holds for the field that specifies the disk quota and for the field that specifies that a frozen copy has to be made.

Collection Initiation

Collection initiation is performed by a research administrator, who does this by completing metadata fields for which, with one exception, only research administrators are authorized:
1. Authorizing collection managers and a reviewer. (Note: A collection manager can also be authorized by another collection manager.)
2. Assigning a disk quota
3. Completing administrative metadata
The first two metadata fields require post-processing. A collection is initiated with default values for some metadata fields.

Collection Building

Collection building is performed by the collection managers and contributors. It involves both metadata and data. The metadata are accessed via the web client, and the data via the file transfer client.

Editing metadata

This involves the following:
1. Authorizing collection managers, contributors and viewers. Only collection managers are authorized for this.
2. Completing research-related metadata. Both collection managers and contributors are authorized for this.

File up- and download

Both collection managers and contributors are authorized for this.

Collection Closure and Versioning

Collection closure

Collection closure involves the following steps:
1. A collection manager requests collection closure by setting the value of some metadata field. After that field is set, the collection becomes read-only (while keeping the old authorizations as information), and the collection is highlighted in the web client view of the reviewer.
2. There are two possibilities:
a. The collection reviewer approves collection closure by setting the value of some metadata field. Following approval, a frozen copy with a PID is generated.
b. The collection reviewer does not approve the collection, and the original write authorizations for this collection are reinstated.

Versioning

It is possible to make multiple frozen copies of the same collection. Via their PIDs, it is possible to reconstruct the sequence in which they were generated.

Authentication-Method-Dependent Collection Access

Access to collections depends on the authentication method.

No authentication

Only the metadata of the DSCs can be read.

Authentication against a non-trusted IdP

Only the metadata of the DSCs can be read. The authenticated user can be authorized as a viewer of a DSC.
Authentication against a trusted IdP

Authorizations for data and metadata are determined by (1) the collection-level authorizations, and (2) the user profile.

Browsing, Sorting, and Searching Collections

In the web client, the user can browse, sort and search for collections. For sorting and browsing, the user can make use of the collections’ metadata fields.

User Profile Editing

Role-dependent user profile editing

The editing of the fields of the user profile is role-dependent: some fields can be edited by the users themselves, others by a research administrator, and still others by a system administrator.

Center-level authorizations

o_user
A research administrator can edit a field in a user profile, giving that user access to the metadata of all of the collections of an organizational unit. Such a user is called an o_user. Only users that have registered via a trusted IdP can become an o_user.

ou_reviewers
A research administrator can edit a field in a user profile, giving that user read access to all of the collections of an organizational unit. Such a user is called an ou_reviewer.

Research administrators
A system administrator can edit a field in a user profile, giving that user all the rights that belong to the research administrator role.

Linking internal user accounts

A research administrator can change the field in the user profile that contains the system-internal ID associated with the user account. This allows for continuity in the authorizations in case a user changes IdP.
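The authentication-method-dependent access rules and the center-level authorizations described above can be illustrated with a minimal Python sketch. It is not the system's implementation: the class names, field names and function names are hypothetical, and two points are assumptions rather than statements from the protocols, namely that any collection-level role implies read access, and that a user authenticated against a non-trusted IdP who has been authorized as a viewer of a DSC may read that DSC's data.

```python
from dataclasses import dataclass, field
from enum import Enum


class AuthMethod(Enum):
    NONE = "none"                        # anonymous_user
    NON_TRUSTED_IDP = "non_trusted_idp"  # e.g. a popular public IdP
    TRUSTED_IDP = "trusted_idp"          # e.g. a federated service


@dataclass
class User:
    user_id: str
    auth: AuthMethod
    # Organizational units for which the profile grants center-level rights.
    o_user_of: set = field(default_factory=set)
    ou_reviewer_of: set = field(default_factory=set)


@dataclass
class Collection:
    coll_id: str
    kind: str      # "DAC", "SIC" or "DSC"
    org_unit: str
    roles: dict = field(default_factory=dict)  # user_id -> collection-level role


def can_read_metadata(user, coll):
    if coll.kind == "DSC":
        return True  # DSC metadata are readable without authentication
    if user.auth is not AuthMethod.TRUSTED_IDP:
        return False  # DACs and SICs require a trusted IdP
    return (user.user_id in coll.roles
            or coll.org_unit in user.o_user_of
            or coll.org_unit in user.ou_reviewer_of)


def can_read_data(user, coll):
    if user.auth is AuthMethod.NONE:
        return False
    if user.auth is AuthMethod.NON_TRUSTED_IDP:
        # Assumption: an authorized DSC viewer may read the DSC's files.
        return coll.kind == "DSC" and coll.roles.get(user.user_id) == "viewer"
    # Assumption: any collection-level role implies read access.
    return user.user_id in coll.roles or coll.org_unit in user.ou_reviewer_of
```

In this sketch an o_user gains access to metadata only, whereas an ou_reviewer can also read the files, which mirrors the distinction made in the two profile fields above.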
Requirement Description | Motivation from Use Case | Importance (1 - very important to 5 - not at all important)
Definition of collection types in terms of their metadata | |
Definition of a namespace in which the collections are organized according to organizational unit | |
Definition of roles in terms of their rights with respect to specific collections | |
Users interact via a web-interface (for editing metadata and authorizing users) and specialized clients for file up- and download | |
Possibility to use multiple clients to interact with the same middleware layer | |
Federated authentication | |
Data repository can be used by different organizational units that have controlled access to each other’s collections | |
Data repository can be organized such that the metadata of collections that can be shared (DSCs) are visible to the world and can be searched by web crawlers | |
Scalable to the petabyte level | |
Hardware independent (in the sense that the logical namespace does not change when all the files are migrated) | |
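As an illustration of the steps described under Collection Closure and Versioning, the following Python sketch models the closure request, the reviewer's decision, and the generation of frozen copies. It is a sketch under stated assumptions, not the repository's implementation: all names are hypothetical, the PID is a placeholder string with a version counter (a real system would obtain PIDs from a PID service), and it is assumed that the working collection becomes writable again after a frozen copy has been made, so that further versions can follow.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    OPEN = "open"
    CLOSURE_REQUESTED = "closure_requested"  # collection is read-only


@dataclass
class Collection:
    coll_id: str
    managers: set
    reviewer: str
    state: State = State.OPEN
    frozen_pids: list = field(default_factory=list)  # in order of creation

    @property
    def read_only(self):
        return self.state is State.CLOSURE_REQUESTED

    def request_closure(self, user_id):
        """Step 1: a collection manager requests closure."""
        if user_id not in self.managers:
            raise PermissionError("only a collection manager may request closure")
        if self.state is not State.OPEN:
            raise ValueError("closure has already been requested")
        self.state = State.CLOSURE_REQUESTED

    def review(self, user_id, approve):
        """Step 2: the collection reviewer approves or rejects the request."""
        if user_id != self.reviewer:
            raise PermissionError("only the collection reviewer may decide")
        if self.state is not State.CLOSURE_REQUESTED:
            raise ValueError("no closure request is pending")
        pid = None
        if approve:
            # Placeholder PID; the counter makes the sequence reconstructible.
            pid = f"{self.coll_id}/v{len(self.frozen_pids) + 1}"
            self.frozen_pids.append(pid)
        # Assumption: write authorizations are reinstated in both cases.
        self.state = State.OPEN
        return pid
```

A short usage example: after `c = Collection("dac_001", {"alice"}, "rita")`, calling `c.request_closure("alice")` makes the collection read-only, and `c.review("rita", approve=True)` returns the placeholder PID of the first frozen copy.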