This is a use case description of the “Repository Platforms for Research Data” IG. While points
1, 2 and 3 aim at a general description/overview of the use case, point 4 is meant to list the requirements.
Please save the file using the naming scheme: UseCaseName_UseCase_RepoPlat.docx
1. Scientific Motivation and Outcomes
To provide climate data storage and management services to the national climate community, the German Climate Computing Centre (DKRZ) hosts a large repository platform. This platform is closely connected to repository systems internationally and is integrated into e-science platforms to support international climate research activities (see also the Data Fabric use case https://rd-alliance.org/enes-data-federation-use-case.html ). The repository system supports data storage and data distribution in large international research projects, long term archival, and data citation, and thus supports the full climate data life cycle.
Different types of use cases have to be supported:
A) A researcher (or project PI) wants to archive results. The focus is on long term storage, and access is limited to a small group of people, so no complex metadata is provided.
B) Researchers want to make data accessible to a larger community (e.g. for international climate data intercomparison projects). This includes support for data storage and data distribution as well as the management of associated metadata based on agreed metadata standards and conventions.
C) Researchers (and the community as a whole) want data and associated metadata to be archived long term and served by an agency with long term funding. Different discovery metadata formats should be provided to make the data visible in various catalogs. The World Data Centre for Climate hosted at DKRZ provides this type of service.
D) Researchers want to make data citable and referable in scientific publications.
Use cases B), C) and D) are often chained together: first, data is made accessible in (e.g. international) collaborations; then important parts of this data (and associated metadata) are archived after some quality assurance; and finally data citation references are generated for important data collections.
2. Functional Description
If possible, give one diagram/picture that indicates the overall structure/architecture you have/envision. Describe in simple words the functioning of the use case/system.
The overall repository platform architecture is illustrated in the following figure:
Large model run data collections as well as data collections coming from smaller projects are ingested into a “data collaboration space”. Depending on the type of data project, different policies apply, defining e.g. the required structure of data and the required content of metadata.
Parts of this collaboration space are integrated into international e-science infrastructures via data and metadata servers (e.g. ESGF and IS-ENES). Making data discoverable and accessible in international collaborations is called “project publication” in the above figure.
After a quality assurance process, the data and metadata are transferred into the long term archival system. For well-defined collections a DataCite DOI is assigned, which can be used to cite the data in publications.
The long term archival centre is integrated on one side into e-science infrastructures (like ESGF) and on the other into international archival federations like the ICSU world data centres. For this purpose, standards-conforming metadata interfaces are provided (e.g. supporting DIF and ISO 19115).
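The life cycle described above (ingest into the collaboration space, project publication, quality assurance, long term archival, DOI assignment) can be sketched as a small state model. This is only an illustrative sketch in Python, not the DKRZ implementation: the class and method names, the strict stage ordering for the chained case, and the example DOI string are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Hypothetical stage names for the chained use case B) -> C) -> D)."""
    INGESTED = 1           # held in the data collaboration space
    PROJECT_PUBLISHED = 2  # discoverable via federation data/metadata servers
    QUALITY_ASSURED = 3    # passed the quality assurance process
    ARCHIVED = 4           # transferred to the long term archive


@dataclass
class Collection:
    name: str
    stage: Stage = Stage.INGESTED
    doi: Optional[str] = None

    def advance(self) -> None:
        """Move to the next stage (assumption: stages are not skipped)."""
        if self.stage is Stage.ARCHIVED:
            raise ValueError("collection is already archived")
        self.stage = Stage(self.stage.value + 1)

    def assign_doi(self, doi: str) -> None:
        """Assign a DOI; only allowed after long term archival."""
        if self.stage is not Stage.ARCHIVED:
            raise ValueError("a DOI requires an archived collection")
        self.doi = doi


# Usage: walk one collection through the chain, then assign a DOI.
c = Collection("example-model-run")
for _ in range(3):
    c.advance()
c.assign_doi("10.0000/hypothetical-example")  # placeholder, not a real DOI
```

The guard in `assign_doi` mirrors the point made in the text: the citable identifier only appears at the very end of the chain.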
3. Achieved Results
Describe results (if applicable) that have been achieved compared to the original motivation.
What requirements could not be fulfilled and how did this influence the outcomes?
The assignment of persistent identifiers (PIDs) to data items and data collections happens quite late in the data life cycle and at a coarse granularity (DOI assignment after quality assurance and long term archival). During the evaluation phase of data, e.g. in international intercomparison projects, no persistent and resolvable identifiers are available that could be exploited for various use cases (data sharing, replication, etc.). PIDs are also normally not available for project data at the data ingest point. Current activities therefore concentrate on the early assignment of PIDs to research data at DKRZ. All data related curation activities will then be based on, or integrated with, PID management activities.
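As a rough illustration of what early PID assignment could look like, the following Python sketch mints an identifier at ingest time and keeps a small record that later curation steps update. Everything here is hypothetical: the `PidRecord` fields, the `PREFIX` placeholder, the use of a UUID suffix, and the state names are assumptions for illustration and do not describe an actual PID service or its API.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass
class PidRecord:
    """Hypothetical PID metadata kept in sync with the data holding."""
    pid: str
    checksum: str
    location: str
    state: str = "ingested"


def register_at_ingest(data: bytes, location: str, prefix: str = "PREFIX") -> PidRecord:
    """Mint a PID when data enters the platform (the prefix is a placeholder)."""
    pid = f"{prefix}/{uuid.uuid4()}"
    checksum = hashlib.sha256(data).hexdigest()
    return PidRecord(pid=pid, checksum=checksum, location=location)


def update_record(record: PidRecord, state: str, location: str) -> None:
    """Curation steps update the record so it reflects the current holding."""
    record.state = state
    record.location = location


# Usage: the PID stays stable while state and location change.
record = register_at_ingest(b"example bytes", "collab-space/example.nc")
original_pid = record.pid
update_record(record, state="archived", location="archive/example.nc")
```

The identifier is fixed at ingest, so it can be referenced during the evaluation phase, while the record itself follows the data through later curation steps.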
4. Requirements
Describe the requirements, their motivation from your use case and how you rate their importance. The descriptions don't have to be comprehensive.
| Requirement | Description | Motivation from Use Case | Importance (1 - very important to 5 - not at all important) |
|---|---|---|---|
| Support for different data ingest policies | Data ingest is related to metadata-data association. Different kinds of metadata need to be supported (from “ad-hoc” to highly standardized) | Support for different data sources for the repo platform. | 1 |
| Assignment of PID | At data ingest time or “project publication” time a PID has to be assigned to data and the collections it belongs to | This makes data persistently referenceable during the data evaluation phase | 1 |
| Assignment of DOIs | A clear transition for PID collections to a DOI has to be established. This has technical as well as non-technical aspects | DOIs should be based on PID (collections) if available. | 1 |
| PID management | All data management activities should be integrated with PID management. PID metadata has to always be in sync with the data/metadata holding | PID metadata should reflect actual state of data holding | 1 |
| Integration of PID management with international partner institutes | PIDs and PID management should be integrated as a core component of climate e-science infrastructures | Consistent PID handling across institutions is necessary for collaborative use cases | 1 |
| Data repository near data processing has to be provided | The large data volumes require data near processing (and data reduction) facilities. These should provide and maintain provenance information | The “download and process at home” approach to data analysis is bad practice for a lot of reasons and hinders data reproducibility and data sharing. | 2 |