RDA-ENES-use_case - Research Data Alliance


RDA Data Fabric IG (DFIG): Use Case Description Template

Data Management in the ENES data infrastructure

(http://verc.enes.org/data)

Stephan Kindermann, Tobias Weigel

(German Climate Computing Center – DKRZ, Hamburg, Germany)

1. Scientific Motivation and Outcomes (max. 0.5 pages)

Large international climate model inter-comparison projects stimulated the development of a globally distributed federation of data providers supporting common data provisioning policies and building on common technical e-infrastructure components. The European part is organized in the (IS-)ENES data federation which is an integral part of the worldwide Earth System Grid Federation (ESGF).

ESGF (http://esgf.llnl.gov) is an international open source effort to establish a robust, distributed data and computation platform, enabling worldwide access to peta- (in the future exa-) scale scientific data. Building on ESGF, ENES adds services to support the wider climate impact modelling community (see e.g. the ENES climate4impact portal, http://climate4impact.eu).

Some core data nodes hold replicas of data collections (replication use case) and data collections are updated with newer versions e.g. in case of erroneous data (versioning use case). Important data collections are additionally long term archived and assigned data citation references (DOIs).

Key outcomes of the ENES/ESGF data infrastructure are as follows:

Making available more than 2 petabytes of data to end users based on consistent metadata catalogues.

Integration of model and observational data.

Versioning and data replication support.

Policies for data producers and data centers on how to provide and manage data, including basic data quality assurance.

A large multi-disciplinary user community, from modellers to climate impact researchers to climate change consultancy agencies.

2. Functional Description (max. 1 page)

Give at least one diagram that indicates the overall structure/architecture of the data creation and consumption machinery that is being used in the lab/infrastructure. Describe in simple words the functioning of the machinery.

Data sources providing climate model data as well as climate observation data prepare their products to adhere to the ENES/ESGF data infrastructure requirements (mostly the consistency of the provided data products with project metadata rules). The data is ingested into ESGF data nodes and published using a well-defined data publication interface. Data publication in the ENES infrastructure includes the generation of metadata catalogues as well as their indexing in central search catalogues, which are provided by so-called index nodes. Different index nodes are synchronized so that they can provide consistent search APIs for end users as well as data portals. Search results can be downloaded from the worldwide distributed data nodes using a standard data access API. Data is replicated between dedicated “core” data nodes, some of which also act as long-term archives.

Replicated data shows up in the search indices, thus providing possibilities to access the nearest available replica. Versioning is handled at a data collection level – collections with changed data products are published with an updated version flag and show up in the search catalogues.
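The replica and versioning behaviour described above can be illustrated with a minimal sketch. The record fields (`collection_id`, `version`, `data_node`, `is_replica`) and the host names are hypothetical and chosen for illustration only; they are not the actual ESGF index schema. The sketch selects the newest version of a collection and then the copy on the most preferred node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """One search-index entry for a published collection (hypothetical fields)."""
    collection_id: str  # identifier shared by all versions and replicas
    version: int        # version flag, e.g. 20120101
    data_node: str      # host name of the node holding this copy
    is_replica: bool    # True if this copy was published with a replica tag


def latest_copies(records, collection_id):
    """Return all copies (original and replicas) of the newest version."""
    matching = [r for r in records if r.collection_id == collection_id]
    if not matching:
        return []
    newest = max(r.version for r in matching)
    return [r for r in matching if r.version == newest]


def pick_copy(records, collection_id, preferred_nodes):
    """Pick the newest-version copy held by the most preferred node.

    preferred_nodes is an ordered list of host names, nearest first.
    Copies on nodes that are not listed rank last.
    """
    copies = latest_copies(records, collection_id)
    if not copies:
        return None
    rank = {node: i for i, node in enumerate(preferred_nodes)}
    return min(copies, key=lambda r: rank.get(r.data_node, len(rank)))
```

In this sketch "nearest" is simply an ordered preference list supplied by the caller; a real client might derive that order from network measurements instead.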

Some ESGF nodes provide data-near compute facilities, exposed using a standard API following the OGC WPS specification. These are exploited by portals to provide derived or on-demand data products for end users (see e.g. the ENES climate4impact portal, http://climate4impact.eu).

For climate model data generated in the context of large model inter-comparison projects, the scientific configuration of model runs is collected based on the Common Information Model (CIM) and stored in a central repository. Based on a well-defined API, portal plugins can retrieve this information for display alongside data search results.

3. Describe essential Components and their Services (max. 1 page)

Describe the most essential infrastructural components of the machinery and the kind of services they offer.

These descriptions don't have to be comprehensive.

ES-DOC metadata service: provides an API and a central repository to store and retrieve CIM-based climate model descriptions. A web-based questionnaire is used for collecting information from research groups participating in the large climate model inter-comparison projects.

ESGF data nodes:

A collection of open source components packaged and developed as part of the ESGF initiative to provide basic data access functionality associated with metadata catalogues.

Data access service: Secured data access based on a Tomcat web server (HTTP, OPeNDAP) and a GridFTP server.

File catalogue service: In most cases, THREDDS metadata catalogues expose basic metadata of data files and data aggregations, including the associated data access services (HTTP, OPeNDAP, GridFTP, etc.).

ESGF index nodes:

Often deployed as part of ESGF portals, but independent components providing:

User Authentication service: OpenID-based, federation-wide authentication.

Attribute service (User Authorization): Group Role Registry. Users are authorized based on their associated group role attributes.

Search Index with API and GUI: A Solr-based search index is maintained based on the published metadata provided by the ESGF data nodes. Solr indices (shards) are shared between different sites to provide consistent search results. The GUI is structured flexibly based on search facets corresponding to indexed metadata fields. Search results can be returned as file lists to be accessed via the GUI, or as data download scripts to be executed by users at the scripting level.
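As an illustration of facet-based search, the following sketch builds a search request URL from a set of facets. The host name `index.example.org` is a placeholder, the facet names and values are illustrative, and the `/esg-search/search` path and the `format` parameter are assumptions based on the publicly documented ESGF search REST API rather than details given in this document.

```python
from urllib.parse import urlencode


def build_search_url(index_node, facets, limit=10):
    """Build a facet-based search URL for an index node.

    index_node: host name of the index node (placeholder in the example).
    facets: mapping of facet name to value, e.g. {"project": "CMIP5"}.
    The path and the "format" parameter are assumptions, see the lead-in.
    """
    params = sorted(facets.items())  # stable order for reproducible URLs
    params += [("limit", str(limit)), ("format", "application/solr+json")]
    return f"https://{index_node}/esg-search/search?{urlencode(params)}"


url = build_search_url(
    "index.example.org",
    {"project": "CMIP5", "variable": "tas"},
    limit=5,
)
```

Because every index node exposes the same facets over synchronized indices, a client written this way should only need to change the host name to query a different site.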

Data publication service:

The data publication service is used by dedicated “data publishers” to ingest data into the ENES/ESGF data infrastructure. This service extracts basic metadata from the files (where it is stored as part of the NetCDF header metadata) as well as from configurable additional sources (file names, directory structure, configuration files, etc.). A separate “metadata only” publication is possible, leaving the publisher with the responsibility to ensure data/metadata consistency.
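One of the "configurable additional sources" mentioned above is the file name. The sketch below derives facets from a file name, assuming a CMIP5-style naming pattern of six underscore-separated fields; this pattern and the facet names are assumptions for illustration and are not specified in this document. Reading the NetCDF header itself would require a NetCDF library and is not shown.

```python
import os

# Assumed field order of a CMIP5-style file name (illustrative only).
FILENAME_FACETS = (
    "variable", "table", "model", "experiment", "ensemble", "time_range",
)


def facets_from_filename(path):
    """Split a data file name into a facet dictionary.

    Raises ValueError if the file is not a .nc file or does not have
    exactly one field per expected facet.
    """
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext != ".nc":
        raise ValueError(f"not a NetCDF file name: {path!r}")
    parts = stem.split("_")
    if len(parts) != len(FILENAME_FACETS):
        raise ValueError(f"unexpected number of fields in {path!r}")
    return dict(zip(FILENAME_FACETS, parts))


facets = facets_from_filename(
    "/data/tas_Amon_MPI-ESM-LR_historical_r1i1p1_185001-200512.nc"
)
```

A publisher could compare such file-name facets with the header metadata to detect inconsistencies before publication.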

Data replication service:

A service which is triggered by data node administrators to replicate large file collections between data nodes. Replicated data is published with a replica tag in a separate step. The data replication service can also be used by end-users to maintain a local institutional cache of data originally hosted in the ENES/ESGF data federation.

ESGF processing services:

Provide data-near processing functionality encapsulated as web services based on the OGC Web Processing Service (WPS) standard. WPS provides a REST-like API to query and trigger server-side processing functionality. Of high importance are multi-model and data subsetting services, which help to address the network bandwidth bottleneck.
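The query side of WPS can be sketched with the key-value-pair (KVP) request encoding of WPS 1.0.0, which defines the operations GetCapabilities, DescribeProcess and Execute. The endpoint URL and the process identifier `subset` below are hypothetical; which processes a given node offers has to be discovered through GetCapabilities.

```python
from urllib.parse import urlencode


def wps_request_url(endpoint, request, identifier=None, version="1.0.0"):
    """Build a WPS 1.0.0 KVP request URL.

    endpoint: base URL of the WPS service (hypothetical in the example).
    request: "GetCapabilities", "DescribeProcess" or "Execute".
    identifier: process identifier, needed for DescribeProcess and Execute.
    """
    params = [("service", "WPS"), ("request", request)]
    if request != "GetCapabilities":
        # The version parameter is mandatory for DescribeProcess and Execute.
        params.append(("version", version))
    if identifier is not None:
        params.append(("identifier", identifier))
    return f"{endpoint}?{urlencode(params)}"


capabilities_url = wps_request_url(
    "https://node.example.org/wps", "GetCapabilities"
)
describe_url = wps_request_url(
    "https://node.example.org/wps", "DescribeProcess", identifier="subset"
)
```

The responses to both requests are XML documents; the DescribeProcess response lists the inputs and outputs a client must supply before triggering an Execute request.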

4. Describe optional/discipline specific Components and their Services (max. 1 page)

Describe the most optional infrastructural components of the machinery and the kind of services they offer.

These descriptions don't have to be comprehensive.

The CIM metadata components (questionnaire, API, metadata standard, associated ES-DOC tools) are specific to the climate modelling community. They characterize the technical and scientific configuration of the complex model runs producing the data.

5. Describe essentials of the underlying Data Organization (max. 1 page)

Describe the most important aspects of the underlying data organization and compare it with the model outlined by DFT.

Data objects and collections: Basic data entities are individual files. File collections are built at various levels: file-sets, often reflecting time series of files; model-run collections, grouping files together according to their membership in individual experiments; and coarse-grained DOI collections, defined according to citable data groupings.
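The collection levels described above might be modelled as nested containers. This is a minimal sketch: all class and field names are hypothetical, and the DOI string is a placeholder rather than a registered identifier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    """Basic data entity: one individual file."""
    name: str
    size_bytes: int


@dataclass
class FileSet:
    """Files that belong together, e.g. one time series."""
    name: str
    files: List[DataFile] = field(default_factory=list)


@dataclass
class ModelRunCollection:
    """File-sets grouped by their membership in one experiment."""
    experiment: str
    file_sets: List[FileSet] = field(default_factory=list)


@dataclass
class DoiCollection:
    """Coarse-grained, citable grouping of model-run collections."""
    doi: str
    runs: List[ModelRunCollection] = field(default_factory=list)

    def all_files(self):
        """Flatten the hierarchy into a list of individual files."""
        return [f for run in self.runs for fs in run.file_sets for f in fs.files]


series = FileSet("tas-monthly", [DataFile("a.nc", 100), DataFile("b.nc", 200)])
run = ModelRunCollection("historical", [series])
citable = DoiCollection("10.0000/placeholder", [run])
total_bytes = sum(f.size_bytes for f in citable.all_files())
```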

Metadata is associated with individual data objects as well as data collections. Data is uniquely described using various types of identifiers. Work on actionable persistent identifier support is under way.

Associated with data objects and collections are data services providing actual access and metadata information. Processing services expose their available functionality based on the OGC WPS specification, which allows clients to query service offerings as well as the input/output characteristics of individual processes, based on XML documents.

Documents describing the scientific usage of specific data (collections) can be published alongside the data itself. Annotations made to data after publication (e.g. errata information) are currently handled separately by services not tightly integrated into the ENES data federation. Integration work started recently and also exploits PIDs.

6. Indicate the type of APIs being used (max. 1 page)

Describe the most relevant APIs and whether they are open for being used.

Several APIs exist in the ENES/ESGF data infrastructure. Most of them are also available to third-party systems, allowing easy integration with the low-level (scripting) requirements of individual researchers as well as with higher-level services (e.g. portal services).

API name: Attribute service
Purpose of API: (Internal) API to retrieve group attributes of registered users
Parameters of API: In: OpenID of user. Out: group membership list

API name: OpenID service
Purpose of API: User OpenID provider service for user authentication
Parameters of API: Standard OpenID protocol support, YADIS discovery

API name: Search API
Purpose of API: Search for datasets based on a (flexible) set of search facets
Parameters of API: In: search facets. Out: matching data sets

API name: Download Script Generation API (part of Search API)
Purpose of API: Generation of a download script for the data described by a set of search facets
Parameters of API: In: search facets. Out: download script to be executed at client side (including security (certificate) handling)

API name: Metadata synchronisation service
Purpose of API: (Internal API) Synchronisation of Solr search indices across sites
Parameters of API: Part of the distributed Solr installation (sharding mechanism etc.)

API name: Processing Service
Purpose of API: Discovery and access of processing functionality (e.g. deployed at larger data nodes)
Parameters of API: Parameters and protocol are defined by the OGC WPS standard

7. Achieved Results (max. 0.5 pages)

Describe the results (if applicable) that have been achieved compared to the original motivation.

The ENES/ESGF infrastructure has evolved into a stable platform for providing climate simulation data together with selected observational data products. Data provided through this infrastructure was used for the IPCC report on climate change. Beyond the initial motivation of supporting the large CMIP climate inter-comparison projects, other related projects are starting to use this infrastructure, e.g. CLIP-C / Copernicus and CORDEX. It will also be the basis of the next CMIP project, starting this year.
