RDA-climate-analytics_use_case

RDA Data Fabric IG (DFIG): Climate
(model-)data analytics Use Case
This use case focuses on the data analytics part supporting the climate science community. The
underlying distributed climate data infrastructure we focus on is outlined in a separate RDA
DFIG use case document (ENES DFIG use case: https://rd-alliance.org/enes-data-federation-usecase.html).
1. Scientific Motivation and Outcomes (max. 0.5 pages)
Provide a short summary of the scientific or technical motivation for the use case. What would be the best
possible outcome and why?
On the one hand, there is a strong requirement to provide “data-near” processing functionality at data
centres to support climate model inter-comparison and climate model evaluation experiments.
These data centres are integrated into a distributed climate data federation (see the ENES use case)
and offer online (e.g. disk-based) as well as offline (e.g. tape-system-based) data access services.
On the other hand, data-near processing is required to support e.g. the climate impact community
with derived data products, without the need to transfer large amounts of data and to generate the
products at the home institute.
This processing functionality has to be exposed using standard interfaces (APIs) and needs to be
discoverable (self-describing interfaces as well as service registries are necessary). In a first step
the processing services are called separately, and service chaining etc. is done by the end user (or a
portal). Later, better support for automatic service chaining and for workflow orchestration is
required. One specific type of service in these chains/workflows is provided by
“conversion/adaptation” components, which transform input data of a specific format/content into
the format/content a specific processing component expects.
A major outcome is enabling the generation of reproducible derived data products for the
climate science community. The initial main focus is on data products derived from
climate model output data (observational data is also involved in the context of climate model data
evaluation). The current practice of downloading files to home institutes and generating derived
data products there (which are then exchanged between researchers and ultimately used in
publications) can only be avoided by provisioning data analytics services (and components) that
keep clear provenance information based on unique identifiers of the data as well as of the
processing steps involved.
2. Functional Description (max. 1 page)
Give at least one diagram that indicates the overall structure/architecture of the data creation and
consumption machinery that is being used in the lab/infrastructure. Describe in simple words the functioning of
the machinery.
In Figure 1 a basic overview of the overall architecture is given (the components are described
“bottom-up”):
 Users want access to derived data products:
o Either they use the programmatic interfaces (APIs) directly to invoke the processing
services providing these derived data products,
o Or they use a portal (e.g. the Climate4Impact portal) to invoke predefined processing
workflows.
[Figure 1: Overall architecture of the data analytics infrastructure]
 To discover and locate appropriate services, users can query a service catalog in which the
services with their APIs (service URL, input parameters, output parameters, service
description) are registered. Service offerings include processing services as well as data
discovery services for the “near” data centre. Data upload services can also be included,
although for larger amounts of data the “data-ingest” process for the processing is often
handled separately. Uploaded data as well as generated data is registered in a data catalog.
 Climate data centres as well as generic resource providers (e.g. EUDAT data centres) expose
processing functionality via standardized, self-describing interfaces (the APIs are
standardized based on the OGC WPS standard).
 The processing functionality is based on community software components, mainly managed
by an open-source developer community. These components are normally packaged for
standard deployment at data centres using different mechanisms (e.g. Python pip packages,
conda/binstar packages, Docker images, ...).
 The data centres are connected by one (or multiple) backbone mechanisms to handle e.g.
large data ingest, data replication, etc. These backbone mechanisms comprise technical
components as well as certain regulations, policies, etc. established by large community
efforts (like ESGF) or infrastructure efforts (like EUDAT). These backbone mechanisms
often also include overall data catalogs.
3. Describe essential Components and their Services (max. 1 page)
Describe the most essential infrastructural components of the machinery and the kind of services they offer.
These descriptions don't have to be comprehensive.
Data Centre and Resource Providers:
o Offer data-near processing resources with (fast) access to large climate data collections
o Are integrated into data federations offering overall data catalogs as well as homogenized
data access services (via HTTP, OPeNDAP or GridFTP)
Processing Component and API:
o Specific processing components as well as processing workflows based on a composition of
multiple components are exposed via a standard self-describing API (based on the OGC WPS
specification).
o The (internal) composition mechanism (e.g. the workflow machinery used) is hidden from
the outside world and is the choice of the individual data centre.
Service Catalog:
o A catalog based on standard services to register the processing service offerings of the data
centres. No standard community-wide catalog is available yet; currently there are multiple
(local and fragmented) solutions. One community effort uses an OGC CSW conforming
implementation.
Data Catalog:
o Multiple approaches are likewise currently in development/use. One important approach is
to use SOLR/Lucene indices fed by multiple sources; another also exploits an OGC CSW
catalog implementation.
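The catalog interactions described above can be sketched as a CSW request. The following is a minimal, illustrative example of building a KVP (GET) “GetRecords” request against an OGC CSW 2.0.2 endpoint; the endpoint URL is hypothetical:

```python
from urllib.parse import urlencode

def csw_get_records_url(endpoint, max_records=10):
    """Build a KVP GetRecords request for an OGC CSW 2.0.2 catalog."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "resultType": "results",
        "elementSetName": "summary",
        "maxRecords": max_records,
    }
    return endpoint + "?" + urlencode(params)

# Hypothetical catalog endpoint, for illustration only
url = csw_get_records_url("https://catalog.example.org/csw")
```

The response is an XML record set; a real client (or a portal) would parse it to obtain service URLs and metadata records.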
4. Describe optional/discipline specific Components and their Services
(max. 1 page)
Describe the most essential infrastructural components of the machinery and the kind of services they offer.
These descriptions don't have to be comprehensive.
Community portal(s):
o Normally these do not interact with the service registry but use statically configured service
endpoints to invoke the processing functionality.
o Derived data products are also normally only registered/remembered locally at the portal.
Community processing services:
Two types of processing functionality are currently available or in development:
o Processing services to support the climate impact community (e.g. the ones
implemented in the Climate4Impact portal)
o Processing services to support the model inter-comparison community, handling
basic and often-needed multi-model analyses (statistics) involving large amounts of
model data at the model data centre. Activities in this context are coordinated in the
ESGF Compute Working Team (ESGF CWT).
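To illustrate the kind of basic multi-model statistics meant here, the following sketch computes a point-wise ensemble mean over aligned time series from several models. It is a deliberately simplified stand-in for what a data-near processing service would run on real netCDF data; the model names and values are invented:

```python
def ensemble_mean(model_series):
    """Point-wise mean over time series from several models.

    model_series: dict mapping model name -> list of values,
    all series assumed aligned on the same time axis.
    """
    series = list(model_series.values())
    n = len(series)
    return [sum(vals) / n for vals in zip(*series)]

# Invented near-surface temperature series for two models
tas = {
    "model-a": [14.0, 14.5],
    "model-b": [14.5, 15.5],
}
ensemble_mean(tas)  # -> [14.25, 15.0]
```

A production service would of course operate on gridded fields and handle regridding, masking and metadata, but the service interface (inputs in, derived product out) follows the same pattern.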
5. Describe essentials of the underlying Data Organization (max. 1 page)
Describe the most important aspects of the underlying data organization and compare it with the model
outlined by DFT.
The data is organized in collections of different granularity (time series, experiment, etc.). Collections
are defined based on controlled vocabularies of coordinated large data production experiments (like
CMIP, CORDEX, ...). While at the lowest level data is most often stored as files (e.g. in the hierarchical,
self-describing data format netCDF), the logical level used to interact with the data is often higher
and decoupled from the “file/directory” view of data storage.
The (binary) data format contains basic metadata including unique identifiers. Some datasets already
include PIDs, and there are concrete plans to exploit PIDs for upcoming large climate data collections.
A set of PITs (PID Types) is defined to allow managing aspects like e.g. collections, distributed
versioning and replication.
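A typed PID record of this kind might be shaped as follows. This is a hypothetical sketch: the attribute names, the PIT name and the handle values are illustrative and not the actual PIT definitions agreed in the community:

```python
# Hypothetical typed PID record; field names and values are illustrative,
# not the community-agreed PIT definitions.
pid_record = {
    "pid": "hdl:21.14100/abc-123",          # the identifier itself (example handle)
    "pit": "climate-data-collection",        # the PID Type this record conforms to
    "members": [                             # collection aspect
        "hdl:21.14100/file-1",
        "hdl:21.14100/file-2",
    ],
    "version": "2",                          # distributed versioning aspect
    "predecessor": "hdl:21.14100/abc-122",   # previous version of the collection
    "replicas": [                            # replication aspect
        "https://site-a.example/data/abc-123",
        "https://site-b.example/data/abc-123",
    ],
}
```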
These “well organized” data entities are often combined, for evaluation purposes, with data uploaded
by users or with data for which only less structured metadata is available (from various sources).
This data is the input to the processing components. Outputs are collections containing e.g.
output data, documentary elements like images, as well as provenance records describing the
inputs and the processing steps involved. Output data is also registered in a catalog.
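A provenance record of this kind could look roughly as follows. This is an illustrative, non-standardized shape; the field names, process names and handles are assumptions made for the example:

```python
# Illustrative (not standardized) provenance record attached to a derived
# data product; all field names, process names and handles are invented.
provenance = {
    "inputs": [                      # PIDs of the input data entities
        "hdl:21.14100/input-1",
        "hdl:21.14100/input-2",
    ],
    "steps": [                       # ordered processing steps applied
        {"process": "subset", "parameters": {"region": "EUR-11"}},
        {"process": "ensemble_mean", "parameters": {}},
    ],
    "outputs": [                     # PIDs of products registered in the data catalog
        "hdl:21.14100/derived-1",
    ],
}
```

Keeping such a record alongside the output collection is what makes the derived product reproducible: the inputs and steps can be resolved and re-executed.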
6. Indicate the type of APIs being used (max. 1 page)
Describe the most relevant APIs and whether they are open for being used.
All APIs are open for use, with the restriction that the processing APIs are currently only open
within restricted contexts or hidden behind portals, as no generic AAI solution is in place to invoke
processing services at the sites. The processing is in general very data intensive, hence the required
restrictions with respect to user groups and/or specific access rights.
The processing API follows the OGC WPS specification, which defines operations for service offering
discovery (“GetCapabilities”), specific service discovery (“DescribeProcess”), as well as synchronous
or asynchronous execution (“Execute”); see http://www.opengeospatial.org/standards/wps.
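The three WPS operations can be sketched as KVP (GET) requests. In this minimal example the endpoint URL, process identifier and data input are hypothetical; parameter casing and encoding details follow the WPS 1.0.0 KVP conventions, but a real deployment should be checked against its capabilities document:

```python
from urllib.parse import urlencode

def wps_url(endpoint, request, **extra):
    """Build a WPS 1.0.0 KVP (GET) request URL."""
    params = {"service": "WPS", "version": "1.0.0", "request": request}
    params.update(extra)
    return endpoint + "?" + urlencode(params)

# Hypothetical compute endpoint and process identifier
endpoint = "https://compute.example.org/wps"
caps = wps_url(endpoint, "GetCapabilities")
desc = wps_url(endpoint, "DescribeProcess", identifier="ensemble_mean")
run = wps_url(endpoint, "Execute", identifier="ensemble_mean",
              datainputs="dataset=hdl:21.14100/abc-123")
```

For asynchronous execution the server returns a status document whose URL the client polls until the outputs are available.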
Two types of search APIs are mainly used: the ESGF search API (defined on top of a Solr/Lucene REST
API) and the OGC CSW catalog API, which offers OGC standard-conforming discovery operations for
data and services (“DescribeRecord”, “GetRecords”) as well as push and pull operations to register
metadata records (“Harvest” and “Transaction”).
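An ESGF search API call is a plain HTTP GET with facet constraints as query parameters. The sketch below assumes facet names such as “project” and “variable” in the style of the ESGF search conventions; the node URL is an example, and facet names should be verified against the target index node:

```python
from urllib.parse import urlencode

def esgf_search_url(node, **facets):
    """Build an ESGF-style faceted search query (facet names assumed)."""
    params = {"format": "application/solr+json", "limit": 10}
    params.update(facets)
    return node + "/esg-search/search?" + urlencode(params)

# Hypothetical index node; facet values chosen for illustration
url = esgf_search_url("https://esgf-node.example.org",
                      project="CMIP5", variable="tas", time_frequency="mon")
```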
Processing and catalog/search APIs follow basic REST web service principles and can be invoked
using HTTP “GET” as well as “POST” operations.
7. Achieved Results (max. 0.5 pages)
Describe the results (if applicable) that have been achieved compared to the original motivation.
The exploitation of PIDs with PITs, alongside a PIT registry, will strongly help in the future to make
data as well as services available in trans-community data analysis activities. The current approach is
to implement prototypes with specific agreements on PITs as well as on service descriptions, and to
feed this back into the RDA PIT registry discussions. The same is true for the support of provenance
information: first, specific provenance records are generated to support current analysis activities;
later, standardization and integration, e.g. via PID relations, is planned.