RDA Data Fabric IG (DFIG): Climate (model-)data analytics Use Case

This use case focuses on the data analytics part of supporting the climate science community. The underlying distributed climate data infrastructure we build on is outlined in a separate RDA DFIG use case document (ENES DFIG use case: https://rd-alliance.org/enes-data-federation-usecase.html).

1. Scientific Motivation and Outcomes (max. 0.5 pages)
Provide a short summary of the scientific or technical motivation for the use case. What would be the best possible outcome and why?

On the one hand, there is a strong requirement to provide “data-near” processing functionality at data centres to support climate model inter-comparison and climate model evaluation experiments. These data centres are integrated in a distributed climate data federation (see the ENES use case) and offer online (e.g. disk-based) as well as offline (e.g. tape-based) data access services. On the other hand, data-near processing is required to support e.g. the climate impact community with derived data products, without the need to transfer large amounts of data and generate the products at the home institute.

This processing functionality has to be exposed using standard interfaces (APIs) and needs to be discoverable (self-describing interfaces as well as service registries are necessary). In a first step, the processing services are called separately, and service chaining is done by the end user (or a portal). Later, better support for automatic service chaining and for workflow orchestration is required. One specific type of service in these chains/workflows is provided by “conversion/adaptation” components, which transform input data of a specific format/content into the format/content a specific processing component expects.

A major outcome is enabling the generation of reproducible derived data products for the climate science community. The initial main focus is on data products derived from climate model output data (observational data is also involved in the context of climate model data evaluation). The current practice of downloading files to home institutes and generating derived data products there (which are then exchanged between researchers and ultimately used in publications) can only be avoided by provisioning data analytics services (and components) that keep clear provenance information based on unique identifiers of the data as well as of the processing steps involved.

2. Functional Description (max. 1 page)
Give at least one diagram that indicates the overall structure/architecture of the data creation and consumption machinery that is being used in the lab/infrastructure. Describe in simple words the functioning of the machinery.

Figure 1 gives a basic overview of the overall architecture (the components are described “bottom up”):

- Users want access to derived data products:
  o Either they use the programmatic interfaces (APIs) directly to invoke the processing services providing these derived data products (a minimal client sketch is given below),
  o or they use a portal (e.g. the Climate4Impact portal) to invoke predefined processing workflows.

To discover and locate appropriate services, users can query a service catalog, where the services with their APIs (service URL, input parameters, output parameters, service description) are registered. Service offerings include processing services as well as data discovery services for the “near” data centre.
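To make the “programmatic interfaces” path above more concrete, the following Python sketch invokes a data-near processing service through its self-describing WPS interface using the OWSLib client library. The endpoint URL, the process identifier “subset_region” and its input values are hypothetical placeholders; real identifiers and parameters are discovered via the GetCapabilities and DescribeProcess operations described in Section 6.

```python
# Minimal client sketch (assumptions: OWSLib is installed; the endpoint URL,
# the process identifier "subset_region" and its inputs are hypothetical).
from owslib.wps import WebProcessingService

WPS_URL = "https://example-datacentre.org/wps"  # hypothetical data-near endpoint

wps = WebProcessingService(WPS_URL)

# Service offering discovery ("GetCapabilities"): list the offered processes
wps.getcapabilities()
for process in wps.processes:
    print(process.identifier, "-", process.title)

# Specific service discovery ("DescribeProcess"): inspect inputs of one process
description = wps.describeprocess("subset_region")
for wps_input in description.dataInputs:
    print("input:", wps_input.identifier, wps_input.dataType)

# Execution ("Execute"): invoke the process with literal inputs and poll until done
execution = wps.execute(
    "subset_region",
    inputs=[
        ("dataset", "tas_Amon_EXAMPLE-MODEL_historical_r1i1p1"),  # hypothetical id
        ("bbox", "-10,35,30,60"),
    ],
)
while not execution.isComplete():
    execution.checkStatus(sleepSecs=10)

print("status:", execution.status)
for output in execution.processOutputs:
    print("output:", output.identifier, output.reference or output.data)
```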
Data upload services can also be included, although for larger amounts of data the “data ingest” for the processing is often handled separately. Uploaded data as well as generated data is registered in a data catalog.

Climate data centres as well as generic resource providers (e.g. EUDAT data centres) expose processing functionality via standardized, self-describing interfaces (the APIs are standardized based on the OGC WPS standard). The processing functionality is based on community software components, mainly managed by an open source developer community. These components are normally packaged for standard deployment at data centres, using different mechanisms (e.g. Python pip packages, conda/binstar packages, Docker images, ...).

The data centres are connected by one or several backbone mechanisms to handle e.g. large data ingest, data replication etc. These backbone mechanisms comprise technical components as well as regulations and policies established by large community efforts (like ESGF) or infrastructure efforts (like EUDAT). They often also include overall data catalogs.

3. Describe essential Components and their Services (max. 1 page)
Describe the most essential infrastructural components of the machinery and the kind of services they offer. These descriptions don't have to be comprehensive.

- Data Centres and Resource Providers:
  o Offer data-near processing resources with (fast) access to large climate data collections.
  o Are integrated in data federations offering overall data catalogs as well as homogenized data access services (via HTTP, OPeNDAP or GridFTP).
- Processing Component and API:
  o Specific processing components as well as processing workflows based on a composition of multiple components are exposed via a standard, self-describing API (based on the OGC WPS specification).
  o The (internal) composition mechanism (e.g. the workflow machinery used) is hidden from the outside world and is the choice of the individual data centre.
- Service Catalog:
  o A catalog based on standard services to register the processing service offerings of the data centres. No standard community-wide catalog is available as of now; currently there are multiple (local and fragmented) solutions. One community effort uses an OGC CSW conforming implementation.
- Data Catalog:
  o Multiple approaches are currently in development or in use. One important approach is to use SOLR/Lucene indices fed by multiple sources. Another is to also exploit an OGC CSW catalog implementation.

4. Describe optional/discipline specific Components and their Services (max. 1 page)
Describe the most essential infrastructural components of the machinery and the kind of services they offer. These descriptions don't have to be comprehensive.

- Community portal(s):
  o Normally these don’t interact with the service registry but use statically configured service endpoints to invoke the processing functionality.
  o Derived data products are also normally only registered/remembered locally at the portal.
- Community processing services: two types of processing functionality are currently available or in development:
  o Processing services to support the climate impact community (e.g. the ones implemented in the Climate4Impact portal).
  o Processing services to support the model inter-comparison community, to handle basic and frequently needed multi-model analyses (statistics) involving large amounts of model data at the model data centre; a minimal sketch of such a self-describing processing service is given below. Activities in this context are coordinated in the ESGF compute working group (ESGF CWT).
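As a sketch of how such a community processing component can be exposed via a standardized, self-describing WPS interface, the following Python example defines a minimal process with the PyWPS server implementation. The process identifier, its inputs and the trivial “computation” are hypothetical placeholders; a real multi-model statistics service would operate on the large model data collections held at the data centre.

```python
# Minimal sketch of a self-describing processing component exposed via OGC WPS,
# using the PyWPS server implementation. Identifier, inputs and the trivial
# "computation" are hypothetical placeholders for a real multi-model
# statistics service operating on data held at the data centre.
from pywps import Process, LiteralInput, LiteralOutput, Service


class MultiModelMean(Process):
    """Hypothetical process: report which datasets would enter a multi-model mean."""

    def __init__(self):
        inputs = [
            LiteralInput("datasets", "Comma-separated dataset identifiers",
                         data_type="string"),
            LiteralInput("variable", "Climate variable (e.g. tas)",
                         data_type="string", default="tas"),
        ]
        outputs = [
            LiteralOutput("summary", "Summary of the requested analysis",
                          data_type="string"),
        ]
        super().__init__(
            self._handler,
            identifier="multi_model_mean",
            title="Multi-model mean (sketch)",
            abstract="Illustrative data-near analysis process; no real computation.",
            version="0.1",
            inputs=inputs,
            outputs=outputs,
            store_supported=True,    # results can be stored at the data centre
            status_supported=True,   # asynchronous execution with status updates
        )

    def _handler(self, request, response):
        datasets = request.inputs["datasets"][0].data.split(",")
        variable = request.inputs["variable"][0].data
        response.outputs["summary"].data = (
            f"would average variable '{variable}' over {len(datasets)} datasets"
        )
        return response


# The Service object wires the process into a WSGI application; GetCapabilities,
# DescribeProcess and Execute are then served from this self-describing endpoint.
application = Service(processes=[MultiModelMean()])
```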
5. Describe essentials of the underlying Data Organization (max. 1 page)
Describe the most important aspects of the underlying data organization and compare it with the model outlined by DFT.

The data is organized in collections of different granularity (time series, experiment, etc.). Collections are defined based on controlled vocabularies of coordinated large data production experiments (like CMIP, CORDEX, ...). While at the lowest level data is most often stored as files (e.g. in the hierarchical, self-describing data format netCDF), the logical level used to interact with the data is often higher and decoupled from the “file/directory” view of data storage.

The (binary) data format contains basic metadata including unique identifiers. Some data already includes PIDs, and there are concrete plans to exploit PIDs for upcoming large climate data collections. A set of PITs (PID types) is defined to allow managing aspects like collections, distributed versioning and replication.

These “well organized” data entities are often combined, for evaluation purposes, with data uploaded by users or with less structured metadata available from various sources. This data is the input to the processing components. Outputs are collections containing e.g. output data, documentary elements like images, as well as provenance records describing the inputs and the processing steps involved. Output data is also registered in a catalog.

6. Indicate the type of APIs being used (max. 1 page)
Describe the most relevant APIs and whether they are open for being used.

All APIs are open for use, with the restriction that the processing APIs are currently only open within restricted contexts or hidden behind portals, as no generic AAI solution is in place to invoke processing services at the sites. The processing is in general very data intensive, hence the required restrictions with respect to user groups and/or specific access rights.

The processing API follows the OGC WPS specification, which defines operations for service offering discovery (“GetCapabilities”), specific service discovery (“DescribeProcess”) as well as synchronous or asynchronous execution (“Execute”) (see http://www.opengeospatial.org/standards/wps).

Two types of search APIs are mainly used: the ESGF search API (defined on top of a SOLR/Lucene REST API) and the OGC CSW catalog API, which offers OGC standard conforming discovery operations for data and services (“DescribeRecord”, “GetRecords”) as well as push and pull operations to register metadata records (“Harvest” and “Transaction”).

Processing and catalog/search APIs follow basic REST web service principles and can be invoked using HTTP “GET” as well as “POST” operations.

7. Achieved Results (max. 0.5 pages)
Describe the results (if applicable) that have been achieved compared to the original motivation.

The exploitation of PIDs with PITs, alongside a PIT registry, will strongly help in the future to make data as well as services available in trans-community data analysis activities. The current approach is to implement prototypes with specific agreements on PITs as well as service descriptions, and to feed this back into the RDA PIT registry discussions. The same is true for the support of provenance information: first, specific provenance records are generated to support current analysis activities; later on, standardization and integration, e.g. via PID relations, is planned.
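To illustrate how the PIDs referred to above could be consumed programmatically, the sketch below resolves a Handle-style PID via the public Handle.Net proxy REST API and lists its typed values, which is where PIT-defined types (e.g. for collection membership, versioning or replica information) would appear. The PID string is a hypothetical placeholder, and the assumption that the data PIDs are Handle-based is made here for illustration only.

```python
# Minimal sketch, assuming Handle-based PIDs: resolve a PID via the public
# Handle.Net proxy REST API and list its typed values. The PID below is a
# hypothetical placeholder and will not resolve to a real record.
import requests

HANDLE_PROXY = "https://hdl.handle.net/api/handles/"
pid = "12345/climate-example-pid"  # hypothetical prefix/suffix

resp = requests.get(HANDLE_PROXY + pid, timeout=30)
resp.raise_for_status()
record = resp.json()

# Each handle value carries a type (e.g. URL, CHECKSUM, or a PIT-defined type
# for collections, versioning or replication) and a data payload.
for value in record.get("values", []):
    print(value["type"], "->", value["data"]["value"])
```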