Climate Diagnostics for an Exascale Archive:
The ExArch Climate Diagnostics Benchmark
v0.7, July 20, 2011.
Prologue
This document focuses on the science deliverables of ExArch, including Quality Assurance of
climate data served on the Earth System Grid and exascale archives, development of climate data
operators for diagnostic analysis, and application of the software developed within ExArch to
standard and advanced climate diagnostics. This diverse set of tasks will be packaged in the
ExArch Climate Diagnostics Benchmark (CDB) for public distribution, performance
benchmarking, and regression testing. The document includes task specifications, a description of the ExArch CDB, and milestone dates for delivery.
Date                 Version  Comment
04/22/11             0.1      Initial draft, prepared by FBL.
05/05/11             0.2      Revised draft, prepared by FBL.
05/13/11             0.3      Revised draft, prepared by PJK.
05/16/11             0.4      Revised draft, prepared by PJK.
05/17/11 12:00 UTC   0.5      Revised draft, prepared by Frank Toussaint.
05/20/11             0.6      Minor additions in T21 by FT.
07/20/11             0.7      Revised draft, including Climate Diagnostics Benchmark description.
1 Introduction
WP3 is concerned with the proof of concept and science deliverables of the ExArch project. A key
goal of WP3 is to examine the issues associated with diagnostics on the output of exascale climate
computing and prototype and implement solutions with real scientific benefit. ExArch will promote
the development of services to support CMIP5 and CORDEX with reference to Earth Observations
(EO) from the JIFRESE EO archive and reanalysis datasets. These services will be evaluated by
carrying out a set of real scientific studies, addressing quantities of interest from a distributed
archive using robust and scalable algorithms.
Many of the possible solutions to handling exascale data may be theoretically attractive but founder
upon ease-of-use and/or inter-institutional difficulties. The only way these obstacles can be
evaluated and addressed is by whole system testing on actual science problems. The problems
selected in the ExArch proposal for such testing are also ones which push the boundaries of what
can be done in terms of data, and are therefore suitable as guidelines as to what can be achieved beyond the end of the project with exascale data access. WP3 will address specific goals that deal
with issues of increased data volume associated with increased model resolution, increased numbers
of model fields output, and increased ensemble size at exascale, as well as the need for climate data
operators that more closely reflect the formulation of climate model numerics.
We propose to package the results of the diverse tasks within the WP in the "ExArch Climate
Diagnostics Benchmark" (ExArch CDB), which will refer to the set of codes and results developed
in the WP for public distribution, performance benchmarking, and regression testing.
This document describes the implementation plan for Tasks 19-26 outlined in the ExArch proposal, together with the related activities required to carry out these tasks. Section 2 describes the ExArch CDB. Section 3 describes the implementation of services for quality assurance and the
extension of the Climate Data Operators (CDO) suite to operate on distributed exascale archives.
Section 4 describes the implementation of a suite of diagnostics comparing model output to Earth
Observations and the further development of climate metrics of different levels of sophistication.
Section 5 summarizes milestones, deliverables, and timelines, as well as possible dependencies in
the project that might constrain progress on achieving the objectives of WP3.
2. ExArch Climate Diagnostics Benchmark (ExArch CDB)
The ExArch CDB will package the results and deliverables of WP3. It will provide sample
diagnostics and a performance benchmark for the ExArch software. It will include examples of the
QA tests described in Section 3, the set of diagnostics described in Section 4, documentation,
scientific results, and computational performance results.
The ExArch CDB will be available on a repository that will be read accessible to the general public
and read/write accessible to ExArch developers. The ExArch CDB will be sufficiently simple to use, portable, and well documented that it can be adapted for climate analysis applications by the broad user community. It will not be a comprehensive climate analysis package, but rather a starting point for researchers who wish to develop such a package for application to CMIP5 and CORDEX. As a legacy of the ExArch project, the ExArch CDB will provide the starting point for
future development of diagnostics on exascale climate archives.
The ExArch CDB will be a relatively loosely organized package intended to test the general
functionality of the ExArch software in realistic and scientifically relevant applications. Climate
diagnostic scripts in the ExArch CDB will adhere to a simplified application interface (requiring,
for example, path names/URLs to model output and observational data, identification of desired
variables and analysis domains, and output location) but will otherwise be weakly constrained. It is
expected that scripts in the ExArch CDB will call several different applications used for climate
analysis, for example, Climate Data Operators (CDO), the NetCDF Operators (NCO), Python, NCL, Ferret, GrADS, and other non-commercial packages, all of which are currently in use by the climate science community. However, an overarching constraint will be that all climate analysis applications will be assumed to be OPeNDAP-enabled and amenable to eventual query-based processing.
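As an illustration only, the following Python sketch shows the kind of simplified application interface envisaged here; the option names (--model-url, --obs-url, --variables, --domain, --output-dir) are hypothetical and would be fixed as the CDB interface is defined.

    # Hypothetical command-line interface for an ExArch CDB diagnostic script.
    # Option names are illustrative only, not a defined ExArch specification.
    import argparse

    def parse_cdb_arguments():
        parser = argparse.ArgumentParser(
            description="Example ExArch CDB diagnostic driver (sketch)")
        parser.add_argument("--model-url", required=True,
                            help="path or OPeNDAP URL of the model output")
        parser.add_argument("--obs-url", required=True,
                            help="path or OPeNDAP URL of the observational data")
        parser.add_argument("--variables", nargs="+", default=["tas", "pr"],
                            help="variables to analyse (CMIP5 short names)")
        parser.add_argument("--domain", nargs=4, type=float,
                            metavar=("LON_W", "LON_E", "LAT_S", "LAT_N"),
                            default=[0.0, 360.0, -90.0, 90.0],
                            help="analysis domain")
        parser.add_argument("--output-dir", default=".",
                            help="directory for diagnostic output")
        return parser.parse_args()

    if __name__ == "__main__":
        args = parse_cdb_arguments()
        print("Model data:", args.model_url)
        print("Observations:", args.obs_url)
        print("Variables:", args.variables)
        print("Domain (W, E, S, N):", args.domain)
        print("Output directory:", args.output_dir)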
The reason for following this kind of design approach, rather than a comprehensive API design, is
that climate diagnostics are often closely tied to the expertise and programming style of individual
scientists. It is hoped that this design will encourage rapid dissemination of ExArch capabilities as a
primary means of analyzing distributed climate model data. For example, this design will enable
scientists who have already developed climate analysis codes for locally stored data to adapt their code to handle distributed data.
The implementation of the climate diagnostics using the ExArch query system will be described in
a “how-to” guide included in the ExArch CDB. The goal is to provide potential users with a simple
step-by-step explanation of the functionalities provided by the new architecture. Since the chosen
diagnostics cover a wide range of advanced applications it is expected that most users will be able
to adapt our scripts for their own applications.
The how-to guide will be written in parallel with the implementation of the diagnostics in Section 4.
A template for the how-to guide will be spearheaded by FBL and a preliminary version should be
available for the first release of the ExArch CDB. The documentation format will conform with
open source standards for documentation (PDF/Wiki format to be decided).
A preliminary implementation of the ExArch CDB will be available by Summer 2012. Eventually,
all output from completed Tasks within ExArch WP3 will be included in the ExArch CDB.
3. WP3 Tasks 19-21: Development of Quality Assurance and Climate
Data Operators
Tasks 19-21 focus on infrastructure requirements for climate diagnostics, including quality
assurance and development of climate data operators for server side processing. The results of these
Tasks will be implemented in the climate diagnostics Tasks 22-26, and thus will be included in the
ExArch CDB.
3.1 Task 19: Quality Assurance – Schema design
Quality control will become increasingly important in an exascale computing context. Researchers
will be dealing with millions of data files from multiple sources and will need to know whether the
files and the output fields within them satisfy a range of basic quality criteria. Such quality
assurance needs to be carried out before the data can be credibly used for scientific research, as automated evaluation processes need to run on homogeneously error-free data.
Task 19 will be responsible for developing a new multiple-level quality control process for the CMIP5 and CORDEX archives. The schema (model) will cover the following levels:
(1) Technical quality assurance on a file’s outside appearance:
File size, file checksum, file name (for automated access), file extension (if applicable).
(2) Technical quality assurance on a file’s inner structure:
Elementary syntax check. Each file will be evaluated to ensure that it conforms to the NetCDF-CF
structure. Completeness and structure of necessary header keywords will be checked.
(3) Internal and external metadata quality assurance:
Within each file, the header’s metadata (MD) will be evaluated to ensure that its values conform to the CMIP5 standard. In addition, the file-to-file consistency of MD will be checked where applicable.
(4) Axes quality assurance:
Checks (including file-to-file consistency) on the latitude, longitude, and vertical axes. Time axis checks for calendar conventions, gaps, duplicates, monotonicity, step widths, etc.
(5) Basic scientific quality assurance:
Initial data evaluation for output of the CMIP5 archive. ExArch will ensure that the data for selected variables falls within a desired range of values. For example, values of precipitation rates will be checked to make sure that they fall within the range [0, 10^4 mm/day], unless the storage grid is a spherical harmonic one (a minimal sketch of such a range check follows the list of levels below).
(6) Intermediate scientific quality assurance:
The quality of a given field relative to selected Earth Observations (Section 4) will be evaluated via
bias estimates and via statistics found within a Taylor (2001) diagram, e.g. relative spatial variance,
relative pattern correlation, and standard error. ExArch will aim to develop such checks for a
limited subset of data including surface temperature, precipitation, zonal mean zonal wind, OLR,
and shortwave absorbed by the atmospheric column. Visual spot checks of data by a scientist are
planned here, as well.
(7) Advanced scientific quality assurance:
Some of the advanced diagnostics discussed in Section 4 will also be employed on each dataset. If
these diagnostics (e.g. monsoon and cyclone statistics) can be applied successfully, it could be concluded that the data has production-grade quality. Such an application of these diagnostics will be developed by Year 3.
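As a minimal sketch of the basic scientific quality assurance in level (5), the following Python fragment checks that precipitation values fall within the [0, 10^4 mm/day] range quoted above; it assumes the netCDF4 library and a CMIP5 precipitation flux variable "pr" stored in kg m-2 s-1, so the upper bound is converted accordingly.

    # Minimal sketch of a level (5) range check; the file name and variable
    # name are assumptions for illustration.
    import numpy as np
    from netCDF4 import Dataset

    MAX_PR_MM_PER_DAY = 1.0e4
    MAX_PR_KG_M2_S = MAX_PR_MM_PER_DAY / 86400.0  # 1 mm/day = 1/86400 kg m-2 s-1

    def check_precip_range(path, varname="pr"):
        """Return True if all precipitation values lie in the allowed range."""
        with Dataset(path) as nc:
            pr = nc.variables[varname][:]
        if np.any(pr < 0.0) or np.any(pr > MAX_PR_KG_M2_S):
            print("QA FAIL: %s outside [0, %g kg m-2 s-1]" % (varname, MAX_PR_KG_M2_S))
            return False
        print("QA PASS: %s within allowed range" % varname)
        return True

    # Example usage (hypothetical file name):
    # check_precip_range("pr_Amon_model_historical_r1i1p1_185001-200512.nc")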
The extension of the CMIP5 archive to regional model output from CORDEX will require a
separate quality assurance analysis but will follow the plan of items (1)-(4) above. As milestones
for CMIP5 are met, it is expected that milestones for the CORDEX output will be achieved with
roughly a one-year lag. These involve:
(1) Technical quality assurance on a file’s outside appearance - CORDEX
(2) Technical quality assurance on a file’s inner structure - CORDEX
(3) Internal and external metadata quality assurance - CORDEX
(4) Axes quality assurance - CORDEX
(5) Basic scientific quality assurance - CORDEX
(6) Intermediate scientific quality assurance - CORDEX
(7) Advanced scientific quality assurance - CORDEX
The use of the software for CORDEX data calls for a modular approach, where extensive setup
functionalities allow for application to various data structures and user selection of severity levels
for all quality checks. It is planned to integrate the quality checking tools into the CDO operator
package.
Detailed design of this schema will begin in the first half of Year 1 and will yield
- A software requirement analysis (end of 2011)
- The decision whether or not to integrate the tools into CDO (end of 2011)
- A design and implementation plan (1st quarter 2012)
3.2 Task 20: Quality Assurance – Scope and Implementation
The scientific quality assurance should evaluate aspects of the data which can be computed
objectively and unambiguously and which will support data selection decisions made by
researchers. This task will implement the set of operations identified in Task 19, complying with the
schema developed there.
(1)-(3) Technical quality assurance. This objective check of conformity with data standards will be designed around scripts (e.g., Python) that check whether the structure of, and metadata in, a file conform to a specified standard in such quantities as sizes, names, units, and calendar. This task will require a method of representing the specified standard; one possible machine-readable representation is sketched below.
(4)-(7) Basic, intermediate and advanced scientific quality assurance. A series of scripts based on CDO (Task 21) or other climate analysis packages (NCO or similar) will be developed to carry out this QA. For example, reasonable ranges of values for a given set of fields will be tabulated and checked for in the data files; the quality of the data against a limited number of observational fields of surface temperatures and winds will be checked; and the advanced diagnostics developed in Section 4 will be applied as described under level (7) of Task 19.
It is planned to use a digital signature to provide reliable identification of the quality assurance
provider.
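One possible way to represent the specified standard in machine-readable form, and to check a file's global metadata against it, is sketched below; the dictionary of required attributes is illustrative only and not the full CMIP5/NetCDF-CF specification.

    # Sketch of a machine-readable standard and a conformity check against it.
    # The attribute list is a small, illustrative subset, not the full standard.
    from netCDF4 import Dataset

    REQUIRED_GLOBAL_ATTRIBUTES = {
        "Conventions": None,      # attribute must exist; value not pinned here
        "project_id": "CMIP5",    # attribute must exist and match this value
        "experiment_id": None,
        "frequency": None,
    }

    def check_global_attributes(path, standard=REQUIRED_GLOBAL_ATTRIBUTES):
        """Return a list of metadata problems found in the file (empty = pass)."""
        problems = []
        with Dataset(path) as nc:
            for name, expected in standard.items():
                if name not in nc.ncattrs():
                    problems.append("missing global attribute: %s" % name)
                elif expected is not None and getattr(nc, name) != expected:
                    problems.append("attribute %s = %r, expected %r"
                                    % (name, getattr(nc, name), expected))
        return problems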
The implementation will start in the second half of Year 1, continue during Years 2-3 (Task 20), and will yield:
- Standards specification in machine-readable form (CMIP5, NetCDF-CF and other), draft:
end of year one, document 1st quarter 2012
- Design plan for check control output (human-readable and XML), draft: end of year one,
document 1st quarter 2012
- Pilot version of checking tool, end of 2012
The capabilities of the QA processing will be implemented in the climate diagnostic scripts in Tasks
22-26. Since these tasks will be included in the ExArch CDB, examples of QA processing will be
included by default in the ExArch CDB.
3.3 Task 21: Climate Data Operators (CDO) in an exascale archive
CDO is a collection of command line Operators to manipulate and analyse Climate and forecast
model Data. A range of formats are supported and over 400 operators are provided. The current
library is designed to work in a scripting environment with local files. This task will explore the extensions required to support efficient usage in an exascale environment. Some extensions to the CDO functionalities will be necessary; other existing data manipulation operators (NCO etc.) will be exploited where adequate.
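As a rough illustration of how such operators might be driven in a distributed setting, the sketch below chains two standard CDO operators from Python over a remote dataset; it assumes a cdo binary on the PATH and a build able to read OPeNDAP URLs, and the URL shown is a placeholder. Chaining the operators keeps the intermediate variable selection in memory instead of writing a temporary file.

    # Sketch of driving chained CDO operators from Python; the URL is a
    # placeholder and OPeNDAP support depends on how CDO was built.
    import subprocess

    def global_mean_tas(input_url, output_file="tas_fldmean.nc"):
        """Area-weighted global mean of 'tas' using chained CDO operators."""
        cmd = ["cdo", "-fldmean", "-selname,tas", input_url, output_file]
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
        return output_file

    # Example usage (placeholder URL):
    # global_mean_tas("http://example.org/thredds/dodsC/cmip5/tas_Amon_model.nc")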
Operators should be able to provide resource estimates to support scheduling decisions on operator
order, execution start time, and performance issues. Here, the available computer memory also has to be taken into account. As the code of some CDO operators is parallelised (OpenMP), we will additionally explore how far this can improve the performance of the processing.
Plain text output will need to be complemented with an extensible and self-descriptive form which
supports aggregation of results from large file collections. This task will focus on the CDO library
developments needed to support the evaluation of the diagnostics used in WP3.2 in an exascale
archive.
A software requirement analysis and design will be ready by the first quarter of 2012. It will focus on estimates of resources and performance. The software implementation will follow by the end of 2012.
The development of additional operator functionalities will be carried out in years 2 and 3, in
parallel to Advanced Climate Diagnostics.
Tests of the newly developed capabilities of the CDO will be included in selected aspects of Tasks
22-26 (Section 4). Since these tasks will be included in the ExArch CDB, the newly developed
CDO capabilities will be tested by default within the ExArch CDB.
4. WP3 Tasks 22-26: Advanced Climate Diagnostics and Benchmarking
Tasks 22-26 use computationally-intensive climate diagnostics to test the query system and the
server-side processing developed in WP2. In Years 1 and 2, a local scaled-down CMIP5 archive
with computing capability will be created to experiment with the query system. In Years 2 and 3,
this capability will be extended to regional model output from CORDEX. Using this local
architecture, the different climate diagnostics scripts will be benchmarked at different stages of
software development. Tasks within this work package will employ the results of Tasks 19-21 (quality assurance and CDO development) as they become available.
The analysis scripts described below will be incorporated into the ExArch Climate Diagnostics
Benchmark. The tasks listed below highlight scientifically relevant diagnostics that will test many
of the capabilities of the ExArch software. They make extensive use of 3D data sampled at high
frequency, dynamical time integration and/or computation of multi-dimensional probability
distributions. In the short term (Year 1), these diagnostics will be developed using the classical model of extensive download and client-side data processing. In the medium-to-long term (Years 2-3), more of the processing will be transferred to the server side using the ExArch software. With a view to the future of exascale climate data processing, we will also aim to develop a framework for processing algorithms that endeavour to be as consistent as possible with each climate model's numerical scheme.
Plans for research publications arising from the initial application of the ExArch software will be
documented within each of Tasks 22-26.
4.1 Task 22: Consistency of models and observations
Observations will also be used in other tasks looking at a range of climate processes: this task will
look at basic measures of consistency in climate fields. Both primary observations and reanalysis
datasets will be exploited. Comparisons of mean fields and frequency distributions will be
evaluated. Quantitative estimates of model error characteristics and bias corrections for the input to
assessment models will be made. This task will exploit the validation databases prepared by UCLA
and collaborators with independent funding.
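As an indication of the basic consistency measures meant here, the following sketch computes an area-weighted bias, relative spatial standard deviation, and pattern correlation between a model field and an observational field assumed to be already regridded to a common grid; the array shapes and weighting are assumptions for illustration.

    # Sketch of Taylor-diagram style consistency statistics for two 2D fields
    # on a common grid; 'weights' would normally be grid-cell areas.
    import numpy as np

    def consistency_statistics(model, obs, weights=None):
        """Return (bias, relative std. dev., pattern correlation)."""
        m = np.ravel(model).astype(float)
        o = np.ravel(obs).astype(float)
        w = np.ones_like(m) if weights is None else np.ravel(weights).astype(float)
        w = w / w.sum()
        bias = np.sum(w * (m - o))
        m_anom = m - np.sum(w * m)
        o_anom = o - np.sum(w * o)
        std_m = np.sqrt(np.sum(w * m_anom ** 2))
        std_o = np.sqrt(np.sum(w * o_anom ** 2))
        corr = np.sum(w * m_anom * o_anom) / (std_m * std_o)
        return bias, std_m / std_o, corr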
4.2 Tasks 23-26: Advanced climate diagnostics
During fall 2011, a preliminary version of the ExArch CDB, centred on the diagnostics discussed
below, will begin to be tested with a simple server-side processing framework. This phase should
allow us to identify the performance (memory use, scalability, throughput, I/O bottlenecks) of our
diagnostics. For poorly performing diagnostics, WP3 will investigate whether new features within
CDO or the other climate analysis software employed by the project could improve client and
server-side performance. This would lead to WP3 formulating requests to the appropriate
development teams for implementation. Similarly, initial tests using the WP2 syntax for server-side
processing will likely help identify missing functionalities that will be required for the advanced
diagnostics discussed next. By providing an early assessment, WP3 will ensure that the following
list of diagnostics will be optimized within the WP2 framework and therefore be sufficient as a
proof of concept.
By the end of winter 2012, it is expected that some of the diagnostics will have been successfully
applied to the CMIP5 archive and drafting of an accompanying paper intended for peer-review
should be under way. The specific topic for this future paper will first be decided during June 2011 but might change depending on how the implementation progresses.
During the remainder of 2012, the next version of the advanced climate diagnostics package will be
developed. It will include analysis of CORDEX data, as it becomes available. These new scripts
will be implemented in the ExArch CDB to add to the suite of capabilities tested in it.
4.2.1 Task 23: Monsoon Systems
This Task covers characteristics of intraseasonal-to-interannual variability in the tropics and will be
based initially on standard processing scripts in use at GFDL, with Balaji acting as contact. The
diagnostics will include:
1. A set of diagnostics to calculate ENSO-related statistics, including temporal spectra of SST in the NINO3 region, and regression maps of patterns of OLR, precipitation, and geopotential height coherent with NINO3 (a minimal sketch of the NINO3 index and its spectrum follows this list). These will initially be based on standard diagnostic scripts developed by G. Lau and M. Nath at GFDL. These will be implemented by Balaji and an undergraduate student at GFDL in summer and fall 2011.
2. A set of diagnostics to calculate MJO-related statistics, including analysis of tropical wave variability (Wheeler/Kiladis plots) and diagnostic techniques to detect MJO signals (e.g. Waliser et al. 2009). These will initially be based on standard diagnostic scripts developed by W. Stern at GFDL. These will be implemented by Balaji and an undergraduate student at GFDL in summer and fall 2011.
3. Classical monsoon diagnostics (to be developed).
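The sketch below illustrates item 1: an area-weighted NINO3 SST index and its raw power spectrum, assuming monthly SST already loaded as a numpy array with its latitude and longitude vectors; detrending, windowing, and significance testing are omitted.

    # Sketch of a NINO3 SST index (5S-5N, 150W-90W) and its power spectrum.
    # Input array layout and the absence of detrending/windowing are assumptions.
    import numpy as np

    def nino3_spectrum(sst, lat, lon, dt_months=1.0):
        """Return (frequency in cycles/month, raw power spectrum) of the NINO3 index."""
        lat_mask = (lat >= -5.0) & (lat <= 5.0)
        lon_mask = (lon >= 210.0) & (lon <= 270.0)   # 150W-90W on a 0-360 grid
        box = sst[:, lat_mask, :][:, :, lon_mask]    # (time, lat_box, lon_box)
        w = np.broadcast_to(np.cos(np.deg2rad(lat[lat_mask]))[:, None], box.shape[1:])
        index = np.average(box.reshape(box.shape[0], -1), axis=1, weights=w.ravel())
        anomalies = index - index.mean()
        power = np.abs(np.fft.rfft(anomalies)) ** 2
        freq = np.fft.rfftfreq(anomalies.size, d=dt_months)
        return freq, power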
The resulting scripts will be included in the initial version of the ExArch CDB in spring and
summer 2012. It is intended that a publication will be drafted on the results of these efforts in
Summer 2012.
4.2.2 Task 24: Atmospheric Dynamics: Cyclones, Eddy-fluxes and Extratropical Modes of
variability
One or possibly two cyclone-counting algorithms (Wernli and Schwierz, 2006; Hanson et al., 2004;
Dacre and Gray, 2009; Lambert and Fyfe, 2006) will be implemented in the new framework over
summer and fall 2011 by FBL and a summer undergraduate student. Working scripts will have been
applied to portions of the CMIP5 archive by end of 2011.
In parallel, FBL and LRM will create standard EOF-based diagnostics of intraseasonal persistence
to be benchmarked with the server-side paradigm. The implementation should work across models and be able to adapt to CORDEX domain specifications.
Other Eulerian-mean-based statistical analyses of storm-track characteristics will be coded to supplement our library of diagnostics. Examples of these diagnostics are maps of the variance of the temporally filtered geopotential height (Blackmon et al.) and the statistical TEM of Pauluis et al. (2011). This phase should be completed by mid-summer and benchmarking of their
performance in a server-side paradigm will have been conducted by spring 2012.
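A minimal sketch of one such Eulerian storm-track measure is given below: the variance of time-bandpass-filtered daily 500 hPa geopotential height, in the spirit of the Blackmon et al. maps; the 2-8 day band and the input array layout are assumptions.

    # Sketch of a bandpass-filtered variance map; the band limits and the
    # (time, lat, lon) array layout are assumptions for illustration.
    import numpy as np

    def bandpass_variance(z500_daily, t_min_days=2.0, t_max_days=8.0):
        """Variance at each grid point of the 2-8 day filtered daily field."""
        n_time = z500_daily.shape[0]
        freq = np.fft.rfftfreq(n_time, d=1.0)               # cycles per day
        keep = (freq >= 1.0 / t_max_days) & (freq <= 1.0 / t_min_days)
        spectrum = np.fft.rfft(z500_daily, axis=0)
        spectrum[~keep, :, :] = 0.0                          # zero outside the band
        filtered = np.fft.irfft(spectrum, n=n_time, axis=0)
        return filtered.var(axis=0)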
The resulting scripts will be included in the initial version of the ExArch CDB in spring-summer
2012. Publications based on this storm track analysis will be drafted over spring-summer 2012.
4.2.3 Task 25: Snow Cover, hydrology
Analysis of climate parameters related to surface processes is a key priority for climate impacts diagnostics within the ExArch project. In particular, cold-region hydrology, seasonal snow cover, and sea ice represent parameters that are sensitive to climate forcing and are the subject of rapidly evolving observational methodologies. This Task will develop diagnostics on hemispheric-scale
snow cover variability, as well as more regional aspects of seasonal snow cover and hydrology.
The Toronto group will compare snow climatologies of observations, including the Globsnow
dataset, the Canadian Meteorological Snow Analysis, the NOAA Climate Data Record, and
ERA40/ERA Interim reanalyses, with output from NCAR CESM and the Canadian CanCSM4. The
diagnosis will focus on snow phenology (snow season duration, freezing days, melting days, etc.),
interannual variability, and snow-albedo feedback processes diagnosed from the seasonal cycle
(Brown et al. 2010, Fernandes et al. 2009). These analyses will be carried out in collaboration with
Environment Canada and Natural Resources Canada.
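As a simple sketch of one of the snow-phenology measures mentioned above, the fragment below counts snow-covered days per grid point from daily snow cover fraction; the percent units and the 50% threshold are assumptions for illustration.

    # Sketch of a snow cover duration diagnostic; the percent units and the
    # 50% threshold are assumptions, not a fixed ExArch definition.
    import numpy as np

    def snow_cover_duration(snc_daily, threshold_percent=50.0):
        """Count days with snow cover fraction above the threshold.

        snc_daily: array of shape (time, lat, lon), daily snow cover fraction in %.
        Returns an array of shape (lat, lon) of snow-covered day counts.
        """
        return np.sum(snc_daily >= threshold_percent, axis=0)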
The UCLA/JIFRESE group . . .
The resulting scripts will be included in the initial version of the ExArch CDB in spring-summer
2012. Publications based on these analyses will be drafted over spring-summer 2012.
4.2.4 Task 26: Moist Thermodynamics and the General Circulation
FBL will study the effect of moist thermodynamics on the general circulation by computing the
mass flux joint distribution (probability distribution on a 2D parameter space). This will serve as a
large stress test on the ExArch architecture since it requires both high-resolution and high-frequency data.
The computation of the mass flux joint distribution (Pauluis et al, 2008; Pauluis et al, 2010) will be
implemented in a Python script where the user will be allowed to choose between CDO and NCO
for initial array manipulations. This script will be applied to ERA-Interim data and to NCAR CESM 1.0 model outputs by May 30th, 2011. It will be used on CMIP5 data once the UofT local node is up, and should be extended to CORDEX data as soon as these become available.
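A schematic sketch of the joint-distribution computation is given below: the vertical mass flux is accumulated on a 2D grid of two thermodynamic coordinates (labelled generically here; the actual coordinate choice follows Pauluis et al., 2008, 2010), with flattened, high-frequency numpy arrays assumed as input.

    # Schematic sketch only: accumulate mass flux in bins of two generic
    # thermodynamic coordinates; coordinate choice and binning are assumptions.
    import numpy as np

    def mass_flux_joint_distribution(mass_flux, coord_a, coord_b, bins_a, bins_b):
        """Sum the mass flux falling in each (coord_a, coord_b) bin."""
        dist, edges_a, edges_b = np.histogram2d(
            coord_a, coord_b, bins=[bins_a, bins_b], weights=mass_flux)
        return dist, edges_a, edges_b

    # Example usage with random placeholder data:
    # dist, ea, eb = mass_flux_joint_distribution(
    #     np.random.randn(1000), np.random.rand(1000), np.random.rand(1000),
    #     bins_a=np.linspace(0, 1, 21), bins_b=np.linspace(0, 1, 21))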
The mass flux joint distributions computed for the CMIP5 data will be used to study the evolution of climate indices similar to those of Laliberte and Pauluis (2010). Newer indices of the moist circulation will be created based on the work of Laliberte (2011, thesis; submitted as Laliberte, Shaw and Pauluis to JAS). These indices will be used to study the changing role of moist ascents in mid-to-high latitudes within global warming scenarios.
In the future (2012 and beyond), the UofT group and more specifically FBL aim to expand the
Lagrangian analysis further by calculating air parcel trajectories (e.g. Knippertz and Wernli, 2010)
in the CMIP5/CORDEX dataset.
The resulting scripts will be included in the initial version of the ExArch CDB in spring-summer
2012. Publications based on these analyses will be drafted over spring-summer 2012.
5 Organizational aspects
In this section, the logistical aspects of the project are laid out, including the specific goals and
timelines of the project. This will be detailed as the content of the previous sections is revised.
5.1 Technical milestones
- Configuration and purchase of server-side processing node hardware;
- Creating CDO/NCO/python scripts for a suite of climate diagnostics;
- Installation of a working server-side processing node;
- Benchmarking of server-side processing and data transfer for advanced climate
diagnostics;
- Feature request for future developments of CDO and ExArch query system;
5.2 Deliverables
ExArch Climate Diagnostics Benchmark components:
- Quality Assurance for CMIP5/CORDEX experiments
- Climate diagnostics for CMIP5/CORDEX experiments, to be implemented over fall
2011.
- How-to documentation for the diagnostics (ExArch wiki), initial release in January
2012.
- Scaled-down node for team-wide use and timely feedback, ready over the summer and
expected to be available in September 2011 at the latest.
5.3 Deployment time-line
 June 1, 2011: A local CMIP3 archive will go online;
 August 1, 2011: Initial diagnostics scripts written in CDO/NCO/python; Benchmarking of
simple server-side processing; 
 September 30, 2011: Diagnostics documentation and first features request; Early CMIP5
computations; Open access to the local server; Tests of integration with CORDEX;
 November 15, 2011: Final diagnostics features request for first beta release.
 ??? ??, 20xx: Initial testing of Quality Assurance software from Tasks 19-20. 
5.4 Dependency
The evolution of WP3 depends in part on the evolution of the other parts of the ExArch project, but
the delivery plan described above proposes a sequence of developments that should not be affected
by WP1 and WP2. For example, because the diagnostics of Section 4 will be tested at every stage of the development and included in the ExArch CDB, the CDB will provide a beta-testing platform. Finally, during the development of the diagnostic scripts, software limitations will be identified. It is the role of WP3 to locate these limitations and request new features for inclusion in subsequent
versions of the software. Future software development will thus depend on the advanced
diagnostics requirements.