Climate Diagnostics for an Exascale Archive: The ExArch Climate Diagnostics Benchmark
v0.7, July 20, 2011

Prologue

This document focuses on the science deliverables of ExArch, including Quality Assurance of climate data served on the Earth System Grid and exascale archives, development of climate data operators for diagnostic analysis, and application of the software developed within ExArch to standard and advanced climate diagnostics. This diverse set of tasks will be packaged in the ExArch Climate Diagnostics Benchmark (CDB) for public distribution, performance benchmarking, and regression testing. The document includes task specifications, a description of the ExArch CDB, and milestone dates for delivery.

Date                 Version  Comment
04/22/11             0.1      Initial draft, prepared by FBL.
05/05/11             0.2      Revised draft, prepared by FBL.
05/13/11             0.3      Revised draft, prepared by PJK.
05/16/11             0.4      Revised draft, prepared by PJK.
05/17/11 12:00 UTC   0.5      Revised draft, prepared by Frank Toussaint.
05/20/11             0.6      Minor additions in T21 by FT.
07/20/11             0.7      Revised draft, including Climate Diagnostics Benchmark description.

1 Introduction

WP3 is concerned with the proof of concept and science deliverables of the ExArch project. A key goal of WP3 is to examine the issues associated with diagnostics on the output of exascale climate computing, and to prototype and implement solutions with real scientific benefit. ExArch will promote the development of services to support CMIP5 and CORDEX with reference to Earth Observations (EO) from the JIFRESE EO archive and reanalysis datasets. These services will be evaluated by carrying out a set of real scientific studies, addressing quantities of interest from a distributed archive using robust and scalable algorithms. Many of the possible solutions to handling exascale data may be theoretically attractive but founder on ease-of-use and/or inter-institutional difficulties.
The only way these obstacles can be evaluated and addressed is by whole-system testing on actual science problems. The problems selected in the ExArch proposal for such testing also push the boundaries of what can be done in terms of data, and are therefore suitable guides to what can be achieved with exascale data access beyond the end of the project. WP3 will address specific goals that deal with the increased data volume associated with increased model resolution, increased numbers of model fields output, and increased ensemble size at exascale, as well as the need for climate data operators that more closely reflect the formulation of climate model numerics.

We propose to package the results of the diverse tasks within the WP in the "ExArch Climate Diagnostics Benchmark" (ExArch CDB), which will refer to the set of codes and results developed in the WP for public distribution, performance benchmarking, and regression testing.

This document describes the implementation plan for Tasks 19-26 outlined in the ExArch proposal and the related activities required to carry out these tasks. The ExArch CDB is described in Section 2. Section 3 describes the implementation of services for quality assurance and the extension of the Climate Data Operators (CDO) suite to operate on distributed exascale archives. Section 4 describes the implementation of a suite of diagnostics comparing model output to Earth Observations and the further development of climate metrics of different levels of sophistication. Section 5 summarizes milestones, deliverables, and timelines, as well as possible dependencies in the project that might constrain progress on achieving the objectives of WP3.

2. ExArch Climate Diagnostics Benchmark (ExArch CDB)

The ExArch CDB will package the results and deliverables of WP3. It will provide sample diagnostics and a performance benchmark for the ExArch software.
It will include examples of the QA tests described in Section 3, the set of diagnostics described in Section 4, documentation, scientific results, and computational performance results. The ExArch CDB will be available on a repository that will be read-accessible to the general public and read/write-accessible to ExArch developers. The ExArch CDB will be sufficiently simple to use, portable, and well documented to be adaptable for climate analysis applications by the broad user community. It will not provide a comprehensive climate analysis package, but will provide a starting point for researchers who wish to develop such a package for application to CMIP5 and CORDEX. As a legacy of the ExArch project, the ExArch CDB will provide the starting point for future development of diagnostics on exascale climate archives.

The ExArch CDB will be a relatively loosely organized package intended to test the general functionality of the ExArch software in realistic and scientifically relevant applications. Climate diagnostic scripts in the ExArch CDB will adhere to a simplified application interface (requiring, for example, path names/URLs to model output and observational data, identification of desired variables and analysis domains, and output location) but will otherwise be weakly constrained. It is expected that scripts in the ExArch CDB will call several different applications used for climate analysis, for example Climate Data Operators (CDO), NetCDF Operators (NCO), Python, NCL, Ferret, GrADS, and other non-commercial packages, all of which are currently in use by the climate science community. However, an overarching constraint will be that all climate analysis applications will be assumed to be OPeNDAP-enabled and amenable to eventual query-based processing. The reason for following this kind of design approach, rather than a comprehensive API design, is that climate diagnostics are often closely tied to the expertise and programming style of individual scientists.
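
As an illustration of the simplified application interface described above, a diagnostic script could accept its inputs along the following lines. This is a minimal sketch only: the class name `DiagnosticRequest`, the command-line option names, and the default domain are hypothetical choices made for this example and are not prescribed by the ExArch CDB.

```python
import argparse
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class DiagnosticRequest:
    """Inputs a CDB-style diagnostic script would need (hypothetical layout)."""
    model_sources: Tuple[str, ...]  # path names or OPeNDAP URLs to model output
    obs_sources: Tuple[str, ...]    # path names or URLs to observational data
    variables: Tuple[str, ...]      # names of the desired variables
    domain: Tuple[float, float, float, float]  # lon_min, lon_max, lat_min, lat_max
    output_dir: str                 # where results are written


def build_parser() -> argparse.ArgumentParser:
    # All option names below are assumptions made for this sketch.
    parser = argparse.ArgumentParser(description="CDB-style diagnostic (sketch)")
    parser.add_argument("--model", nargs="+", required=True,
                        help="paths or URLs to model output")
    parser.add_argument("--obs", nargs="*", default=[],
                        help="paths or URLs to observational data")
    parser.add_argument("--var", nargs="+", required=True,
                        help="variables to analyse")
    parser.add_argument("--domain", nargs=4, type=float,
                        default=[0.0, 360.0, -90.0, 90.0],
                        metavar=("LON_MIN", "LON_MAX", "LAT_MIN", "LAT_MAX"),
                        help="analysis domain in degrees")
    parser.add_argument("--out", required=True, help="output location")
    return parser


def parse_request(argv: Optional[Sequence[str]] = None) -> DiagnosticRequest:
    """Parse command-line arguments into a validated request."""
    args = build_parser().parse_args(argv)
    lon_min, lon_max, lat_min, lat_max = args.domain
    if not (-90.0 <= lat_min <= lat_max <= 90.0):
        raise ValueError("latitudes must satisfy -90 <= LAT_MIN <= LAT_MAX <= 90")
    return DiagnosticRequest(
        model_sources=tuple(args.model),
        obs_sources=tuple(args.obs),
        variables=tuple(args.var),
        domain=(lon_min, lon_max, lat_min, lat_max),
        output_dir=args.out,
    )
```

Keeping the interface this small leaves the body of each diagnostic free to call whichever analysis package its author prefers, which is the intent of the weakly constrained design.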
It is hoped that this design will encourage rapid dissemination of ExArch capabilities as a primary means of analyzing distributed climate model data. For example, this design will enable scientists who have already developed climate analysis codes for locally stored data to adapt their code to handle distributed data.

The implementation of the climate diagnostics using the ExArch query system will be described in a "how-to" guide included in the ExArch CDB. The goal is to provide potential users with a simple step-by-step explanation of the functionalities provided by the new architecture. Since the chosen diagnostics cover a wide range of advanced applications, it is expected that most users will be able to adapt our scripts for their own applications. The how-to guide will be written in parallel with the implementation of the diagnostics in Section 4. A template for the how-to guide will be spearheaded by FBL, and a preliminary version should be available for the first release of the ExArch CDB. The documentation format will conform with open source standards for documentation (PDF/Wiki format to be decided).

A preliminary implementation of the ExArch CDB will be available by Summer 2012. Eventually, all output from completed Tasks within ExArch WP3 will be included in the ExArch CDB.

3. WP3 Tasks 19-21: Development of Quality Assurance and Climate Data Operators

Tasks 19-21 focus on infrastructure requirements for climate diagnostics, including quality assurance and development of climate data operators for server-side processing. The results of these Tasks will be implemented in the climate diagnostics Tasks 22-26, and thus will be included in the ExArch CDB.

3.1 Task 19: Quality Assurance – Schema design

Quality control will become increasingly important in an exascale computing context.
Researchers will be dealing with millions of data files from multiple sources and will need to know whether the files and the output fields within them satisfy a range of basic quality criteria. Such quality assurance needs to be carried out before the data can be credibly used for scientific research, as automated evaluation processes need to run on homogeneously error-free data. Task 19 will be responsible for developing a new multiple-level quality control process for the CMIP5 and CORDEX archives. The schema (model) will cover the following levels:

(1) Technical quality assurance on the file's outside appearance: file size, file checksum, file name (for automated access), file extension (if applicable).

(2) Technical quality assurance on the file's inner structure: elementary syntax check. Each file will be evaluated to ensure that it conforms to the NetCDF-CF structure. Completeness and structure of necessary header keywords will be checked.

(3) Internal and external metadata quality assurance: within each file, the header's metadata (MD) will be evaluated to ensure that its values conform to the CMIP5 standard. In addition, the file-to-file consistency of MD will be checked where applicable.

(4) Axes quality assurance: checks (including file-to-file) on lat, lon, z. Time axis checks for calendar aspects, gaps, duplicates, monotonicity, step widths, etc.

(5) Basic scientific quality assurance: initial data evaluation for output of the CMIP5 archive. ExArch will ensure that the data for selected variables fall within a desired range of values. For example, values of precipitation rates will be checked to make sure that they fall within the range [0, 10^4] mm/day, unless the storage grid is a spherical harmonic one.

(6) Intermediate scientific quality assurance: the quality of a given field relative to selected Earth Observations (Section 4) will be evaluated via bias estimates and via statistics found within a Taylor (2001) diagram, e.g. relative spatial variance, relative pattern correlation, and standard error. ExArch will aim to develop such checks for a limited subset of data including surface temperature, precipitation, zonal mean zonal wind, OLR, and shortwave absorbed by the atmospheric column. Visual spot checks of data by a scientist are planned here as well.

(7) Advanced scientific quality assurance: some of the advanced diagnostics discussed in Section 4 will also be employed on each dataset. If these diagnostics (e.g. monsoon and cyclone statistics) can be applied successfully, it could be concluded that the data have production-grade quality. Such an application for these diagnostics will be developed by Year 3.

The extension of the CMIP5 archive to regional model output from CORDEX will require a separate quality assurance analysis but will follow the plan of items (1)-(4) above. As milestones for CMIP5 are met, it is expected that milestones for the CORDEX output will be achieved with roughly a one-year lag. These involve:

(1) Technical quality assurance on the file's outside appearance – CORDEX
(2) Technical quality assurance on the file's inner structure – CORDEX
(3) Internal and external metadata quality assurance – CORDEX
(4) Axes quality assurance – CORDEX
(5) Basic scientific quality assurance – CORDEX
(6) Intermediate scientific quality assurance – CORDEX
(7) Advanced scientific quality assurance – CORDEX

The use of the software for CORDEX data calls for a modular approach, where extensive setup functionalities allow for application to various data structures and user selection of severity levels for all quality checks. It is planned to integrate the quality checking tools into the CDO operator package.
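
The range check of level (5) and the Taylor-diagram statistics of level (6) can be sketched as follows. This is an illustrative example under stated assumptions, not the ExArch checking tool: the function names are hypothetical, fields are passed as flat sequences of numbers, and no area weighting is applied, whereas a real implementation would weight grid cells by area and read the data from NetCDF files.

```python
import math
from typing import Dict, List, Sequence


def check_range(values: Sequence[float], lower: float, upper: float) -> Dict[str, object]:
    """Level (5) sketch: flag values outside [lower, upper]; NaN also fails."""
    bad: List[int] = [i for i, v in enumerate(values) if not (lower <= v <= upper)]
    return {"passed": not bad, "n_checked": len(values), "bad_indices": bad}


def taylor_statistics(model: Sequence[float], obs: Sequence[float]) -> Dict[str, float]:
    """Level (6) sketch: bias plus the statistics summarized in a Taylor diagram.

    Unweighted statistics over paired samples (assumption: equal-area cells).
    """
    if len(model) != len(obs) or len(model) < 2:
        raise ValueError("model and obs must have the same length (at least 2)")
    n = len(model)
    mean_m = sum(model) / n
    mean_o = sum(obs) / n
    anom_m = [m - mean_m for m in model]
    anom_o = [o - mean_o for o in obs]
    std_m = math.sqrt(sum(a * a for a in anom_m) / n)
    std_o = math.sqrt(sum(a * a for a in anom_o) / n)
    if std_m == 0.0 or std_o == 0.0:
        raise ValueError("fields must not be constant")
    correlation = sum(a * b for a, b in zip(anom_m, anom_o)) / (n * std_m * std_o)
    centered_rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(anom_m, anom_o)) / n)
    return {
        "bias": mean_m - mean_o,
        "relative_std": std_m / std_o,
        "pattern_correlation": correlation,
        "centered_rmse": centered_rmse,
    }
```

For the precipitation example in the text, a call such as `check_range(precip_mm_per_day, 0.0, 1.0e4)` would report the indices of any out-of-range values, which a caller could then map to a user-selected severity level.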
Detailed design of this schema will begin in the first half of Year 1 and will yield:

- A software requirement analysis (end of 2011)
- The decision whether or not to integrate the tools into CDO (end of 2011)
- A design and implementation plan (1st quarter 2012)

3.2 Task 20: Quality Assurance – Scope and Implementation

The scientific quality assurance should evaluate aspects of the data which can be computed objectively and unambiguously and which will support data selection decisions made by researchers. This task will implement the set of operations identified in Task 19, complying with the schema developed there.

(1)-(3) Technical quality assurance. This objective check of conformity with data standards will be designed around scripts (e.g., Python) that check whether the structure of a file and the metadata in it conform to a specified standard in such quantities as sizes, names, units, calendar, etc. This task will require a method of representing the specified standard.

(4)-(7) Basic, intermediate and advanced scientific quality assurance. A series of scripts based on CDO (Task 21) or other climate analysis packages (NCO or similar) will be developed to carry out this QA. For example, reasonable ranges of values for a given set of fields will be tabulated and checked for in the data files; the quality of data against a limited number of observational fields in surface temperatures and winds will be checked; and the advanced diagnostics developed in Section 4 will be applied to the data.

It is planned to use a digital signature to provide reliable identification of the quality assurance provider.
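
A script for the technical checks (1)-(3) might take the following shape. This is a hedged sketch: the dictionary `EXAMPLE_STANDARD` is a made-up stand-in for a machine-readable standard and does not reproduce the actual CMIP5 or NetCDF-CF requirements; the header is assumed to have been read into a plain dictionary by whatever NetCDF library is in use; and the choice of SHA-256 for the checksum is an assumption, since the text does not name an algorithm.

```python
import hashlib
from typing import List, Mapping

# Hypothetical machine-readable "standard" for illustration only.
# It is NOT the real CMIP5 or NetCDF-CF specification.
EXAMPLE_STANDARD = {
    "required": ["Conventions", "institution", "calendar", "units"],
    "allowed": {"calendar": ["standard", "noleap", "360_day"]},
}


def check_metadata(header: Mapping[str, str], standard: Mapping[str, object]) -> List[str]:
    """Return a list of human-readable findings; an empty list means conformity."""
    findings: List[str] = []
    for key in standard["required"]:
        if key not in header:
            findings.append(f"missing attribute: {key}")
    for key, allowed in standard["allowed"].items():
        if key in header and header[key] not in allowed:
            findings.append(f"invalid value for {key}: {header[key]!r}")
    return findings


def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Checksum of a file, read in chunks so that large files fit in memory."""
    digest = hashlib.sha256()  # assumption: the algorithm is not fixed by the text
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Representing the standard as data rather than code means the same checking function could be reused for CMIP5 and CORDEX by swapping in a different specification, in line with the modular approach called for in Task 19.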
The implementation will start in the second half of Year 1 and continue during Years 2-3 (Task 20). It will yield:

- Standards specification in machine-readable form (CMIP5, NetCDF-CF and other); draft: end of Year 1; document: 1st quarter 2012
- Design plan for check control output (human-readable and XML); draft: end of Year 1; document: 1st quarter 2012
- Pilot version of checking tool: end of 2012

The capabilities of the QA processing will be implemented in the climate diagnostic scripts in Tasks 22-26. Since these tasks will be included in the ExArch CDB, examples of QA processing will be included by default in the ExArch CDB.

3.3 Task 21: Climate Data Operators (CDO) in an exascale archive

CDO is a collection of command line Operators to manipulate and analyse Climate and forecast model Data. A range of formats is supported and over 400 operators are provided. The current library is designed to work in a scripting environment with local files. This task will explore the extensions required to support efficient usage in a distributed exascale environment. Some extensions to the CDO functionalities will be necessary; other existing data manipulation operators (NCO etc.) will be exploited where adequate. Operators should be able to provide resource estimates to support scheduling decisions on operator order, execution start time and performance issues. The available computer memory has to be taken into account here, too. As the code of some CDO operators is parallelised (OpenMP), we will additionally explore how far this can support processing performance. Plain text output will need to be complemented with an extensible and self-descriptive form which supports aggregation of results from large file collections. This task will focus on the CDO library developments needed to support the evaluation of the diagnostics used in WP3.2 in an exascale archive.

A software requirement analysis and design will be ready by the first quarter of 2012.
The requirement analysis and design will focus on estimates of resources and performance. The software implementation will follow by the end of 2012. The development of additional operator functionalities will be carried out in Years 2 and 3, in parallel with the Advanced Climate Diagnostics. Tests of the newly developed capabilities of the CDO will be included in selected aspects of Tasks 22-26 (Section 4). Since these tasks will be included in the ExArch CDB, the newly developed CDO capabilities will be tested by default within the ExArch CDB.

4. WP3 Tasks 22-26: Advanced Climate Diagnostics and Benchmarking

Tasks 22-26 use computationally intensive climate diagnostics to test the query system and the server-side processing developed in WP2. In Years 1 and 2, a local scaled-down CMIP5 archive with computing capability will be created to experiment with the query system. In Years 2 and 3, this capability will be extended to regional model output from CORDEX. Using this local architecture, the different climate diagnostics scripts will be benchmarked at different stages of software development. These Tasks will employ the results of Tasks 19-21 (quality assurance and CDO development) as they become available. The analysis scripts described below will be incorporated into the ExArch Climate Diagnostics Benchmark.

The tasks listed below highlight scientifically relevant diagnostics that will test many of the capabilities of the ExArch software. They make extensive use of 3D data sampled at high frequency, dynamical time integration and/or computation of multi-dimensional probability distributions. In the short term (Year 1), these diagnostics will be developed using the classical client-side model of extensive download and data processing on the client side. In the medium-to-long term (Years 2-3), more of the processing will be transferred server-side using the ExArch software.
With a view to the future of exascale climate data processing, we will also aim to develop a framework for processing algorithms that endeavour to be as consistent as possible with each climate model's numerical scheme. Plans for research publications arising from the initial application of the ExArch software will be documented within each of Tasks 22-26.

4.1 Task 22: Consistency of models and observations

Observations will also be used in other tasks looking at a range of climate processes; this task will look at basic measures of consistency in climate fields. Both primary observations and reanalysis datasets will be exploited. Comparisons of mean fields and frequency distributions will be evaluated. Quantitative estimates of model error characteristics and bias corrections for the input to assessment models will be made. This task will exploit the validation databases prepared by UCLA and collaborators with independent funding.

4.2 Tasks 23-26: Advanced climate diagnostics

During fall 2011, a preliminary version of the ExArch CDB, centred on the diagnostics discussed below, will begin to be tested with a simple server-side processing framework. This phase should allow us to characterize the performance (memory use, scalability, throughput, I/O bottlenecks) of our diagnostics. For poorly performing diagnostics, WP3 will investigate whether new features within CDO or the other climate analysis software employed by the project could improve client-side and server-side performance. This would lead to WP3 formulating requests to the appropriate development teams for implementation. Similarly, initial tests using the WP2 syntax for server-side processing will likely help identify missing functionalities that will be required for the advanced diagnostics discussed next. By providing an early assessment, WP3 will ensure that the following list of diagnostics will be optimized within the WP2 framework and therefore be sufficient as a proof of concept.
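
The kind of performance measurement described above can be prototyped with a small timing harness such as the one below. It is a sketch under stated assumptions rather than the ExArch benchmarking procedure: the function name `benchmark` and the reported fields are hypothetical, and `tracemalloc` only tracks memory allocated through Python, so memory used by external tools (for example a CDO subprocess) or time spent waiting on I/O is not separated out.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict, List


def benchmark(diagnostic: Callable[..., Any], *args: Any,
              repeats: int = 3, **kwargs: Any) -> Dict[str, float]:
    """Run `diagnostic` several times; report wall-clock time and peak Python memory."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings: List[float] = []
    peak_bytes = 0
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        diagnostic(*args, **kwargs)
        timings.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
        tracemalloc.stop()
        peak_bytes = max(peak_bytes, peak)
    return {
        "best_seconds": min(timings),
        "mean_seconds": sum(timings) / len(timings),
        "peak_python_bytes": float(peak_bytes),
    }
```

Running the same diagnostic through such a harness in client-side and server-side configurations would give comparable numbers at each stage of software development.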
By the end of winter 2012, it is expected that some of the diagnostics will have been successfully applied to the CMIP5 archive and that drafting of an accompanying paper intended for peer review will be under way. The specific topic for this paper will first be decided during June 2011 but might change depending on how the implementation progresses.

During the remainder of 2012, the next version of the advanced climate diagnostics package will be developed. It will include analysis of CORDEX data, as it becomes available. These new scripts will be implemented in the ExArch CDB to add to the suite of capabilities tested in it.

4.2.1 Task 23: Monsoon Systems

This Task covers characteristics of intraseasonal-to-interannual variability in the tropics and will be based initially on standard processing scripts in use at GFDL, with Balaji acting as contact. The diagnostics will include:

1. A set of diagnostics to calculate ENSO-related statistics, including temporal spectra of SST in the NINO3 region, and regression maps of patterns of OLR, precipitation, and geopotential height coherent with NINO3. These will initially be based on standard diagnostic scripts developed by G. Lau and M. Nath at GFDL. These will be implemented by Balaji and an undergraduate student at GFDL in summer and fall 2011.

2. A set of diagnostics to calculate MJO-related statistics, including analysis of tropical wave variability (Wheeler/Kiladis plots), and diagnostic techniques to detect MJO pattern signals (e.g. Waliser et al. 2009). These will initially be based on standard diagnostic scripts developed by W. Stern at GFDL. These will be implemented by Balaji and an undergraduate student at GFDL in summer and fall 2011.

3. Classical monsoon diagnostics (to be developed).

The resulting scripts will be included in the initial version of the ExArch CDB in spring and summer 2012. It is intended that a publication will be drafted on the results of these efforts in Summer 2012.
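
To make the ENSO diagnostics of item 1 concrete, the sketch below computes a NINO3 SST index and the regression of another time series onto it. It is not the GFDL script set referred to above. It assumes the conventional NINO3 box (5S-5N, 150W-90W), takes SST as nested lists indexed `sst[time][lat][lon]`, and, as a simplification, forms anomalies relative to the time mean rather than a monthly climatology. The function names are hypothetical.

```python
import math
from typing import List, Sequence

NINO3_LAT = (-5.0, 5.0)      # degrees north
NINO3_LON = (210.0, 270.0)   # 150W-90W expressed on a 0..360 longitude axis


def nino3_index(sst: Sequence[Sequence[Sequence[float]]],
                lats: Sequence[float], lons: Sequence[float]) -> List[float]:
    """cos(latitude)-weighted mean SST over the NINO3 box, minus its time mean."""
    cells = [(j, i, math.cos(math.radians(lat)))
             for j, lat in enumerate(lats)
             if NINO3_LAT[0] <= lat <= NINO3_LAT[1]
             for i, lon in enumerate(lons)
             if NINO3_LON[0] <= lon % 360.0 <= NINO3_LON[1]]
    if not cells:
        raise ValueError("grid has no points inside the NINO3 box")
    total_weight = sum(w for _, _, w in cells)
    series = [sum(field[j][i] * w for j, i, w in cells) / total_weight
              for field in sst]
    mean = sum(series) / len(series)
    return [value - mean for value in series]


def regression_slope(index: Sequence[float], series: Sequence[float]) -> float:
    """Least-squares slope of `series` regressed onto `index` (units per unit index)."""
    if len(index) != len(series) or len(index) < 2:
        raise ValueError("index and series must have the same length (at least 2)")
    n = len(index)
    mean_x = sum(index) / n
    mean_y = sum(series) / n
    variance = sum((x - mean_x) ** 2 for x in index)
    if variance == 0.0:
        raise ValueError("index has zero variance")
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(index, series))
    return covariance / variance
```

A regression map of OLR, precipitation, or geopotential height would follow by applying `regression_slope` to the time series at each grid point of the field.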
4.2.2 Task 24: Atmospheric Dynamics: Cyclones, Eddy-fluxes and Extratropical Modes of Variability

One or possibly two cyclone-counting algorithms (Wernli and Schwierz, 2006; Hanson et al., 2004; Dacre and Gray, 2009; Lambert and Fyfe, 2006) will be implemented in the new framework over summer and fall 2011 by FBL and a summer undergraduate student. Working scripts will have been applied to portions of the CMIP5 archive by the end of 2011.

In parallel, FBL and LRM will create standard EOF-based diagnostics of intraseasonal persistence to be benchmarked with the server-side paradigm. The implementation should work across models and be able to adapt to CORDEX domain specifications. Other Eulerian-mean-based statistical analyses of storm-track characteristics will be coded to supplement our library of diagnostics. Examples of these diagnostics are a set of maps of variance of the temporally filtered geopotential height (Blackmon et al.) and the statistical TEM of Pauluis et al. (2011). This phase should be completed by mid-summer, and benchmarking of their performance in a server-side paradigm will have been conducted by spring 2012.

The resulting scripts will be included in the initial version of the ExArch CDB in spring-summer 2012. Publications based on this storm-track analysis will be drafted over spring-summer 2012.

4.2.3 Task 25: Snow Cover, hydrology

Analysis of climate parameters related to surface processes is a key priority for climate impacts diagnostics within the ExArch project. In particular, cold region hydrology, seasonal snow cover, and sea ice are parameters that are sensitive to climate forcing and are the subject of rapidly evolving observational methodologies. This Task will develop diagnostics on hemispheric-scale snow cover variability, as well as more regional aspects of seasonal snow cover and hydrology.
The Toronto group will compare snow climatologies of observations, including the Globsnow dataset, the Canadian Meteorological Snow Analysis, the NOAA Climate Data Record, and ERA40/ERA Interim reanalyses, with output from NCAR CESM and the Canadian CanCSM4. The diagnosis will focus on snow phenology (snow season duration, freezing days, melting days, etc.), interannual variability, and snow-albedo feedback processes diagnosed from the seasonal cycle (Brown et al. 2010, Fernandes et al. 2009). These analyses will be carried out in collaboration with Environment Canada and Natural Resources Canada.

The UCLA/JIFRESE group . . .

The resulting scripts will be included in the initial version of the ExArch CDB in spring-summer 2012. Publications based on these analyses will be drafted over spring-summer 2012.

4.2.4 Task 26: Moist Thermodynamics and the General Circulation

FBL will study the effect of moist thermodynamics on the general circulation by computing the mass flux joint distribution (a probability distribution on a 2D parameter space). This will serve as a large stress test of the ExArch architecture since it requires both high-resolution and high-frequency data. The computation of the mass flux joint distribution (Pauluis et al., 2008; Pauluis et al., 2010) will be implemented in a Python script in which the user will be allowed to choose between CDO and NCO for initial array manipulations. This script will be applied to ERA-Interim data and to NCAR CESM 1.0 model outputs by May 30th, 2011. It will be used on CMIP5 data once the UofT local node is up and should be extended to CORDEX data as soon as these become available.

The mass flux joint distributions computed for the CMIP5 data will be used to study the evolution of climate indices similar to those of Laliberte and Pauluis (2010). Newer indices of the moist circulation will be created based on the work of Laliberte (2011, thesis; submitted as Laliberte, Shaw and Pauluis to JAS).
These indices will be used to study the changing role of moist ascents in mid-to-high latitudes within global warming scenarios. In the future (2012 and beyond), the UofT group, and more specifically FBL, aims to expand the Lagrangian analysis further by calculating air parcel trajectories (e.g. Knippertz and Wernli, 2010) in the CMIP5/CORDEX dataset.

The resulting scripts will be included in the initial version of the ExArch CDB in spring-summer 2012. Publications based on these analyses will be drafted over spring-summer 2012.

5 Organizational aspects

In this section, the logistical aspects of the project are laid out, including the specific goals and timelines of the project. This will be detailed as the content of the previous sections is revised.

5.1 Technical milestones

- Configuration and purchase of server-side processing node hardware;
- Creating CDO/NCO/Python scripts for a suite of climate diagnostics;
- Installation of a working server-side processing node;
- Benchmarking of server-side processing and data transfer for advanced climate diagnostics;
- Feature requests for future developments of CDO and the ExArch query system.

5.2 Deliverables

ExArch Climate Diagnostics Benchmark components:

- Quality Assurance for CMIP5/CORDEX experiments.
- Climate diagnostics for CMIP5/CORDEX experiments, to be implemented over fall 2011.
- How-to documentation for the diagnostics (ExArch wiki), initial release in January 2012.
- Scaled-down node for team-wide use and timely feedback, to be readied over the summer and expected to be available in September 2011 at the latest.
5.3 Deployment time-line

- June 1, 2011: A local CMIP3 archive will go online.
- August 1, 2011: Initial diagnostics scripts written in CDO/NCO/Python; benchmarking of simple server-side processing.
- September 30, 2011: Diagnostics documentation and first feature requests; early CMIP5 computations; open access to the local server; tests of integration with CORDEX.
- November 15, 2011: Final diagnostics feature requests for first beta release.
- ??? ??, 20xx: Initial testing of Quality Assurance software from Tasks 19-20.

5.4 Dependency

The evolution of WP3 depends in part on the evolution of the other parts of the ExArch project, but the delivery plan described above proposes a sequence of developments that should not be affected by WP1 and WP2. For example, because the diagnostics of Section 4 will be tested at every stage of the development and included in the ExArch CDB, the CDB will provide a beta testing platform. Finally, software limitations will be identified during the development of the diagnostics scripts. It is the role of WP3 to locate these limitations and request new features for inclusion in subsequent versions of the software. Future software development will thus depend on the requirements of the advanced diagnostics.