Task Force on Seasonal Prediction - data handling strategy
Working Draft 1.0 - 09 Dec 2005
The TFSP meeting in Trieste (August 2005) requested that a working group should
create a detailed proposal for data handling for the experimentation to be carried out
for TFSP. The proposal is intended to follow the outline strategy discussed and
provisionally agreed at the Trieste meeting. This document aims to develop the
specifics of the proposed strategy.
A. Trieste strategy
After much discussion of the relative merits of a centralized versus distributed system
of sharing data, it was proposed to try a hybrid solution. That is, standards will be set,
and producing (or distributing) centres will be able to serve their own data so as to
meet the specified standards. Alternatively, if a producing centre would prefer not to
be responsible for serving its own data, it can pass the data to another centre which is
willing to serve it. It is envisaged that there will be several centres which will be
willing and able to serve other people’s data, at least from within a specified region.
The data is envisaged as being served in CF compliant netCDF, probably with an
OPeNDAP (ie DODS) interface. Other data formats and data serving options are
possible as additional extras, but the netCDF service is a mandatory minimum.
Issues that the working group need to consider are:
* The metadata content needed
* How the metadata will be specified in the netCDF
* Agreement on OPeNDAP as the initial web interface standard, and any issues
arising from this.
* The data volumes to be expected, the extent to which they are feasible, and whether
certain parts of the data might need to be made optional.
* Identifying sufficient capacity to serve the expected datasets
* Any recommendations on procedures to ensure correctness of served data
* How the strategy relates to other data strategies and projects
B. Working group proposals
0. Introduction
The following proposal takes account of established practice at operational centres,
usual practice in the research community, and the protocols established at PCMDI for
handling the IPCC data. It also takes account of detailed work undertaken in Europe
on how to make ENSEMBLES data available in CF compliant netCDF. Additionally,
we bear in mind what might be needed to (partially) harmonize the metadata with the
structures being developed by the operational side of WMO for use eg in TIGGE.
Proposal: The TFSP data should be CF compliant netCDF data, with specified
metadata content.
To complete the proposal, we need to specify the required metadata, and also give
rules and guidance on how the metadata is to be encoded in netCDF, and how files
ought to be structured for data exchange. These issues are dealt with in the following
sections.
1. Metadata content
In the first instance, we assume that data to be exchanged are raw model output. If
calibrated forecast products, anomalies, climatologies, verification scores etc are to be
exchanged, then further metadata will be required. Note that although we will discuss
below a particular representation of the metadata in netCDF, it is the metadata
themselves which are the most fundamental part of this proposal. The representation
of the metadata may change in the future, due either to new versions of netCDF and
CF, or possibly even new data formats altogether, but the metadata should be
relatively stable. The metadata discussed here are those needed to define a single
model integration.
Requirement: the metadata must be machine readable, must properly distinguish
different datasets in a way that enables the data to be archived, and must provide
metadata useful for data searching. Metadata should also be useable for automatic plot
labelling.
It is helpful to distinguish which metadata define the data, and which simply provide
additional information. The latter could be used in database searches and for labelling
purposes, but would not form part of any archive structure. These additional variables
are listed below as comments. Some of the metadata are names in the form of strings.
We may want to create different (linked) versions of these, for example one short
fixed version (suitable for long term archival purposes) and one slightly longer more
descriptive version.
Many of the metadata are logically independent, in the sense that specifying one does
not fix the value of another. However, metadata can be linked. For example, we might
provide both a long and a short name for an institution, or we might describe certain
characteristics of a given experiment identifier. Such logical connections are noted
below, since in some representations of the data (notably netCDF), they may affect
how the metadata can or should be coded.
Defining metadata:
i. originating_centre: eg Met Office - centre with scientific responsibility for
integrations (STRING, max length=6 and/or 16) (definition)
[It has been suggested that this should be coordinated with the work by WMO to
define unique identifiers for producing centres, but initial discussions have not been
promising. Perhaps we should have the definition being a unique, time invariant short
string (length 6), and a separate metadata item such as centre_name, which would
include a nice English language name which could be used for labelling etc, and
which might change from time to time as institutes re-brand themselves.]
ii. experiment_identifier (STRING, max length=6 or 16). The originating centre is
fundamentally responsible for assigning unique experiment identifiers for the
different datasets it makes available, and should (ideally) provide documentation of
each experiment. It is possible for common experiment identifiers to be agreed
between different centres, if they are carrying out a common experiment. But there is
no a priori guarantee that identical identifiers from different centres refer to
scientifically equivalent experiments. (definition)
iii. forecast_system_version_number (assigned by originating centre; scientific
details of the models used etc should be provided via a web link, INTEGER)
(definition)
iv. forecast_method_number (default =1) (This distinguishes forecasts made with
the same underlying model/forecasting system, but where variations have been
introduced such that the different integrations have different properties, most
importantly different climate drift. An example is the members of a perturbed
parameter ensemble forecast. INTEGER) (definition)
v. ensemble_member_number (Different integrations made with the same model
and forecasting system, which form a homogeneous and statistically indistinguishable
ensemble. INTEGER) (definition)
Additional metadata:
i. original_distributor: eg ECMWF - centre with responsibility for operational or
research distribution of data, ie the centre who first made the data publicly available,
and to whom queries of data integrity should be sent. (STRING, max length=16)
(comment)
ii. production_status: operational, research, or a user defined <project_id>.
“research” should be used for general research at a centre; project_ids should be used
for specified international research projects. (STRING, max length=16) (comment,
logically associated with experiment identifier)
iii. model_identifier (no default) (STRING, max length=16) (comment, logically
associated with forecast_system_version number)
iv. sst_specification (STRING, “coupled” or “observed” or “predicted” or “persisted
anomaly” or “persisted absolute”, logically associated with experiment identifier)
v. real_time: “true” or “false”, according to whether the forecast was made in real
time. Not an attribute of the experiment or the system_version, but of the individual
forecast.
vi. archive_date: “YYYYMMDD” or “unknown”. When the data were produced,
archived or published. The aim is to provide an approximate timestamp, to easily
distinguish between recent experiments and much older ones. Also, in the case that
data need to be corrected in a globally distributed data system, the archive_date could
be used to distinguish between the older, original data and the newer, corrected data.
An attribute of the individual model integration.
An appropriate definition of “real time” will need to be given. A first proposal is “a
seasonal forecast issued less than one calendar month after the nominal start date; or a
short to medium range weather forecast issued less than 24 hours after the nominal
start date”.
A single experiment from a single centre might include multiple models. Note also
that origin/expver/system/method form a natural ‘tuplet’ which defines a particular
homogeneous forecast, whose ensemble size is then spanned by
ensemble_member_number. A ‘multi-model’ forecast consists of a collection of
‘tuplets’. Which elements of the tuplet vary between different members of a
multi-model ensemble does not really matter for the processing of the forecast data.
Data from each tuplet are treated as statistically separate; different ensemble members
of a given tuplet are processed together.
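To make the grouping logic concrete, the sketch below (Python; the record structure
and metadata values are illustrative assumptions, not part of the proposal) groups
individual integrations by their defining ‘tuplet’:

    from collections import defaultdict

    # Each record carries the defining metadata for one integration.
    records = [
        {"origin": "ECMWF", "expver": "1", "system": 2, "method": 1, "member": 0},
        {"origin": "ECMWF", "expver": "1", "system": 2, "method": 1, "member": 1},
        {"origin": "UKMO",  "expver": "1", "system": 3, "method": 1, "member": 0},
    ]

    # origin/expver/system/method is the defining 'tuplet'; members sharing a
    # tuplet form one homogeneous ensemble and are processed together.
    ensembles = defaultdict(list)
    for rec in records:
        key = (rec["origin"], rec["expver"], rec["system"], rec["method"])
        ensembles[key].append(rec["member"])

    for tuplet, members in sorted(ensembles.items()):
        print(tuplet, "->", sorted(members))  # each tuplet treated separately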
Although not needed for distribution and archive purposes, it is suggested that
‘comment’ metadata should be mandatory, since this will give a homogeneous dataset
and aid future searching of the data.
The above metadata offer flexibility in describing different experiments, and are
intended to allow fairly straightforward mapping from existing metadata practice in
the global seasonal forecasting community. We strongly request feedback from
producers of seasonal forecasts as to whether the above metadata are adequate.
2. Representation of metadata in CF compliant netCDF
CF compliant netCDF provides a language that can be used for describing the data
content of a file. It does not provide a natural language for describing data
independently of the file in which it is embedded. Further, it does not (as yet) provide
a standard logical structure for describing the data in a given file. For example, a set
of six fields, with specified attributes, could be described with those attributes in
several structurally different ways with CF compliant netCDF. In order to produce
files from different groups which are homogeneous and consistent (and therefore
amenable to straightforward common processing by software) it is necessary to give
very detailed instructions on how the data should be written - the requirements for
IPCC data are an example of this.
There is an argument that the CF convention should be tightened and/or extended to
simplify this process. ECMWF and the ENSEMBLES project are considering
proposing an extension to the CF convention which would remove these ambiguities
for seasonal forecast data. How such a proposal might look is discussed below.
Whether such a proposal will succeed and become part of the CF convention is not
yet known, but comments on the ideas are invited.
CF compliant netCDF mandates or recommends the following global attributes,
which are designed to document the overall nature of the data:
Conventions “CF-1.0”
Title
Institution
Source
History
References
Comment
The above fields are often filled in as lengthy, human-readable strings, sometimes
with multiple pieces of information under one heading. TFSP recommends that
these fields are filled following existing best practice, in a way that clarifies to the
human reader the source and nature of the data. These “human readable” metadata are
intended purely for human consumption, and are not useful for categorizing the data,
since they are unstructured and will be filled in in different ways by different groups.
Example:
Title: Meteo-France seasonal forecast data
Institution: “Model run by Meteo-France. Data processed by ECMWF. Data
distributed by ECMWF.”
Source: “Data generated by Arpege model, run by Meteo-France at ECMWF.”
History:
References: “http://www.ecmwf.int/products/forecasts/seasonal/documentation.html”
Comment: “Part of EUROSIP multi-model forecast system. Use of data subject to
EUROSIP data policy - see web link for details”
Ideally the above would contain more specific information, such as version numbers,
system numbers, resolution etc. However, since data are normally generated
automatically by computer programs, it is hard to include much detail in free-flowing
text of the above sort without the risk that it becomes inaccurate when details
change. Better to be vague than to be wrong.
TFSP recommends that a web link is given which gives access to a full description
of the data, the meaning of experiment identifiers etc, and details of data policy if
required. The use of a web link is much more appropriate than trying to include large
amounts of detail in the netCDF file itself, and also allows relevant information to be
kept up to date. The web link should be given in the global attributes of the file.
Since we are recommending a specific schema or layout for data, it may help if
conformance to this is indicated by a global attribute, particularly in the case that the
meaning of the file structure is not tightly definable by CF compliant standard names.
Thus we propose the global attribute
:schema = "TFSP-1.0"
This also allows a version control on the data layout specification. The string could be
WCRP instead of TFSP, if the JSC would like to adopt our standard for wider use.
We now describe how the machine-readable metadata should be encoded in CF
compliant netCDF. TFSP recommends that any strings used remain as standardized
as possible, ie changes to case, spacing and abbreviations should be avoided. Over a
long period of time, institute names etc are likely to change (eg past changes of NMC
to NCEP), and it will be necessary to provide appropriate documentation of this to aid
data searching.
Outline of proposed CF-compliant netCDF data layout:
dimensions:
    latitude=180, longitude=360, level=10,
    time=184, initial_time=20,
    forecast_number=5, ensemble_member_number=10,
    string_max=16;
variables:
    float latitude(latitude);
    float longitude(longitude);
    float level(level);
    double time(time);
        time:units="days";
        time:standard_name="time";
        time:long_name="time elapsed since the start of the forecast";
    double initial_time(initial_time);
        initial_time:units="days since 1900-01-01 00:00:00.0";
        initial_time:standard_name="forecast_reference_time";
    int forecast_number(forecast_number);
    char originating_centre(forecast_number,string_max);             (A)
    char experiment_identifier(forecast_number,string_max);          (A)
    int forecast_system_version_number(forecast_number);             (A)
    int forecast_method_number(forecast_number);                     (A)
    char production_status(forecast_number,string_max);              (B)
    char model_identifier(forecast_number,string_max);               (B)
    char original_distributor(forecast_number,string_max);           (B)
    char sst_specification(forecast_number,string_max);              (B)
    int ensemble_member_number(ensemble_member_number);
    float field(forecast_number,initial_time,ensemble_member_number,
                time,level,latitude,longitude);
    char real_time(forecast_number,initial_time,
                   ensemble_member_number,1);                        (C) (T or F)
    char archive_date(forecast_number,initial_time,
                      ensemble_member_number,string_max);            (C)
Here we have chosen to code all of the metadata in the form of variables rather than
either global attributes (which are a file-based concept, and would restrict which data
could be served in a single file) or attributes of variables (which only works if the
attribute has a single value for all the relevant data in the file). The choice to use
variables fits with the philosophy of CF, and allows more flexibility when used with
appropriate applications, but can make datasets a little more awkward to use with
those applications which do not like multi-dimensional datasets.
Note our use of the standard names “time” and “forecast_reference_time” and
associated time units to define the two time axes of a multi-dimensional dataset. We
believe this is an appropriate way to code forecast data in CF compliant netCDF, even
though our ‘time’ units do not match the specification used for the IPCC data.
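For example (a minimal sketch; the file name is hypothetical and the Python netCDF4
library is just one possible reader), the validity date of a forecast step can be
recovered by combining the two time axes:

    from datetime import timedelta
    from netCDF4 import Dataset, num2date

    nc = Dataset("tfsp_example.nc")  # hypothetical file following the layout above
    start = num2date(nc["initial_time"][0], nc["initial_time"].units)
    # 'time' holds days elapsed since the forecast_reference_time
    valid = start + timedelta(days=float(nc["time"][10]))
    print("start:", start, "validity:", valid)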
The “forecast_number” dimension is here given its own forecast_number variable, to
make explicit the fact that it is the defining variable for the data within a given file,
and that the other similarly-dimensioned variables (originating_centre through to
sst_specification) are auxiliary variables within the meaning of the CF convention.
The forecast_number is essentially a dummy variable that simply indexes the
forecast-defining ‘tuplets’ within the file. Possibly it could be omitted from the
netCDF file.
The “ensemble_member_number” is the other independent dimension within the
netCDF file. If a multi-model ensemble is coded in a single file, and the ensemble
sizes vary between the models, then the netCDF file will be larger in size than is
strictly necessary to code the data, because it will reserve space for the same ensemble
size for all the forecast_number models. This is considered a tolerable state of affairs.
If instead of dimensioning the data with forecast_number, we used multiple
dimensions consisting of each of the defining metadata, then we could easily end up
with very large sparse files when coding multi-model data. The defining metadata
‘tuplets’ are not natural hypercubes.
The auxiliary variables labelled (A) are the defining metadata, while those labelled
(B) contain comment information. There is an issue as to how to code the comment
metadata which is valid for each forecast integration separately (ie whether it was
made in real time, and the date-stamp of the data). Here we simply supply it in the
form of appropriately dimensioned variables real_time and archive_date. There is
nothing in the netCDF file to say that these are metadata describing the field variable.
However, it is just possible that these variables could be considered ancillary
variables to the field, within the meaning of the CF convention, despite the difference
of dimension. In this case, we could add the ‘ancillary_variables’ attribute to the field
variable, to make the link explicit.
Note that the above proposal is CF compliant, in that it does not introduce any new
standard_name attributes for variables. However, if we want to standardize the usage
that we propose here, such that application software can unambiguously interpret data
files that follow our proposal, then it would be desirable to ask for the CF convention
to be extended to cover our usage. Such a request might be made separately for the
simple concept of “ensemble_member_number” (necessary for any sort of ensemble
forecast to be represented, and presumably not controversial as a concept) and the
more complex forecast_number “tuplet” needed to represent multi-model forecast
data. If CF approval were to be given, the above layout of variables would be
unchanged, but they would each be given standard_name attributes to allow
unambiguous processing of the data. In the absence of CF approval, we could ask that
equivalent “TFSP_standard_name” attributes be set in the netCDF file instead.
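As a concrete illustration of the proposed layout, the following sketch creates such a
file using the Python netCDF4 library (an assumption: any netCDF API would do, and
the values written are placeholders, not a reference implementation):

    import numpy as np
    from netCDF4 import Dataset, stringtochar

    nc = Dataset("tfsp_example.nc", "w")
    nc.Conventions = "CF-1.0"
    nc.schema = "TFSP-1.0"

    # Dimension sizes are illustrative only
    for name, size in [("latitude", 180), ("longitude", 360), ("level", 10),
                       ("time", 184), ("initial_time", 20),
                       ("forecast_number", 5), ("ensemble_member_number", 10),
                       ("string_max", 16)]:
        nc.createDimension(name, size)

    time = nc.createVariable("time", "f8", ("time",))
    time.units = "days"
    time.standard_name = "time"
    time.long_name = "time elapsed since the start of the forecast"

    itime = nc.createVariable("initial_time", "f8", ("initial_time",))
    itime.units = "days since 1900-01-01 00:00:00.0"
    itime.standard_name = "forecast_reference_time"

    # Defining metadata (A) as auxiliary variables on the forecast_number axis
    origin = nc.createVariable("originating_centre", "S1",
                               ("forecast_number", "string_max"))
    origin[0] = stringtochar(np.array(["ECMWF"], dtype="S16"))[0]
    # ... remaining (A), (B) and (C) variables are created the same way ...

    field = nc.createVariable("field", "f4",
                              ("forecast_number", "initial_time",
                               "ensemble_member_number", "time",
                               "level", "latitude", "longitude"))
    nc.close()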
Examples of forecast defining “tuplets” and associated comment metadata:
originating_centre: COLA
experiment_identifier: expt_id (as assigned by COLA)
forecast_system_version_number: 5 (as assigned by COLA)
method_number: 1
production_status: research
model_identifier: CFS
original_distributor: COLA
sst_specification: coupled
originating_centre: IRI
experiment_identifier: 1 (recommended convention for operational systems; numbers
>1 for testing, non-numbers for research etc)
forecast_system_version_number: 1, 2 and 3 (for CCM3.2, ECHAM3.6, MRF9, as
would be documented by IRI on their website)
method_number: 1
production_status: operational
model_identifier: CCM3.2, ECHAM3.6, MRF9
original_distributor: IRI
sst_specification: predicted
For operational centres, each forecast model system is assigned a unique ‘version’
number (for example, in the order in which the models were introduced) and new
models and/or model versions get a new system version number. (Should we
recommend that an older forecast system will always have a smaller version number
than a newer one?) In research mode, however, typically the experiment_identifier
will be used to distinguish experiments with different forecast systems. If a research
user wants to use the system_version_number to distinguish between experiments
with different models and/or major new versions of a model, that is fine, but it is not
mandatory.
originating_centre: Met Office
experiment_identifier: 1
forecast_system_version_number: 3
method_number: 1
production_status: operational
model_identifier: HADCM3
original_distributor: ECMWF
sst_specification: coupled
For operational centres, it is recommended that a new system number should be given
whenever changes are sufficient to result in a new set of back integrations. For
example, switching to a new source of SST data in a real-time forecast system would
not trigger a new system number if the original back integrations continue to be used.
Even changes such as this should be documented on the centre’s web page, though.
originating_centre: ECMWF
experiment_identifier: common_ENSEMBLES_expt_id_1
forecast_system_version_number: 1 or 2
method_number: 1
production_status: ENSEMBLES
model_identifier: IFS/HOPE or IFS/OPA
original_distributor: ECMWF
sst_specification: coupled
originating_centre: Met Office
experiment_identifier: common_ENSEMBLES_expt_id_1
forecast_system_version_number: 1 or 2
method_number: 1-9 (for a 9 member ensemble with perturbed parameters)
production_status: ENSEMBLES
model_identifier: HADCM3 or HADGEM
original_distributor: ECMWF
sst_specification: coupled
For coordinated experiments such as ENSEMBLES, common experiment identifiers
might be agreed. Since the ‘production_status’ is not part of the defining metadata in
this schema, a small amount of care is needed to ensure that the experiment_identifier
does not end up overwriting other operational or research data.
3. Recommended file structure for data exchange
A convention on how seasonal forecast data should be described within a netCDF file
is still insufficient to describe the format in which data should be made available for
exchange. This is because large datasets can be placed into netCDF files in many
ways. In principle, well written software should be able to extract the information
from any set of conforming netCDF files, and place it in a standard archive file format
of the receiving centre’s choosing. However, it will make life easier for us all if we
agree a recommended file structure for data exchange. The files would be suitable for
archive without further processing (although if someone wants to store the data
differently, of course they are free to do so). If the data are served via OPeNDAP, then
the file structure is dictated in large part by the request. In this case, it is sufficient to
ensure that files of the recommended structure can be served by the OPeNDAP server.
Proposal: The seasonal forecast data should be provided in files, each of which
should contain data from a single forecast (ie a single originating centre, a single
experiment identifier, a single forecast system version, and a single method), a single
initial date, a single ensemble member, and a single field.
To simplify and quicken the handling of files, the file names should encode the
defining metadata in the following way:
ORIGIN_EXPVER_SYSVER_METHOD.YYYYMMDD.NNNN.field.nc
where ORIGIN, EXPVER, SYSVER and METHOD are 6 character strings,
YYYYMMDD is the initial date, NNNN is the ensemble number and ‘field’
represents the physical variable archived in the file. [details TBD].
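A sketch of how such names might be generated (Python; since the details are still
TBD, the truncation, separator and zero-padding rules here are assumptions):

    def tfsp_filename(origin, expver, sysver, method, yyyymmdd, member, field):
        # ORIGIN, EXPVER, SYSVER and METHOD limited to 6 characters each
        parts = [str(p)[:6] for p in (origin, expver, sysver, method)]
        return "{}_{}_{}_{}.{}.{:04d}.{}.nc".format(*parts, yyyymmdd,
                                                    member, field)

    print(tfsp_filename("ecmwf", "1", 2, 1, "20051101", 3, "t2m"))
    # -> ecmwf_1_2_1.20051101.0003.t2m.nc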
4. Additional recommendations
The experience of PCMDI in handling the latest IPCC data is worth considering. As
well as specifying the metadata content, a realization of the metadata in CF compliant
netCDF and rules on how the data were to be put into files, they also provided some
additional specifications. (See
http://www-pcmdi.llnl.gov/ipcc/IPCC_output_requirements.htm for their full
specification.) Do we want to follow any of these, or make our own equivalent rules?
- Files no more than 2Gb in size.
- Data must be gridded using a product of two Cartesian axes.
- Atmosphere data must be on standard pressure levels (exception for cloud)
- Ocean fields must be on depth levels, recommended to use standard depths
- Output fields to be single precision floating point
- Should variable names in the netCDF files bear the same relationship to the CF
standard names as mandated in the IPCC tables?
9
- Are the IPCC requirements on sign conventions and order of array dimensions
already covered by the CF standard?
- Double treatment of missing_value and fill_value, to help old software (or do we
just stick with the up-to-date CF way of doing things?)
- (Recommended) original_name attribute
- (Recommended) variable-specific history attribute
- (Recommended) original_units, long_name and comment attributes
Coordinate variables:
- Must be specified in double precision
- time unit to be “days since [basetime]”, where [basetime] is user supplied
Global attributes:
- institution: both abbreviation and full name and location
- source: specific instructions in building a long string
- project_id: “IPCC fourth assessment”
- realization: (an integer, specifying which ensemble member)
- experiment_id: one of a set of specified strings, corresponding to the project
- (Recommended) contact: name and contact info eg email or telephone number
5. OPeNDAP
The details of OPeNDAP implementation may need to be discussed. It is desirable to
provide aggregation server functionality, so that a data request can be met by
extracting data from multiple files - for example, all the ensemble members from a
given forecast start date, or a multi-model dataset from a single start date. To do this,
the aggregation server will have to be configured to handle the multi-dimensional data
structure proposed above, and serve data in this form.
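For instance (a hedged sketch: the server URL and aggregation name are hypothetical,
and it assumes a netCDF4 library built with OPeNDAP/DAP support), a client could
subset such an aggregated dataset remotely:

    from netCDF4 import Dataset

    # Hypothetical OPeNDAP aggregation serving the multi-dimensional layout
    url = "http://data.example.int/dods/tfsp/seasonal_aggregation"
    nc = Dataset(url)

    # Only the requested slice crosses the network: here, all ensemble members
    # of the first forecast and start date, at one time step and level
    subset = nc["field"][0, 0, :, 0, 0, :, :]
    print(subset.shape)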
In general, it is desirable to archive the data on their native grids. However, it is also
desirable that the data can be served on specified regular grids, to allow easy
intercomparison and multi-model calculations to be made. For typical atmosphere
model grids, it may be feasible to obtain sub-area extraction and regridding to
user-specified lat/long grids from existing software. For ocean grids, this is likely to
be more difficult. We may need to recommend that ocean data are supplied on regular
lat/long grids.
There is perhaps an issue of data security and integrity associated with distributed
data systems. We will need to ensure that one centre is not able to accidentally
“overwrite” data from another centre, by releasing data with the wrong metadata.
How do we ensure that something purporting to be an ECMWF forecast is not in fact
a forgery?
6. Data volumes
To be estimated. The data volumes for the full output specified in Trieste are quite
large. Although there is undoubted benefit in high frequency data for at least some
model runs (to allow study of high frequency processes) there is unlikely to be much
demand for full temporal resolution data for all ensemble members and all start dates
for all experiments.
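For a rough illustration, using the (arbitrary) dimension sizes of the example layout in
section 2 (a 1x1 degree grid, 10 levels, 184 time steps, 10 ensemble members and 20
start dates), a single-precision field occupies about

    360 x 180 x 10 x 184 x 4 bytes ≈ 0.48 GB per member per start date,

ie roughly 95 GB per variable for the full experiment, before any thinning of levels,
variables or temporal resolution.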
7. Available capacity
To be assessed. It is important to make progress on this, to ensure that as a minimum
all data producers in TFSP have somewhere they can send their data to.
8. Quality control procedures
Every data serving centre should have data quality control procedures in place. If data
is acquired from elsewhere, ideally there should be some form of sanity checking /
human visual inspection to check that the data look OK (ie not zeros or garbage, at
least).
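A minimal sketch of such a check (Python; the file name and the checks themselves
are illustrative assumptions, not a proposed standard):

    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset("incoming_file.nc")  # hypothetical incoming file
    data = nc["field"][:]             # may be a masked array

    if np.ma.is_masked(data) and data.mask.all():
        print("WARNING: field is entirely missing")
    else:
        print("min/mean/max:", float(data.min()), float(data.mean()),
              float(data.max()))
        if float(data.std()) == 0.0:
            print("WARNING: field is constant (eg all zeros)")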
There should be an email address or other problem-reporting mechanism, so that users
of the data can report any problems. Possibly this could be provided in association
with the original_distributor metadata.
There should be some mechanism (as a minimum a web-viewable page of ‘incidents’)
for reporting to users any known problems or failures.
9. Relationship to other data strategies
a. Operational Met. services, THORPEX/TIGGE and the WMO Information
System.
Operational Met. services exchange model data in GRIB, and the advent of GRIB
edition 2 increases the ability of the GRIB standard to handle data in ways that TFSP
may require. Nonetheless, there is a strong consensus within CLIVAR that netCDF is
a much preferred format when it comes to ease of use and acceptance by the research
community. Since it is this research community that will be creating and in particular
analysing the data from TFSP, it is clear that the initial data exchange should be in
netCDF.
The advent of TIGGE (THORPEX Interactive Grand Global Ensemble), an
experimental real-time multi-model medium range forecasting system being created
by the major NWP centres around the world, has given operational centres a need to
exchange multi-model forecast data. They have responded to this by setting up a data
committee, and will implement an initial system which exchanges data in GRIB2, and
has three global centres which will provide a parallel archive of the data (NCAR,
ECMWF and CMA). Data exchange will be either via ftp or via more advanced
software such as IDD/LDM.
Although there are important differences between the TIGGE data exchange and that
envisioned for TFSP (real-time, large-volume exchange vs delayed-mode transfer;
GRIB2 vs netCDF; operational vs research community), there is the opportunity to
coordinate certain aspects in a way that facilitates both present-day interoperability
and possible future convergence of operational and research data systems. A key part
of the design of both systems is the required metadata content. If an explicit
isomorphism could be made between the TFSP proposal and the developing
operational application of GRIB2 to multi-model forecasting, this would
substantially aid automatic translation between the formats used, and make it easy to
use common underlying data systems for both projects. On the TFSP side, such a
solution would maintain the advantages of netCDF for the research community
(familiarity in the community; human readability, CF compliance), while introducing
some advantages of WMO operational codes (in particular, unambiguous processing
and archiving). In fact, a complete isomorphism does not appear to be easily achieved
in the short term. The defining metadata needed by TFSP go beyond those proposed by
TIGGE, and TIGGE does not appear to envisage the useful ‘comment’ metadata
which we discuss here. There are also issues with finding a flexible but unambiguous
method of identifying research originating centres. Nonetheless, the proposal outlined
here already attempts to harmonize some of the metadata language, and since both
TFSP and TIGGE envisage future evolution of their data systems, there is certainly
scope for cooperation. TFSP and other elements of WCRP will strive to collaborate
with TIGGE and the operational WMO community.
The XIV WMO Congress in 2003 mandated the creation of a new WMO Information
System, which will provide “a single coordinated global infrastructure for the
collection and sharing of information in support of all WMO and related international
programmes”. The FWIS (as it is generally known) is still largely a concept, although
work on prototype systems has started in Europe. The aim appears to be for a system
that can handle various data formats (including netCDF) and use various networks,
including the internet. The FWIS is no help to us at the moment, but at some point in
the future may provide powerful tools for exchanging and archiving data for WCRP
projects. Our initial strategy is to use existing tools (OPeNDAP, possibly ftp) to
transfer and serve the model data, because of the need to have something working
immediately.
b. The academic and research community (WMP, WGCM, PCMDI, …)
The academic and research community are almost universally at home with using
netCDF data, hence the strong requirement to make data available in netCDF. Beyond
the CF standard, there is not a strong tradition of experiment-defining metadata. The
additional dimensionality of the data can cause awkwardness with some netCDF
analysis applications. Nonetheless, it is expected that the proposed metadata will not
cause any major problems. Regardless of the data format, software will need to be
further developed to fully analyze multi-model ensemble forecasts.
WMP (WCRP Modelling Panel), at its first meeting in October 2005, commissioned a
white paper on data management issues within COPES (of which TFSP is a part).
This has been produced, and provides a suggested outline for the data systems that
will be needed by 2015. As it happens, their vision is very compatible with our
proposal here. In particular, they envisage distributed data systems (with provision for
groups who don’t want to serve their own data), and stress the importance of metadata
standards. They suggest that WMP should consider the adoption of standards and
conventions that establish requirements for coordinated experiments, which is pretty
much what we are providing here for TFSP.
WGCM are also devoting effort to data issues, and have proposed that a new CF
oversight panel should be set up under their auspices. TFSP needs to ensure that our
efforts are coordinated with those of WGCM and WMP, to build a proper pan-WCRP
approach to data handling.
PCMDI have much experience of handling large multi-institutional datasets for
internationally coordinated experiments. In particular, at the request of WGCM, they
have created an organised system for the handling of data from experiments for the
IPCC fourth assessment report. Although we envisage a shared rather than centralized
data service, and although the specifics of the metadata requirements for TFSP are
somewhat different from those for IPCC, what we have outlined can be viewed as
following in the footsteps of what PCMDI provided for the fourth assessment report.
Points for discussion
We need to decide on maximum string lengths for the metadata. Should they be
imposed at all? (It makes file handling much easier if file names have a fixed format).
Should they be relatively short (eg 6 chars, nicer file names) or relatively long
(allowing data producers more freedom in choosing names)? Do we allow strings
containing blanks, and how do we then handle the file names?
Is the proposed ‘tuplet’ for defining the forecasts adequate and appropriate? In the
ECMWF internal archive something equivalent to ‘production_status’ is part of the
defining metadata, and is not just descriptive. This appears to be unnecessary, but it
would give more flexibility in the assignment of experiment identifiers. Is the strategy
of minimizing the number of ‘tuplet’ components a good one?
We need to have good provision for those who will not serve their own data, and
identification of willing ‘hosts’ is an early priority.
References:
NCAR CF pages:
http://www.cgd.ucar.edu/cms/eaton/cf-metadata/index.html
http://www.cgd.ucar.edu/cms/eaton/cf-metadata/conformance-req.html
BADC pages on CF, with discussion and examples:
http://badc.nerc.ac.uk/help/formats/netcdf/index_cf.html
http://badc.nerc.ac.uk/help/formats/netcdf/cf_examples.html
White paper on future of CF:
http://www.cgd.ucar.edu/cms/eaton/cf-metadata/CF2_Whitepaper_PublicDraft01.pdf
PCMDI pages on IPCC data handling:
http://www-pcmdi.llnl.gov/ipcc/about_ipcc.php
http://www-pcmdi.llnl.gov/ipcc/IPCC_output_requirements.htm
WMP report on data issues within WCRP and COPES:
http://copes.ipsl.jussieu.fr/Organization/COPESStructure/Reports/WMP1/Report_Kinter_TaylorReport.pdf