
Data Product Configuration Management and Versioning in Large-Scale Production of Satellite Scientific Data


Bruce R. Barkstrom

Head, Atmospheric Sciences Data Center

NASA Langley Research Center

Hampton, VA 23681-2199

Large-scale scientific data production for NASA’s Earth observing satellite instruments presents a number of interesting problems for configuration management and versioning.

In several instances, this activity involves production of tens of thousands of files per day from tens or hundreds of different data sources. The production flows themselves may be highly complex, with high-volume external data ingest.

Typically, the data producers divide the parameters they create from the observations into different types of files, each of which may contain one or more measurement types or parameters. These file types are typically known as Data Products. The producers categorize them according to their closeness to the original data source (a code sketch of this taxonomy follows the list):

Level 0 data being the ‘raw’ data from the satellite

Level 1 data being calibrated and geolocated, keeping the original sampling pattern

Level 2 data being converted into geophysical parameters but still with the original sampling pattern

Level 3 data being resampled, averaged over space, and interpolated/averaged over time
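
As a minimal sketch, this taxonomy can be captured in a small Python enumeration. The enum values and comments are illustrative assumptions, not an official NASA schema:

```python
from enum import IntEnum

class ProcessingLevel(IntEnum):
    """Closeness of a Data Product to the original satellite data source."""
    LEVEL_0 = 0  # 'raw' data from the satellite
    LEVEL_1 = 1  # calibrated and geolocated, original sampling pattern kept
    LEVEL_2 = 2  # geophysical parameters, still on the original sampling pattern
    LEVEL_3 = 3  # resampled, spatially averaged, temporally interpolated/averaged

# Higher levels are derived from (and therefore depend on) lower ones:
assert ProcessingLevel.LEVEL_2 > ProcessingLevel.LEVEL_1
```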

In many cases, the scientists producing these data intend the output of this production effort to be collections of files that represent ‘homogeneous’ observations of the Earth – even though there are several instruments on different satellites generating the original measurements. For configuration management, this means that the data producers distinguish Data Sets within a Data Product – with each Data Set having a distinct measurement source. Within the Earth Observing System, the data producers thus have Data Sets arising from instruments on the Terra satellite and from instruments on the Aqua satellite.

In this context, a Data Set would be regarded as being further subdivided into Data Set Versions, for which the algorithms and input parameters were constrained to produce data with ‘homogeneous’ uncertainties. Roughly, ‘homogeneity’ within a Data Set Version means that the algorithms in the source code are kept fixed for production of the version and that the input parameters are organized to ensure that the parameters in the version’s files have similar error properties.

The data producers can also distinguish production Variants within a Version. The files in a particular Variant will be produced by a constant production configuration. By this we mean that within a Variant, the operating system, compiler, scripts, and related sources of variation are held constant. From the standpoint of data users, Variants within a Data Set Version are scientifically indistinguishable – whereas Data Set Versions within a Data Set are distinguishable in important ways.
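
One way to picture this four-level hierarchy is as nested record types. A sketch in Python dataclasses follows; the field names and the example names (ERB_Fluxes, Edition2, etc.) are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Variant:
    """Files produced under one constant production configuration:
    operating system, compiler, scripts, etc. held fixed."""
    label: str
    os: str = ""
    compiler: str = ""

@dataclass
class DataSetVersion:
    """Algorithms and input parameters fixed -> 'homogeneous' uncertainties.
    Variants within a Version are scientifically indistinguishable."""
    name: str
    variants: list[Variant] = field(default_factory=list)

@dataclass
class DataSet:
    """One distinct measurement source, e.g. an instrument on Terra or Aqua."""
    source: str
    versions: list[DataSetVersion] = field(default_factory=list)

@dataclass
class DataProduct:
    """A file type containing one or more measurement types or parameters."""
    name: str
    level: int
    data_sets: list[DataSet] = field(default_factory=list)

# Hypothetical example of the nesting:
erb = DataProduct("ERB_Fluxes", level=2, data_sets=[
    DataSet("Terra", versions=[DataSetVersion("Edition2")]),
    DataSet("Aqua",  versions=[DataSetVersion("Edition1")]),
])
```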

This hierarchy of file collections (Data Product, Data Set, Data Set Version, Data Set Version Variant) has interesting implications for configuration management. If we were dealing with printed texts, a version might be called an edition, with a large number of essentially identical copies of the text being contained in a version. If we were dealing with source code for a software product, a version typically denotes a fixed text which is compiled and linked into an executable object file from which a large number of essentially identical copies are made available to consumers. Our familiarity with software versioning leads to an expectation that versions have a sequential character, so that they can be indexed by a sequential numbering system. Practical experience leads to a requirement to track major versions and minor ones, typically by using a multi-position notation, such as 1.1, 2.3.1, etc.

In contrast, the files within a Version of these Earth science Data Products are each distinct. Furthermore, while software versioning requires tracking primarily changes in the source code, data product versioning requires tracking changes in source code, in input files of predecessor data products, and in input files of algorithm coefficients.

Changes in any of these items can induce scientifically important changes in the data within a Data Set Version. Thus, a noticeable change in the calibration coefficients can produce not only a new Version of a Level 1 Data Set, but also force the producer to create new Versions of the Level 2 and Level 3 Data Sets – even if the source code is completely unchanged for the subsequent data production.
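
One way to picture this cascade is a dependency graph in which a change to any input of a Data Set (code, predecessor product, or coefficient file) forces new Versions of every Data Set downstream of it. A hedged sketch, with invented Data Set names:

```python
# Downstream dependencies between Data Sets (names invented for illustration).
downstream = {
    "Level1_Radiances": ["Level2_Fluxes"],
    "Level2_Fluxes": ["Level3_MonthlyMeans"],
    "Level3_MonthlyMeans": [],
}

def data_sets_needing_new_versions(changed: str) -> list:
    """A calibration-coefficient change at `changed` forces new Versions of
    every Data Set reachable downstream, even with unchanged source code."""
    needs, stack = [], [changed]
    while stack:
        node = stack.pop()
        needs.append(node)
        stack.extend(downstream[node])
    return needs

print(data_sets_needing_new_versions("Level1_Radiances"))
# ['Level1_Radiances', 'Level2_Fluxes', 'Level3_MonthlyMeans']
```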

As a practical matter, the hierarchical nature of this approach to organizing file collections has several implications for configuration management.

First, it encourages a hierarchical approach to data system design. A data producer can start by identifying the Algorithm Collections that produce one data product from predecessor products. This connectivity is easy to represent as a Data Flow Diagram. At the next stage in design, the producer can particularize each Algorithm Collection into a Subsystem Design Template (SDT), adding the input Coefficient Files and output Quality Control (QC) data needed for production monitoring. At this stage, the producer can also identify natural groupings of SDT’s that form Production Flow Templates (PFT’s). A PFT can be thought of as a graph whose nodes are data products and whose arcs are SDT’s. The SDT’s also serve as templates for more detailed code design. This fact allows the data producer to estimate the amount of software development at an early stage of the design.
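
Under that description, a PFT is representable as a small adjacency structure whose nodes are data products and whose arcs each carry an SDT. A sketch with hypothetical product and SDT names:

```python
# A Production Flow Template as a graph: nodes are data products,
# arcs are Subsystem Design Templates (all names here are hypothetical).
pft_arcs = [
    # (input products,        SDT name,               output products)
    (("Level0_Packets",),     "Geolocate+Calibrate",  ("Level1_Radiances",)),
    (("Level1_Radiances",),   "Invert_to_Fluxes",     ("Level2_Fluxes",)),
    (("Level2_Fluxes",),      "Time_Space_Average",   ("Level3_MonthlyMeans",)),
]

def products(arcs):
    """Collect the node set (data products) of the PFT graph."""
    nodes = set()
    for inputs, _sdt, outputs in arcs:
        nodes.update(inputs)
        nodes.update(outputs)
    return nodes

print(sorted(products(pft_arcs)))
```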

At the next stage, the data producer needs to identify the Data Sets to be produced. From a design standpoint, this identification leads to specifying a Subsystem Design that inherits the connectivity provided by the parent SDT. Just as a Data Set inherits the properties of a Data Product, so the Subsystem Coefficient files and Subsystem QC data must inherit properties of Coefficient files and QC data. In some cases, a data producer can use a single Subsystem Design to create all the Data Sets of a given Data Product. In others, the producer will need separate Subsystem Designs for each Data Set. The former possibility applies, for example, if the Subsystem converting Level 0 data to calibrated and geolocated data is working on data coming from a single instrument design. The latter possibility applies if a Subsystem is working on data from several different kinds of instruments, as might occur when imagers on different spacecraft are used for identifying clouds that condition the interpretation of data from an instrument with coarser spatial resolution.
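
The inheritance described here can be expressed directly as class inheritance: the SDT fixes the connectivity, and each Subsystem Design binds instrument-specific Coefficient Files and QC outputs to it. A sketch under that assumption, with hypothetical file and class names:

```python
class SubsystemDesignTemplate:
    """Fixes connectivity: which Data Products flow in and out."""
    input_products = ("Level0_Packets",)        # hypothetical product names
    output_products = ("Level1_Radiances",)

class TerraCalibrationSubsystem(SubsystemDesignTemplate):
    """A Subsystem Design: inherits the SDT's connectivity but binds
    instrument-specific Coefficient Files and QC outputs."""
    data_set = "Terra"
    coefficient_files = ["terra_gains_v2.dat"]  # hypothetical
    qc_outputs = ["terra_cal_qc.rpt"]

class AquaCalibrationSubsystem(SubsystemDesignTemplate):
    data_set = "Aqua"
    coefficient_files = ["aqua_gains_v1.dat"]
    qc_outputs = ["aqua_cal_qc.rpt"]
```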

Second, this specialization of design can continue to the point at which production control is dealing with individual files. At this atomistic level, each ‘job’ can be specified as a Production Graph that identifies a list of input files, the executable objects that act on the input files, and a list of output files. Because the number of files required for some jobs, such as producing monthly averages from daily or hourly observations, can be very large, it may be necessary to create ‘staging jobs’ that move data from tertiary storage to disk and ‘destaging jobs’ that do the inverse. Data producers can then identify ‘job collections’ that are naturally passed to dispatchers for detailed scheduling.
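
A minimal sketch of such a job specification, assuming hypothetical file and executable names:

```python
from dataclasses import dataclass, field

@dataclass
class ProductionGraphJob:
    """One atomistic 'job': inputs, the executables acting on them, outputs."""
    input_files: list
    executables: list
    output_files: list
    # Staging/destaging companions move files between tertiary storage and
    # disk when the input list is too large to keep resident.
    staging_jobs: list = field(default_factory=list)
    destaging_jobs: list = field(default_factory=list)

# Hypothetical monthly-average job built from thirty daily files:
daily = [f"flux_day_{d:02d}.hdf" for d in range(1, 31)]
job = ProductionGraphJob(
    input_files=daily,
    executables=["monthly_mean.exe"],
    output_files=["flux_month_06.hdf"],
    staging_jobs=[f"stage {f}" for f in daily],
    destaging_jobs=[f"destage {f}" for f in daily],
)
```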

Third, this approach leads naturally to data provenance tracking for individual files by assembling the predecessor Production Graphs into a Provenance Graph that identifies all of the processes and input files that generated the data in the file of interest.
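
A sketch of that assembly, assuming production control records which job produced each file (all names hypothetical): provenance is then a walk backward over predecessor jobs, with ingested files simply having no producing job.

```python
# Map each file to (executables, input files) of the job that produced it.
produced_by = {
    "flux_month_06.hdf": (["monthly_mean.exe"], ["flux_day_01.hdf"]),
    "flux_day_01.hdf":   (["invert.exe"],       ["rad_day_01.hdf"]),
    "rad_day_01.hdf":    (["calibrate.exe"],    ["packets_day_01.dat"]),
}

def provenance(target: str, graph=None):
    """Assemble the Provenance Graph: every process and input file that
    contributed to `target`, by walking predecessor Production Graphs."""
    graph = {} if graph is None else graph
    if target in produced_by and target not in graph:
        executables, inputs = produced_by[target]
        graph[target] = (executables, inputs)
        for f in inputs:
            provenance(f, graph)
    return graph

for f, (exes, ins) in provenance("flux_month_06.hdf").items():
    print(f, "<-", exes, ins)
```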

Finally, the structure we have suggested leads to a convenient separation between the data structure needed to describe the properties of file collections and the structure needed to describe the summary properties (metadata) for each file. In other words, the structure that describes the file collections is small enough that it can be readily distributed to many sites, even if the file metadata is large enough to represent significant difficulties for practical redistribution (some of the NASA EOS data production scenarios can create two million files per instrument-month of production). In addition, the hierarchical relationship between Data Products, Data Sets, Data Set Versions, and Data Set Version Variants also makes it easy to present these collections in the familiar TreeView format used in presenting directory and file structures in commercial operating systems.
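
Because the collection-level structure is this small, rendering it in TreeView style takes only a few lines. A sketch with hypothetical collection names:

```python
# Collection hierarchy only (no per-file metadata), small enough to ship anywhere.
collections = {
    "ERB_Fluxes (Data Product)": {
        "Terra (Data Set)": {
            "Edition2 (Version)": {"SGI-IRIX (Variant)": {}, "Linux-gcc (Variant)": {}},
        },
        "Aqua (Data Set)": {"Edition1 (Version)": {}},
    },
}

def tree(node: dict, indent: int = 0):
    """Print collections in the TreeView style of a file browser."""
    for name, children in node.items():
        print("    " * indent + "+- " + name)
        tree(children, indent + 1)

tree(collections)
```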

The approach suggested here leads naturally to Version names that express causal rationales for the differences between one Version and the next, rather than a numerical sequence of values. It is particularly important to note that such a naming approach should improve the ability of the data producer to point users to similar versions from different data sources, a particularly useful property for long-term observational data sets.
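
A minimal sketch of such a causal naming scheme, with name components invented for illustration:

```python
def causal_version_name(predecessor: str, cause: str) -> str:
    """Record *why* this Version differs from its predecessor,
    rather than just incrementing a counter."""
    return f"{predecessor}_{cause}"

# Users comparing Terra and Aqua Data Sets can spot analogous versions
# by a shared cause, e.g. the same calibration-coefficient update:
print(causal_version_name("Edition2", "CalCoefUpdate"))
```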
