Large Synoptic Survey Telescope (LSST)
Data Management Middleware Design

Kian-Tat Lim, Ray Plante, Gregory Dubois-Felsmann

LSE-152
July 25, 2011

Change Record

Version 1.0 (7/25/2011), owner Kian-Tat Lim: Initial version based on pre-existing UML models and presentations.

Table of Contents

Change Record
1 Executive Summary
2 Introduction
3 02C.06.02.01 Database and File Access Services
   3.1 Data Access Framework I/O Layer
4 02C.07.01.01 Control and Management Services
   4.1 Event Subsystem
   4.2 Orchestration
   4.3 Data Management Control System
5 02C.07.01.02 Pipeline Construction Toolkit
   5.1 Policy Framework
   5.2 Pipeline Harness
6 02C.07.01.03 Pipeline Execution Services
   6.1 Logging Subsystem
7 02C.07.01.07 File System Services
   7.1 Data Access Framework Replication Layer

1 Executive Summary

The LSST middleware is designed to isolate scientific applications, including the Alert Production, Data Release Production, Calibration Products Production, and Level 3 processing, from details of the underlying hardware and system software. It enables flexible reuse of the same code in multiple environments ranging from offline laptops to shared-memory multiprocessors to grid-accessed clusters, with a common communication and logging model. It ensures that key scientific and deployment parameters controlling execution can be easily modified without changing code, while retaining full provenance that records the environment and parameters used to produce any dataset. It provides flexible, high-performance, low-overhead persistence and retrieval of datasets, with data repositories and formats selected by external parameters rather than hard-coded. Middleware services enable efficient, managed replication of data over both wide area and local area networks.
2 Introduction

This document describes the baseline design of the LSST data access and processing middleware, including the following elements of the Data Management (DM) Construction Work Breakdown Structure (WBS):

- 02C.06.02.01 Database and File Access Services
- 02C.07.01.01 Control and Management Services
- 02C.07.01.02 Pipeline Construction Toolkit
- 02C.07.01.03 Pipeline Execution Services
- 02C.07.01.07 File System Services

The LSST database design, which contributes to WBS element 02C.06.02.01 and other elements within 02C.06.02, may be found in the document entitled "Data Management Database Design" (LDM-135). WBS element 02C.07.04 is described in "LSST Cybersecurity Plan" (LSE-99). 02C.07.05 (visualization) and 02C.07.06 (system administration) are primarily low-level, off-the-shelf tools and are not described further here. 02C.07.08 (environment and tools) includes similar off-the-shelf tools as well as testbeds and other primarily-hardware elements.

Figure 1. Data Management System Layers.

Common to all aspects of the middleware design is an emphasis on flexibility through the use of abstract, pluggable interfaces controlled by managed, user-modifiable parameters. In addition, the substantial computational and bandwidth requirements of the LSST Data Management System (DMS) force the designs to be conscious of performance, scalability, and fault tolerance. In most cases, the middleware does not require advances over the state of the art; instead, it requires abstraction to allow for future technological change and aggregation of tools to provide the necessary features.

3 02C.06.02.01 Database and File Access Services

This WBS element contains the I/O layer of the Data Access Framework (DAF).

3.1 Data Access Framework I/O Layer

This layer provides access to local resources (within a data access center, for example) and low-performance access to remote resources.
These resources may include images, non-image files, and databases. Bulk data transfers over the wide-area network (WAN) and high-performance access to remote resources are provided by the replication layer within 02C.07.01.07 File System Services.

3.1.1 Key Requirements

The DAF I/O layer must provide persistence and retrieval capabilities to application code. Persistence is the mechanism by which application objects are written to files in some format, to a database, or to a combination of both; retrieval is the mechanism by which data in files or a database or a combination of both is made available to application code in the form of an application object. Persistence and retrieval must be low-overhead, allowing efficient use of available bandwidth. The interface to the I/O layer must be convenient for application developers to use. It must be flexible, allowing the file format, or even the choice between file and database storage for a given object, to be selected at runtime in a controlled manner. Image data must be able to be stored in standard FITS format, although the metadata for the image may reside either in FITS headers or in database table entries.

3.1.2 Baseline Design

We designed the I/O layer to provide access to datasets. A dataset is a logical grouping of data that is persisted or retrieved as a unit, typically corresponding to a single programming object or a collection of objects. Dataset types are predefined, and each dataset is identified by a unique identifier. Datasets may be persisted into multiple formats. A Formatter subclass is responsible for converting the in-memory version of an object to its persisted form (or forms), represented by a Storage subclass, and vice versa. The Storage interface may be thin (e.g. providing a file's pathname) or thick (e.g. providing an abstract database interface) depending on the complexity of the persisted format; all Formatters using a Storage are required to understand its interface, but no application code need do so. One Storage will represent the publish/subscribe interface used by the camera data acquisition system to deliver image data. A Storage is configured with a LogicalLocation to indicate where the object resides. Formatters and Storages are looked up by name at runtime, so they are fully pluggable. Formatters may make use of existing I/O libraries such as cfitsio, in which case they are generally thin wrappers. Formatters are configured by Policies.

All persistence and retrieval is performed under the control of a Persistence object. This object is responsible for interpreting the overall persistence Policy, managing the lookups and invocations of Formatters and Storages, and ensuring that any transaction/rollback handling is done correctly.

Figure 2. Data Access Framework I/O Layer Components.

3.1.3 Alternatives Considered

Use of a full-fledged object-relational mapping system for output to a database was considered but determined to be too heavyweight and intrusive.

3.1.4 Prototype Implementation

A C++ implementation of the design was created for Data Challenge 2 (DC2) and has evolved since then. Formatters for images and exposures, sources and objects, and PSFs have been created. Datasets are identified by URLs. Storage classes include FITS [1] files, Boost::serialization [2] streams (native and XML), and the MySQL [3] database (via direct API calls or via an intermediate, higher-performance, bulk-loaded tab-separated value file). The camera interface has not yet been prototyped.
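The Formatter/Storage lookup described above can be illustrated with a minimal Python sketch. The registry mechanism, storage class, and dataset type below are invented for illustration and are far simpler than the actual interfaces; in particular, the real design consults Policies and handles transactions.

```python
# Sketch: Formatters and Storages looked up by name at runtime, so application
# code never needs to know how a dataset is persisted. All names are hypothetical.

FORMATTERS = {}  # (dataset type, storage name) -> Formatter class

def register_formatter(dataset_type, storage_name):
    """Register a Formatter for a given dataset type and Storage name."""
    def decorator(cls):
        FORMATTERS[(dataset_type, storage_name)] = cls
        return cls
    return decorator

class DictStorage:
    """Thin stand-in for a Storage: a logical location within an in-memory dict."""
    name = "dict"
    def __init__(self, backing, logical_location):
        self.backing = backing
        self.location = logical_location

@register_formatter("Exposure", "dict")
class ExposureDictFormatter:
    """Converts between the in-memory object and its persisted form for one Storage."""
    def persist(self, obj, storage):
        storage.backing[storage.location] = dict(obj)
    def retrieve(self, storage):
        return dict(storage.backing[storage.location])

class Persistence:
    """Drives I/O: looks up and invokes the Formatter for (dataset type, storage)."""
    def persist(self, dataset_type, obj, storage):
        FORMATTERS[(dataset_type, storage.name)]().persist(obj, storage)
    def retrieve(self, dataset_type, storage):
        return FORMATTERS[(dataset_type, storage.name)]().retrieve(storage)

# Usage: application code talks only to Persistence, never to the Storage itself.
backing = {}
p = Persistence()
p.persist("Exposure", {"ccd": 3, "filter": "r"}, DictStorage(backing, "visit42"))
exp = p.retrieve("Exposure", DictStorage(backing, "visit42"))
```

Because the registry is keyed by names rather than types, new Formatters and Storages can be plugged in without modifying application code, which is the property the baseline design relies on.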
This implementation has been extended in DC3 to include a Python-based version of the same design that uses the C++ implementation internally. In the Python version, a Data Butler plays the role of the Persistence object. It takes dataset identifiers composed of key/value pairs, with the ability to infer missing values as long as those provided are unique. An internal Mapper class uses a Policy to control the format and location of each dataset. A Python-only Storage class has been added to allow persistence via the Python "pickle" mechanism.

[1] http://fits.gsfc.nasa.gov/
[2] http://www.boost.org/doc/libs/1_47_0/libs/serialization/doc/index.html
[3] http://www.mysql.com/

4 02C.07.01.01 Control and Management Services

4.1 Event Subsystem

The event subsystem is used to communicate among components of the DM System, including between pipelines in a production. A monitoring component of the subsystem can execute rules based on patterns of events, enabling fault detection and recovery.

4.1.1 Key Requirements

The event subsystem must reliably transfer events from a source to multiple destinations. There must be no central point of failure. The subsystem must be scalable to handle high volumes of messages, up to tens of thousands per second. It must interface to languages including Python and C++. A monitoring component must be able to detect the absence of messages within a given time window and the presence of messages (such as logged exceptions) defined by a pattern.

4.1.2 Baseline Design

The subsystem will be built as a wrapper over a reliable messaging system such as Apache ActiveMQ [4]. Event subclasses and standardized metadata will be defined in C++ and wrapped using SWIG [5] to make them accessible from Python. Events will be published to a topic; multiple receivers may subscribe to that topic to receive copies of the events.
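The topic-based publish/subscribe semantics described above, in which every subscriber to a topic receives its own copy of each event, can be illustrated with a small in-process Python sketch. The real subsystem delegates this to a messaging broker such as ActiveMQ; the class names here are invented.

```python
import copy
from collections import defaultdict

class EventBroker:
    """In-process stand-in for the messaging system: a topic fans each
    published event out to every subscriber, which gets its own copy."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(copy.deepcopy(event))  # independent copy per receiver

# Usage: two receivers (e.g. the event monitor and a database logger)
# subscribe to the same topic and both see the published event.
broker = EventBroker()
received_a, received_b = [], []
broker.subscribe("LogEvent", received_a.append)
broker.subscribe("LogEvent", received_b.append)
broker.publish("LogEvent", {"level": "WARN", "component": "pipeline.stage3"})
```

A real broker adds the reliability, persistence, and cross-host delivery that this toy model omits, but the topic fan-out semantics are the same.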
The event monitor subscribes to topics that indicate faults or other system status. It can match templates to events, including boolean expressions and time expressions applied to event data and metadata.

Figure 3. Event Subsystem Components.

[4] http://activemq.apache.org
[5] http://www.swig.org

Observatory Control System (OCS) messages destined for the DM System will be translated into DM Event Subsystem events by dedicated software (part of the DMCS; see section 4.3) and published to appropriate topics.

4.1.3 Prototype Implementation

An implementation of the event subsystem on Apache ActiveMQ was created for DC2 and has evolved since then. Command, Log, Monitor, PipelineLog, and Status event types have been defined. Event receivers include pipeline components, orchestration components, the event monitor, and a logger that inserts entries into a database. The event monitor has been prototyped in Java. The OCS message translator has not yet been prototyped.

4.2 Orchestration

The orchestration layer is responsible for deploying pipelines and Policies onto nodes, ensuring that their input data is staged appropriately, distributing dataset identifiers to be processed, recording provenance, and actually starting pipeline execution.

4.2.1 Key Requirements

The orchestration layer must be able to deploy pipelines and their associated configuration Policies onto one or more nodes in a cluster. Different pipelines may be deployed to different, although possibly overlapping, subsets of nodes. All four pipeline execution models (see section 5.2.1) must be supported. Sufficient provenance information must be captured to ensure that datasets can be reproduced from their inputs.

The orchestration layer at the Base Center works with the DM Control System (DMCS; see section 4.3) at that Center to accept commands from the OCS to enter various system modes such as Nightly Observing or Daytime Calibration.
The DMCS invokes the orchestration layer to configure and deploy the Alert Production pipelines accordingly. At the Archive Center, the orchestration layer manages execution of the Data Release Production, including sequencing scans through the raw images in spatial and temporal order. Orchestration must detect failures, categorize them as permanent or possibly transient, and restart transiently-failed processing according to the appropriate fault tolerance strategy.

4.2.2 Baseline Design

The design for the orchestration layer is a pluggable, Policy-controlled framework. Plug-in modules are used to configure and deploy pipelines on a variety of underlying process management technologies (such as simple ssh [6] or more complex Condor-G [7] glide-ins), which is necessary during design and development, when hardware is typically borrowed rather than owned. Additional modules capture hardware, software, and configuration provenance, including information about the execution nodes, the versions of all software packages, and the values of all configuration parameters for both middleware and applications.

This layer monitors the availability of datasets and can trigger the execution of pipelines when their inputs become available. It can hand out datasets to pipelines based on the history of execution and the availability of locally-cached datasets to minimize data movement. Faults are detected by the pipeline harness and by event monitor timeouts. Orchestration then reprocesses transiently-failed datasets on already-deployed pipelines or on new pipeline instances that it deploys for the purpose.

4.2.3 Prototype Implementation

A prototype implementation of the deployment framework was developed for DC3a. It was extended to handle Condor-G, and data dependency features were added for DC3b. Full fault tolerance has not yet been prototyped, although a limited application of a fault tolerance strategy has been demonstrated.
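The data-dependency behavior described in the baseline design, where pipeline execution is triggered once all of a pipeline's input datasets become available, might be sketched as follows. This is a toy model with invented names; the real layer also handles deployment, provenance capture, and fault tolerance.

```python
class Orchestrator:
    """Toy sketch: launch a pipeline once all of its input datasets exist."""
    def __init__(self):
        self._pending = []      # (pipeline name, set of required inputs)
        self._available = set() # dataset identifiers seen so far
        self.launched = []      # stand-in for actual deployment/execution

    def register(self, pipeline_name, inputs):
        """Declare a pipeline and the dataset identifiers it needs."""
        self._pending.append((pipeline_name, set(inputs)))

    def dataset_available(self, dataset_id):
        """Record a newly available dataset and launch any now-ready pipelines."""
        self._available.add(dataset_id)
        still_pending = []
        for name, needed in self._pending:
            if needed <= self._available:
                self.launched.append(name)
            else:
                still_pending.append((name, needed))
        self._pending = still_pending

# Usage: a hypothetical pipeline waits for a raw frame plus calibration frames.
orc = Orchestrator()
orc.register("isr", {"raw-1", "bias", "flat"})
orc.dataset_available("raw-1")
orc.dataset_available("bias")   # not launched yet: "flat" is still missing
orc.dataset_available("flat")   # all inputs present; pipeline is launched
```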
Provenance is recorded in files and, to a limited extent, in a database. The file-based provenance has been demonstrated to be sufficient to regenerate datasets.

[6] http://openssh.com/
[7] http://www.cs.wisc.edu/condor/condorg/

4.3 Data Management Control System

The LSST DMS at each center will be monitored and controlled by a Data Management Control System (DMCS).

4.3.1 Key Requirements

The DMCS at each site is responsible for initializing and running diagnostics on all equipment, including computing nodes, disk storage, tape storage, and networking. It establishes and maintains connectivity with the other sites, including the Headquarters Site. It monitors the operation of all hardware and integrates with the orchestration layer (see section 4.2) to monitor software execution. System status and control functions will be available via a Web-enabled tool to the Headquarters Site and remote locations.

At the Base Center, the DMCS is responsible for interfacing with the OCS (as defined in "Control System Interfaces between the Telescope & Data Management", Document LSE-75). It accepts commands from the OCS to enter various modes, including observing, calibration, day, maintenance, and shutdown. It then configures and invokes the orchestration layer and the replication layer (see section 7.1) to enable the necessary processing and data movement for each mode, including running the Alert Production and replicating raw images to the Archive Center, respectively.

At the Archive Center, the DMCS performs resource management for the compute cluster. Parts of the cluster may be dedicated to certain activities while others operate in a shared mode. The major processing activities under DMCS control, invoked using the orchestration layer, include the Alert Production reprocessing (on dedicated hardware), the Calibration Products Production, and the Data Release Production.
The DMCS also initializes the replication layer to enable the archiving of raw images received from the Base Site.

At each Data Access Center, the DMCS performs resource management for the Level 3 data products compute cluster. It also initializes the replication layer to enable the distribution of Level 1 data products received from the Base Center or the Archive Center.

4.3.2 Baseline Design

The DMCS will consist of an off-the-shelf cluster management system together with a custom pluggable software framework. A Web-based control panel and an off-the-shelf monitoring system will also be integrated. Plugins will include job management systems like Condor, mode transition scripts to interface with the OCS and control panel, and hardware-specific initialization and configuration software.

5 02C.07.01.02 Pipeline Construction Toolkit

5.1 Policy Framework

The Policy component of the Pipeline Framework is of key importance throughout the LSST middleware. Policies are a mechanism to specify parameters for applications and middleware in a consistent, managed way. The use of Policies facilitates runtime reconfiguration of the entire system while still ensuring consistency and the maintenance of traceable provenance.

5.1.1 Key Requirements

Policies must be able to contain parameters of various types, including at least strings, booleans, integers, and floating-point numbers. Ordered lists of each of these types must also be supported. Each parameter must have a name. A hierarchical organization of names is required so that all parameters associated with a given component may be named and accessed as a group. There must be a facility to specify legal and required parameters and their types and to use this information to ensure that invalid parameters are detected before code attempts to use them. Default values for parameters must be able to be specified; it must also be possible to override those default values, potentially multiple times (with the last override controlling).
Policies and their parameters must be stored in a user-modifiable form. It is preferable for this form to be textual so that it is human-readable and modifiable using an ordinary text editor. It must be possible to save sufficient information about a Policy to obtain the value of any of its parameters as seen by the application code.

5.1.2 Baseline Design

The design follows straightforwardly from the requirements. Policies are specified by a text file containing hierarchically-organized name/value pairs. A value may be another Policy (referred to as a sub-Policy). A value may also be a list of values (all of the same type). Policies may reference other Policies to set values for sub-Policies. A Dictionary, which is itself a Policy, specifies the legal parameter names, their types, minimum and maximum lengths for list values, and whether a parameter is required. Since Dictionaries are Policies, they may use Policy references to incorporate other Dictionaries to validate sub-Policies. Each piece of application code (routine or object) using a Policy will typically have an associated Dictionary to validate the Policy parameters and provide default values. Default values may also be provided by the code's caller, adding to or overriding the Dictionary defaults.

With text-file Policies, the complete parameter state of a given execution may be preserved by preserving all the text files. In addition, the simple hierarchical syntax lends itself to storage in a database as a key/value table with dotted-name keys, allowing queries of the parameters by name (including with regular expressions) and value.

5.1.3 Prototype Implementation

An implementation of Policy using a simple "name: value" syntax with brace-delimited sub-Policies has been in use since DC2. Hierarchical names are specified using dotted-path notation. XML syntax was considered but determined to be too wordy and difficult to edit.
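The dotted-path parameter access and default-merging behavior described above can be illustrated with a toy Python sketch. The real Policy class provides typed accessors, Dictionary validation, and text-file parsing, none of which are shown, and the parameter names below are invented.

```python
class Policy:
    """Toy hierarchical parameter store addressed by dotted-path names."""
    def __init__(self, params=None):
        self._params = dict(params or {})  # flat dict keyed by dotted paths

    def get(self, dotted_name):
        return self._params[dotted_name]

    def merge_defaults(self, defaults):
        """Fill in defaults without overwriting values already present,
        so the last override (the incoming Policy) controls."""
        for name, value in defaults.items():
            self._params.setdefault(name, value)

# Dictionary-supplied defaults for a hypothetical detection component.
dictionary_defaults = {
    "detection.threshold": 5.0,
    "detection.psf.width": 1.2,
}

# A user-supplied Policy overrides one parameter; the rest come from defaults.
policy = Policy({"detection.threshold": 10.0})
policy.merge_defaults(dictionary_defaults)
```

Note that merging is explicit here, mirroring the prototype's behavior in which application code must merge default values into an incoming Policy via an API call.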
The dotted-path notation does not currently support referring to individual list elements. Dictionaries have been implemented with validation for fixed parameter names. Extending this validation to variable parameter names (e.g. for parameters pertaining to pluggable measurement algorithms) has not yet been implemented. Automatic merging of overrides and validation of the result is also currently unimplemented; instead, application code must merge default values into an incoming Policy using an API call. Inter-Policy references are implemented using pathnames or references that can locate Policies with respect to their containing software packages. Provenance code can load a complete set of Policies into a set of database tables for querying. Loading of simple lists of values is supported, but loading of lists of sub-Policies has not yet been implemented.

5.2 Pipeline Harness

A pipeline is a very common representation of astronomical processing. Datasets are processed by a series of components in turn. Each component applies an algorithm to one or more input datasets, producing one or more outputs that are handed to the next component. The pipeline harness provides the ability to create these pipelines.

5.2.1 Key Requirements

The pipeline harness must allow components to be specified in Python. It must handle the transfer of datasets from component to component. To ensure adequate performance for the Alert Production, such data transfer must be possible in memory, not solely through disk files. Pipeline components must be able to report errors and thereby prevent the execution of downstream components. The pipeline harness must support execution in four modes:

- Single task (serial mode). One pipeline instance executes on one dataset. This mode is useful for development, testing, and debugging.
- Single task (parallel mode). Multiple linked pipeline instances execute on multiple datasets belonging to a single task while communicating amongst themselves and synchronizing when appropriate. This mode is required for real-time alert processing.
- Multiple tasks (batch mode). Multiple pipeline instances execute on one dataset each. Instances are independent of each other except that an instance may not be executed until all of its inputs are available. Instances may be executing different code to perform different tasks. This mode is required for some types of Data Release processing.
- Multiple tasks (data-sensitive mode). Multiple long-lived pipeline instances execute on multiple datasets, with dataset assignments to pipelines depending on past history, enabling repeatedly-used data to be kept in memory or at least on the node. Instances may again be executing different code for different tasks. This mode is required for some types of data-intensive Data Release processing.

5.2.2 Baseline Design

The pipeline harness is composed of Pipeline, Slice, and Stage objects. A Pipeline is responsible for starting, stopping, and synchronizing Slices. Each Slice is a linear, unbranched sequence of Stages; all Slices run in parallel. The configuration of each Stage is controlled by a Policy. Each Stage wraps a single algorithmic process. Stages may pass in-memory objects to downstream Stages.

Figure 4. Pipeline Harness Components.

5.2.3 Prototype Implementations

One implementation of the harness design has been developed in C++ and Python. Slices can communicate with each other via a thread model, via MPI [8], or not at all. The sequence of Stages is defined by a Policy that also contains the Stage configuration Policies. Stages receive data via an in-memory Clipboard that contains name/value pairs; they process this data and place their output (or transformed input) onto the Clipboard for the next Stage.
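The Stage/Clipboard flow described above can be sketched in simplified form. Slices, Policies, and inter-process communication are omitted, and the two concrete stage classes are invented examples, not actual LSST stages.

```python
class Clipboard(dict):
    """In-memory name/value container passed from Stage to Stage."""

class Stage:
    """One algorithmic step; subclasses read inputs from and write
    outputs to the Clipboard."""
    def process(self, clipboard):
        raise NotImplementedError

class CalibrateStage(Stage):
    """Hypothetical stage: scale raw pixel values (stand-in for calibration)."""
    def process(self, clipboard):
        clipboard["calibrated"] = [x * 2.0 for x in clipboard["raw"]]

class DetectStage(Stage):
    """Hypothetical stage: keep values above a fixed detection threshold."""
    def process(self, clipboard):
        clipboard["sources"] = [x for x in clipboard["calibrated"] if x > 3.0]

class Pipeline:
    """Runs a linear, unbranched sequence of Stages on one Clipboard.
    An exception raised by a Stage halts all downstream Stages."""
    def __init__(self, stages):
        self.stages = stages
    def run(self, clipboard):
        for stage in self.stages:
            stage.process(clipboard)
        return clipboard

# Usage: two stages share data purely through the in-memory Clipboard.
clip = Pipeline([CalibrateStage(), DetectStage()]).run(Clipboard(raw=[1.0, 2.0, 3.0]))
```

Because each Stage sees only the Clipboard, stages can be reordered or replaced by editing the stage sequence (a Policy in the C++/Python harness, a Python script in the Python-only harness) without touching the stage implementations.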
Items on the Clipboard may be transmitted from one Slice to another using the inter-Slice communication mechanism; Slices are addressed by topological labels. Two Stages have been implemented to interface with the Data Access Framework I/O layer (see section 3.1) to persist and retrieve datasets. An additional Stage can be used to send Events (see section 4.1) to other pipelines or to the orchestration layer. Other Stages interface with the orchestration layer (see section 4.2) via Events to define the datasets to be operated on and to report errors in the pipeline, which are transmitted from the Stage to the Pipeline via a Python exception.

Another implementation has been developed in Python. This implementation is currently limited to the single-task serial and multiple-task batch modes of operation with one Slice; it is primarily intended for development and debugging purposes. It can be extended to use thread-based or MPI-based communication in the future. Stages are implemented by Python classes; the sequence of Stages is specified by a Python script rather than a Policy, adding more programmability. Stages pass data through in-memory Python variables. Direct calls to the Data Access Framework are made to persist and retrieve datasets, and errors are reported through normal Python exceptions.

[8] http://mpi-forum.org

6 02C.07.01.03 Pipeline Execution Services

6.1 Logging Subsystem

The logging subsystem is used by application and middleware code to record status and debugging information.

6.1.1 Key Requirements

Log messages must be associated with component names organized hierarchically. Logging levels controlling which messages are produced must be configurable on a per-component basis. There must be a way for messages that are not produced to avoid adding overhead. Logs must be able to be written to local disk files as well as sent via the event subsystem.
Metadata about a component's context, such as a description of the CCD being processed, must be able to be attached to a log message.

6.1.2 Baseline Design

Log objects are created in a parent/child hierarchy and associated with dotted-path names; each such Log and name has an importance threshold associated with it. Methods on the Log object are used to record log messages. One such method uses the C++ varargs functionality to avoid formatting the message until it has been determined that its importance meets the threshold. Log messages are contained within LogRecords that carry additional key/value contextual metadata. Multiple LogDestination streams can be created and attached to Logs (and inherited by child Logs). Each such stream has its own importance threshold. LogRecords may be formatted in different ways depending on the LogDestination. LogRecords may also be incorporated into Events (see section 4.1) and transmitted on a topic. Two sets of wrappers around the basic Log objects simplify logging start/stop timing messages and allow debug messages to be compiled out.

6.1.3 Prototype Implementation

A prototype implementation was created in C++ for DC2; the debugging and logging components of that implementation were merged for DC3a. The C++ interface is wrapped by SWIG into Python.

7 02C.07.01.07 File System Services

7.1 Data Access Framework Replication Layer

The DAF replication layer moves data from site to site over the WAN, including providing for image regeneration and caching.

7.1.1 Key Requirements

The replication layer must be able to reliably and rapidly move files from one site to another. It must be fully automatable and monitorable, and it must scale to the billions of files that the DMS will contain.
It must be able to drive the retrieval of archived raw images, their processing into calibrated science images (CSIs), the caching of the resulting images, and the cutting out of regions from those images for access by remote users.

7.1.2 Baseline Design

The iRODS [9] rule-based trigger mechanism and its pluggable micro-services capability are well suited to handling inter-site transfer tasks. Because the iRODS feature set matches our requirements closely, it is the baseline for this layer.

7.1.3 Prototype Implementation

We tested iRODS in 2006 to assess its bandwidth efficiency and ability to sustain high transfer rates. The test successfully transferred image data at a rate of 6 TB/day (more than 10% of the LSST base-to-archive average rate). We also attempted to use iRODS to transfer image simulation results, learning that improper configuration could cause performance and usability difficulties. We are continuing to experiment with other packages that could also meet our requirements, such as the REDDnet [10] distributed storage facility, the Xrootd [11] scalable data access system, and other wide-area parallel filesystems.

[9] http://www.irods.org/
[10] http://www.reddnet.org/
[11] http://xrootd.slac.stanford.edu/