Large Synoptic Survey Telescope (LSST)
Data Management Middleware Design
Kian-Tat Lim, Ray Plante, Gregory Dubois-Felsmann
LSE-152
July 25, 2011
Change Record

Version   Date        Description                                          Owner name
1.0       7/25/2011   Initial version based on pre-existing UML models     Kian-Tat Lim
                      and presentations
Table of Contents

Change Record
1 Executive Summary
2 Introduction
3 02C.06.02.01 Database and File Access Services
  3.1 Data Access Framework I/O Layer
4 02C.07.01.01 Control and Management Services
  4.1 Event Subsystem
  4.2 Orchestration
  4.3 Data Management Control System
5 02C.07.01.02 Pipeline Construction Toolkit
  5.1 Policy Framework
  5.2 Pipeline Harness
6 02C.07.01.03 Pipeline Execution Services
  6.1 Logging Subsystem
7 02C.07.01.07 File System Services
  7.1 Data Access Framework Replication Layer
The LSST Data Management Middleware Design
1 Executive Summary
The LSST middleware is designed to isolate scientific applications, including the Alert Production, Data
Release Production, Calibration Products Production, and Level 3 processing, from details of the
underlying hardware and system software. It enables flexible reuse of the same code in multiple
environments ranging from offline laptops to shared-memory multiprocessors to grid-accessed clusters,
with a common communication and logging model. It ensures that key scientific and deployment
parameters controlling execution can be easily modified without changing code but also with full
provenance to understand what environment and parameters were used to produce any dataset. It
provides flexible, high-performance, low-overhead persistence and retrieval of datasets with data
repositories and formats selected by external parameters rather than hard-coding. Middleware services
enable efficient, managed replication of data over both wide area networks and local area networks.
2 Introduction
This document describes the baseline design of the LSST data access and processing middleware,
including the following elements of the Data Management (DM) Construction Work Breakdown
Structure (WBS):

• 02C.06.02.01 Database and File Access Services
• 02C.07.01.01 Control and Management Services
• 02C.07.01.02 Pipeline Construction Toolkit
• 02C.07.01.03 Pipeline Execution Services
• 02C.07.01.07 File System Services
The LSST database design, which contributes to WBS element 02C.06.02.01 and other elements within
02C.06.02, may be found in the document entitled “Data Management Database Design” (LDM-135).
WBS element 02C.07.04 is described in “LSST Cybersecurity Plan” (LSE-99). 02C.07.05 (visualization) and
02C.07.06 (system administration) are primarily low-level, off-the-shelf tools and are not described
further here. 02C.07.08 (environment and tools) includes similar off-the-shelf tools as well as testbeds
and other primarily-hardware elements.
Figure 1. Data Management System Layers.
Common to all aspects of the middleware design is an emphasis on flexibility through the use of
abstract, pluggable interfaces controlled by managed, user-modifiable parameters. In addition, the
substantial computational and bandwidth requirements of the LSST Data Management System (DMS)
force the designs to be conscious of performance, scalability, and fault tolerance. In most cases, the
middleware does not require advances over the state of the art; instead, it requires abstraction to allow
for future technological change and aggregation of tools to provide the necessary features.
3 02C.06.02.01 Database and File Access Services
This WBS element contains the I/O layer of the Data Access Framework (DAF).
3.1 Data Access Framework I/O Layer
This layer provides access to local resources (within a data access center, for example) and
low-performance access to remote resources. These resources may include images, non-image files, and
databases. Bulk data transfers over the wide-area network (WAN) and high-performance access to
remote resources are provided by the replication layer within 02C.07.01.07 File System Services.
3.1.1 Key Requirements
The DAF I/O layer must provide persistence and retrieval capabilities to application code. Persistence is
the mechanism by which application objects are written to files in some format or a database or a
combination of both; retrieval is the mechanism by which data in files or a database or a combination of
both is made available to application code in the form of an application object. Persistence and retrieval
must be low-overhead, allowing efficient use of available bandwidth. The interface to the I/O layer must
be usable by application developers. It is required to be flexible, allowing changes in file formats or even
whether a given object is stored in a file or the database to be selected at runtime in a controlled
manner. Image data must be able to be stored in standard FITS format, although the metadata for the
image may be in either FITS headers or database table entries.
3.1.2 Baseline Design
We designed the I/O layer to provide access to datasets. A dataset is a logical grouping of data that is
persisted or retrieved as a unit, typically corresponding to a single programming object or a collection of
objects. Dataset types are predefined. Datasets are identified by a unique identifier. Datasets may be
persisted into multiple formats.
A Formatter subclass is responsible for converting the in-memory version of an object to its persisted
form (or forms), represented by a Storage subclass, and vice versa. The Storage interface may be thin
(e.g. providing a file's pathname) or thick (e.g. providing an abstract database interface) depending on
the complexity of the persisted format; all Formatters using a Storage are required to understand its
interface, but no application code need do so. One Storage will represent the publish/subscribe
interface used by the camera data acquisition system to deliver image data. A Storage is configured with
a LogicalLocation to indicate where the object resides. Formatters and Storages are looked up by name
at runtime, so they are fully pluggable. Formatters may make use of existing I/O libraries such as cfitsio,
in which case they are generally thin wrappers. Formatters are configured by Policies.
All persistence and retrieval is performed under the control of a Persistence object. This object is
responsible for interpreting the overall persistence Policy, managing the lookups and invocations of
Formatters and Storages, and ensuring that any transaction/rollback handling is done correctly.
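The pluggable lookup described above can be sketched in Python. This is an illustrative toy, not the actual LSST interfaces: the registry, class, and policy names are all hypothetical, and a JSON file stands in for the real persisted formats.

```python
import json

# Registries for pluggable Formatters and Storages, resolved by name at
# runtime so that persistence targets can change without code changes.
FORMATTERS = {}
STORAGES = {}

class FileStorage:
    """A 'thin' Storage: just carries a pathname (the LogicalLocation)."""
    def __init__(self, logical_location):
        self.path = logical_location

STORAGES["FileStorage"] = FileStorage

class JsonFormatter:
    """Converts between the in-memory object and its persisted form."""
    def persist(self, obj, storage):
        with open(storage.path, "w") as f:
            json.dump(obj, f)

    def retrieve(self, storage):
        with open(storage.path) as f:
            return json.load(f)

FORMATTERS["JsonFormatter"] = JsonFormatter

class Persistence:
    """Drives I/O: looks up the Formatter and Storage named in a policy."""
    def persist(self, obj, policy, location):
        storage = STORAGES[policy["storage"]](location)
        FORMATTERS[policy["formatter"]]().persist(obj, storage)

    def retrieve(self, policy, location):
        storage = STORAGES[policy["storage"]](location)
        return FORMATTERS[policy["formatter"]]().retrieve(storage)
```

Because the Formatter and Storage are named in the policy rather than in application code, switching an object from, say, file to database persistence is purely a configuration change.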
Figure 2. Data Access Framework I/O Layer Components.
3.1.3 Alternatives Considered
Use of a full-fledged object-relational mapping system for output to a database was considered but
determined to be too heavyweight and intrusive.
3.1.4 Prototype Implementation
A C++ implementation of the design was created for Data Challenge 2 (DC2) and has evolved since then.
Formatters for images and exposures, sources and objects, and PSFs have been created. Datasets are
identified by URLs. Storage classes include FITS [1] files, Boost::serialization [2] streams (native and XML),
and the MySQL [3] database (via direct API calls or via an intermediate, higher-performance, bulk-loaded
tab-separated value file). The camera interface has not yet been prototyped.
This implementation has been extended in DC3 to include a Python-based version of the same design
that uses the C++ implementation internally. In the Python version, a Data Butler plays the role of the
Persistence object. It takes dataset identifiers that are composed of key/value pairs, with the ability to
infer missing values as long as those provided are unique. An internal Mapper class uses a Policy to
control the format and location for each dataset. A Python-only Storage class has been added to allow
persistence via the Python "pickle" mechanism.
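The key/value dataset-identifier idea can be illustrated with a toy butler. This is a deliberately simplified sketch with hypothetical names: the real implementation's Mapper, Policy control, and inference of missing identifier values are omitted, and an in-memory dict stands in for files on disk.

```python
class ToyButler:
    """Toy illustration of key/value dataset identifiers (not the real API)."""

    def __init__(self, templates):
        # templates: dataset type -> path template, e.g.
        # "raw" -> "raw/v{visit}_c{ccd}.fits"  (a Mapper would do this lookup)
        self.templates = templates
        self.store = {}  # stand-in for datasets on disk

    def _location(self, dataset_type, **data_id):
        # Turn a dataset type plus key/value identifier into a location.
        return self.templates[dataset_type].format(**data_id)

    def put(self, obj, dataset_type, **data_id):
        self.store[self._location(dataset_type, **data_id)] = obj

    def get(self, dataset_type, **data_id):
        return self.store[self._location(dataset_type, **data_id)]

butler = ToyButler({"raw": "raw/v{visit}_c{ccd}.fits"})
butler.put("pixels", "raw", visit=42, ccd=7)
assert butler.get("raw", visit=42, ccd=7) == "pixels"
```

The essential point is that application code names datasets only by type and key/value pairs; where and how the data actually lives is decided by configuration behind the butler interface.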
[1] http://fits.gsfc.nasa.gov/
[2] http://www.boost.org/doc/libs/1_47_0/libs/serialization/doc/index.html
[3] http://www.mysql.com/
4 02C.07.01.01 Control and Management Services
4.1 Event Subsystem
The event subsystem is used to communicate among components of the DM System, including between
pipelines in a production. A monitoring component of the subsystem can execute rules based on
patterns of events, enabling fault detection and recovery.
4.1.1 Key Requirements
The event subsystem must reliably transfer events from source to multiple destinations. There must be
no central point of failure. The subsystem must be scalable to handle high volumes of messages, up to
tens of thousands per second. It must interface to languages including Python and C++.
A monitoring component must be able to detect the absence of messages within a given time window
and the presence of messages (such as logged exceptions) defined by a pattern.
4.1.2 Baseline Design
The subsystem will be built as a wrapper over a reliable messaging system such as Apache ActiveMQ [4].
Event subclasses and standardized metadata will be defined in C++ and wrapped using SWIG [5] to make
them accessible from Python. Events will be published to a topic; multiple receivers may subscribe to
that topic to receive copies of the events.
The event monitor subscribes to topics that indicate faults or other system status. It can match
templates to events, including boolean expressions and time expressions applied to event data and
metadata.
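The topic-based publish/subscribe model and the monitor's absence-detection rule can be sketched in-process. The baseline wraps a broker such as ActiveMQ; this toy registry only illustrates the shape, and all names are illustrative.

```python
import time
from collections import defaultdict

class EventBus:
    """Toy in-process stand-in for a topic-based message broker."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Every subscriber to the topic receives its own copy of the event.
        for callback in self.subscribers[topic]:
            callback(dict(event))

class AbsenceMonitor:
    """Flags a fault if no event arrives on its topic within `window` seconds,
    modeling the monitor's detection of missing messages in a time window."""

    def __init__(self, bus, topic, window):
        self.window = window
        self.last_seen = time.monotonic()
        bus.subscribe(topic, self._on_event)

    def _on_event(self, event):
        self.last_seen = time.monotonic()

    def overdue(self):
        return time.monotonic() - self.last_seen > self.window
```

A real monitor would additionally match boolean and time expressions against event data and metadata; only the timeout rule is shown here.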
Figure 3. Event Subsystem Components
[4] http://activemq.apache.org
[5] http://www.swig.org
Observatory Control System (OCS) messages destined for the DM System will be translated into DM
Event Subsystem events by dedicated software (part of the DMCS, see section 4.3) and published to
appropriate topics.
4.1.3 Prototype Implementation
An implementation of the event subsystem on Apache ActiveMQ was created for DC2 and has evolved
since then. Command, Log, Monitor, PipelineLog, and Status event types have been defined. Event
receivers include pipeline components, orchestration components, the event monitor, and a logger that
inserts entries into a database.
The event monitor has been prototyped in Java. The OCS message translator has not yet been
prototyped.
4.2 Orchestration
The orchestration layer is responsible for deploying pipelines and Policies onto nodes, ensuring that
their input data is staged appropriately, distributing dataset identifiers to be processed, recording
provenance, and actually starting pipeline execution.
4.2.1 Key Requirements
The orchestration layer must be able to deploy pipelines and their associated configuration Policies onto
one or more nodes in a cluster. Different pipelines may be deployed to different, although possibly
overlapping, subsets of nodes. All four pipeline execution models (see section 5.2.1) must be supported.
Sufficient provenance information must be captured to ensure that datasets can be reproduced from
their inputs.
The orchestration layer at the Base Center works with the DM Control System (DMCS, see section 4.3) at
that Center to accept commands from the OCS to enter various system modes such as Nightly Observing
or Daytime Calibration. The DMCS invokes the orchestration layer to configure and deploy the Alert
Production pipelines accordingly. At the Archive Center, the orchestration layer manages execution of
the Data Release Production, including sequencing scans through the raw images in spatial and temporal
order.
Orchestration must detect failures, categorize them as permanent or possibly-transient, and restart
transiently-failed processing according to the appropriate fault tolerance strategy.
4.2.2 Baseline Design
The design for the orchestration layer is a pluggable, Policy-controlled framework. Plug-in modules are
used to configure and deploy pipelines on a variety of underlying process management technologies
(such as simple ssh [6] or more complex Condor-G [7] glide-ins); this flexibility is necessary during design
and development, when hardware is typically borrowed rather than owned. Additional modules capture
hardware, software, and configuration provenance, including information about the execution nodes,
the versions of all software packages, and the values of all configuration parameters for both
middleware and applications.
This layer monitors the availability of datasets and can trigger the execution of pipelines when their
inputs become available. It can hand out datasets to pipelines based on the history of execution and the
availability of locally-cached datasets to minimize data movement.
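The locality-aware hand-out described above might look like the following sketch. This is a simplified heuristic for illustration, not the actual orchestration code; the function and variable names are hypothetical.

```python
def assign(dataset, pipelines, cache):
    """Pick a pipeline instance for `dataset`, preferring locality.

    pipelines: list of pipeline instance ids.
    cache: instance id -> set of datasets already cached on that node.
    """
    # Prefer an instance that already holds the dataset locally,
    # so no data movement is needed.
    for p in pipelines:
        if dataset in cache.get(p, set()):
            return p
    # Otherwise fall back to a simple load heuristic: the instance
    # holding the fewest cached datasets.
    return min(pipelines, key=lambda p: len(cache.get(p, set())))
```

A production scheduler would also weigh execution history and in-flight work; the sketch only shows the "minimize data movement" preference.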
Faults are detected by the pipeline harness and event monitor timeouts. Orchestration then reprocesses
transiently-failed datasets on already-deployed pipelines or on new pipeline instances that it deploys for
the purpose.
4.2.3 Prototype Implementation
A prototype implementation of the deployment framework was developed for DC3a. It was extended to
handle Condor-G, and data dependency features were added for DC3b. Full fault tolerance has not yet
been prototyped, although a limited application of a fault tolerance strategy has been demonstrated.
Provenance is recorded in files and, to a limited extent, in a database. The file-based provenance has
been demonstrated to be sufficient to regenerate datasets.
4.3 Data Management Control System
The LSST DMS at each center will be monitored and controlled by a Data Management Control System
(DMCS).
4.3.1 Key Requirements
The DMCS at each site is responsible for initializing and running diagnostics on all equipment, including
computing nodes, disk storage, tape storage, and networking. It establishes and maintains connectivity
with the other sites including the Headquarters Site. It monitors the operation of all hardware and
integrates with the orchestration layer (see section 4.2) to monitor software execution. System status
and control functions will be available via a Web-enabled tool to the Headquarters Site and remote
locations.
At the Base Center, the DMCS is responsible for interfacing with the OCS (as defined in “Control System
Interfaces between the Telescope & Data Management”, Document LSE-75). It accepts commands from
the OCS to enter various modes, including observing, calibration, day, maintenance, and shutdown. It
then configures and invokes the orchestration layer and the replication layer (see section 7.1) to enable
the necessary processing and data movement for each mode, including running the Alert Production and
replicating raw images to the Archive Center, respectively.

[6] http://openssh.com/
[7] http://www.cs.wisc.edu/condor/condorg/
At the Archive Center, the DMCS performs resource management for the compute cluster. Parts of the
cluster may be dedicated to certain activities while others operate in a shared mode. The major
processing activities under DMCS control, invoked using the orchestration layer, include the Alert
Production reprocessing (on dedicated hardware), the Calibration Products Production, and the Data
Release Production. The DMCS also initializes the replication layer to enable the archiving of raw images
received from the Base Site.
At each Data Access Center, the DMCS performs resource management for the level 3 data products
compute cluster. It also initializes the replication layer to enable the distribution of level 1 data products
received from the Base Center or the Archive Center.
4.3.2 Baseline Design
The DMCS will consist of an off-the-shelf cluster management system together with a custom pluggable
software framework. A Web-based control panel and an off-the-shelf monitoring system will also be
integrated. Plugins will include job management systems like Condor, mode transition scripts to
interface with the OCS and control panel, and hardware-specific initialization and configuration
software.
5 02C.07.01.02 Pipeline Construction Toolkit
5.1 Policy Framework
The Policy component of the Pipeline Framework is of key importance throughout the LSST middleware.
Policies are a mechanism to specify parameters for applications and middleware in a consistent,
managed way. The use of Policies facilitates runtime reconfiguration of the entire system while still
ensuring consistency and the maintenance of traceable provenance.
5.1.1 Key Requirements
Policies must be able to contain parameters of various types, including at least strings, booleans,
integers, and floating-point numbers. Ordered lists of each of these must also be supported. Each
parameter must have a name. A hierarchical organization of names is required so that all parameters
associated with a given component may be named and accessed as a group.
There must be a facility to specify legal and required parameters and their types and to use this
information to ensure that invalid parameters are detected before code attempts to use them. Default
values for parameters must be able to be specified; it must also be possible to override those default
values, potentially multiple times (with the last override controlling).
Policies and their parameters must be stored in a user-modifiable form. It is preferable for this form to
be textual so that it is human-readable and modifiable using an ordinary text editor.
It must be possible to save sufficient information about a Policy to obtain the value of any of its
parameters as seen by the application code.
5.1.2 Baseline Design
The design follows straightforwardly from the requirements.
Policies are specified by a text file containing hierarchically-organized name/value pairs. A value may be
another Policy (referred to as a sub-Policy). A value may also be a list of values (all of the same type).
Policies may reference other Policies to set values for sub-Policies.
A Dictionary, which is also a Policy, specifies the legal parameter names, their types, minimum and
maximum lengths for list values, and whether a parameter is required. Since Dictionaries are Policies,
they may use Policy references to incorporate other dictionaries to validate sub-Policies.
Each piece of application code (routine or object) using a Policy will typically have an associated
Dictionary to validate the Policy parameters and provide default values. Default values may also be
provided by the code’s caller, adding to or overriding the Dictionary defaults.
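The default/override layering described above, with the last override controlling, can be sketched as follows (an illustrative toy in which Policies are plain nested dicts, not the actual Policy API):

```python
def merge_policies(*layers):
    """Merge Policy layers; later layers override earlier ones.

    Nested dicts (sub-Policies) merge recursively; scalar and list
    values are simply replaced, so the last override wins.
    """
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_policies(merged[key], value)
            else:
                merged[key] = value
    return merged

defaults = {"psf": {"alg": "gaussian", "width": 3}}   # from a Dictionary
override = {"psf": {"width": 5}}                      # caller's override
assert merge_policies(defaults, override) == {"psf": {"alg": "gaussian", "width": 5}}
```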
With text-file Policies, the complete parameter state of a given execution may be preserved by
preserving all the text files. In addition, the simple hierarchical syntax lends itself to storage in a
database as a key/value table with dotted-name keys, allowing queries of the parameters by name
(including the use of regular expressions) and value.
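The dotted-name key/value storage scheme can be sketched as follows (illustrative only; the real provenance loader targets database tables rather than a Python dict):

```python
def flatten(policy, prefix=""):
    """Flatten a hierarchical Policy into dotted-name key/value rows,
    suitable for a key/value table that can be queried by name."""
    rows = {}
    for name, value in policy.items():
        key = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):          # sub-Policy: recurse
            rows.update(flatten(value, key))
        else:
            rows[key] = value
    return rows

assert flatten({"stage": {"psf": {"width": 5}, "doWrite": True}}) == {
    "stage.psf.width": 5,
    "stage.doWrite": True,
}
```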
5.1.3 Prototype Implementation
An implementation of Policy using a simple "name: value" syntax with brace-delimited sub-Policies has
been in use since DC2. Hierarchical names are specified using dotted-path notation. XML syntax was
considered but determined to be too wordy and difficult to edit. The dotted-path notation does not
currently support referring to individual list elements.
Dictionaries have been implemented with validation for fixed parameter names. Extending this
validation to variable parameter names (e.g. for parameters pertaining to pluggable measurement
algorithms) has not yet been implemented. Automatic merging of overrides and validation of the result
is also currently unimplemented; instead, application code must merge default values into an incoming
Policy using an API call.
Inter-Policy references are implemented using pathnames or references that can locate Policies with
respect to their containing software packages.
Provenance code can load a complete set of Policies into a set of database tables for querying. Loading
of simple lists of values is supported, but loading of lists of sub-Policies has not yet been implemented.
5.2 Pipeline Harness
A pipeline is a very common representation of astronomical processing. Datasets are processed by a
series of components in turn. Each component applies an algorithm to one or more input datasets,
producing one or more outputs that are handed to the next component. The pipeline harness provides
the ability to create these pipelines.
5.2.1 Key Requirements
The pipeline harness must allow components to be specified in Python. It must handle the transfer of
datasets from component to component. To ensure adequate performance for the Alert Production,
such data transfer must be possible in memory, not solely through disk files. Pipeline components must
be able to report errors and thereby prevent the execution of downstream components.
The pipeline harness must support execution in four modes:

• Single task (serial mode). One pipeline instance executes on one dataset. This mode is useful for
development, testing, and debugging.
• Single task (parallel mode). Multiple linked pipeline instances execute on multiple datasets
belonging to a single task while communicating amongst themselves and synchronizing when
appropriate. This mode is required for real-time alert processing.
• Multiple tasks (batch mode). Multiple pipeline instances execute on one dataset each.
Instances are independent of each other except that an instance may not be executed until all of
its inputs are available. Instances may be executing different code to perform different tasks.
This mode is required for some types of Data Release processing.
• Multiple tasks (data-sensitive mode). Multiple long-lived pipeline instances execute on multiple
datasets, with dataset assignments to pipelines depending on past history, enabling
repeatedly-used data to be kept in memory or at least on the node. Instances may again be
executing different code for different tasks. This mode is required for some types of
data-intensive Data Release processing.

5.2.2 Baseline Design
The pipeline harness comprises Pipeline, Slice, and Stage objects. A Pipeline is responsible for
starting, stopping, and synchronizing Slices. Each Slice is a linear, unbranched sequence of Stages; all
Slices run in parallel. The configuration of each Stage is controlled by a Policy. Each Stage wraps a single
algorithmic process. Stages may pass in-memory objects to downstream Stages.
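The Pipeline/Slice/Stage shape can be sketched as follows. This is an illustrative toy: the stage classes are hypothetical, and a plain dict stands in for the in-memory hand-off between Stages.

```python
class Stage:
    """One algorithmic step; its configuration comes from a Policy-like dict."""
    def __init__(self, policy=None):
        self.policy = policy or {}

    def process(self, board):
        raise NotImplementedError

class ScaleStage(Stage):
    def process(self, board):
        factor = self.policy.get("factor", 1)
        board["scaled"] = [x * factor for x in board["pixels"]]

class SumStage(Stage):
    def process(self, board):
        board["total"] = sum(board["scaled"])

class Slice:
    """A linear, unbranched sequence of Stages sharing one in-memory board."""
    def __init__(self, stages):
        self.stages = stages

    def run(self, board):
        for stage in self.stages:
            # An exception raised here halts execution of downstream Stages,
            # matching the error-reporting requirement.
            stage.process(board)
        return board

board = Slice([ScaleStage({"factor": 2}), SumStage()]).run({"pixels": [1, 2, 3]})
assert board["total"] == 12
```

A full Pipeline would start several such Slices in parallel and synchronize them; only the serial single-Slice path is shown.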
Figure 4. Pipeline Harness Components.
5.2.3 Prototype Implementations
One implementation of the harness design has been developed in C++ and Python. Slices can
communicate with each other via a thread model, via MPI [8], or not at all. The sequence of Stages is
defined by a Policy that also contains the Stage configuration Policies. Stages receive data via an
in-memory Clipboard that contains name/value pairs; they process this data and place their output (or
transformed input) onto the Clipboard for the next Stage. Items on the Clipboard may be transmitted
from one Slice to another using the inter-Slice communication mechanism; Slices are addressed by
topological labels.
Two Stages have been implemented to interface with the Data Access Framework I/O layer (see section
3.1) to persist and retrieve datasets. An additional stage can be used to send Events (see section 4.1) to
other pipelines or to the orchestration layer. Other stages interface with the orchestration layer (see
section 4.2) via Events to define the datasets to be operated on and report errors in the pipeline, which
are transmitted from the Stage to the Pipeline via a Python exception.
Another implementation has been developed in Python. This implementation is currently limited to the
single-task serial and multiple-task batch modes of operation with one Slice; it is primarily intended for
development and debugging purposes. It can be extended to use thread-based or MPI-based
communication in the future. Stages are implemented by Python classes; the sequence of Stages is
specified by a Python script rather than a Policy, adding more programmability. Stages pass data
through in-memory Python variables. Direct calls to the Data Access Framework are made to persist and
retrieve datasets, and errors are reported through normal Python exceptions.
[8] http://mpi-forum.org
6 02C.07.01.03 Pipeline Execution Services
6.1 Logging Subsystem
The logging subsystem is used by application and middleware code to record status and debugging
information.
6.1.1 Key Requirements
Log messages must be associated with component names organized hierarchically. Logging levels
controlling which messages are produced must be configurable on a per-component level. Messages that
are suppressed by those levels must add negligible overhead. Logs must be able to be written to
local disk files as well as sent via the event subsystem. Metadata about a component's context, such as a
description of the CCD being processed, must be able to be attached to a log message.
6.1.2 Baseline Design
Log objects are created in a parent/child hierarchy and associated with dotted-path names; each such
Log and name has an importance threshold associated with it. Methods on the Log object are used to
record log messages. One such method uses the C++ varargs functionality to avoid formatting the
message until it has been determined whether the importance meets the threshold. Log messages are
contained within LogRecords that have additional key/value contextual metadata.
Multiple LogDestination streams can be created and attached to Logs (and inherited in child Logs). Each
such stream has its own importance threshold. LogRecords may also be formatted in different ways
depending on the LogDestination. LogRecords may also be incorporated into Events (see section 4.1) and
transmitted on a topic.
Two sets of wrappers around the basic Log objects simplify logging start/stop timing messages and allow
debug messages to be compiled out.
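The per-component thresholds and deferred message formatting can be sketched as follows. The C++ design uses varargs for the same deferral effect; the names and the dict-based records here are illustrative only.

```python
# Per-component importance thresholds, keyed by dotted component name.
# The "" entry is the root default for components with no explicit setting.
THRESHOLDS = {"": 20, "harness.stage": 10}

def threshold_for(component):
    # Walk up the dotted hierarchy until a configured threshold is found,
    # so child Logs inherit their parent's setting.
    while component not in THRESHOLDS:
        component = component.rpartition(".")[0]
    return THRESHOLDS[component]

def log(component, importance, fmt, *args, metadata=None):
    """Build a log record only if the component's threshold admits it."""
    if importance < threshold_for(component):
        return None              # fmt % args is never evaluated: no overhead
    return {"component": component,
            "msg": fmt % args,
            "metadata": metadata or {}}     # contextual key/value metadata
```

Here `log("harness.stage.psf", 15, "fit CCD %d", 42)` produces a record because the inherited threshold is 10, while a message of importance 5 to an unconfigured component is suppressed without ever formatting its text.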
6.1.3 Prototype Implementation
A prototype implementation was created in C++ for DC2; the debugging and logging components of that
implementation were merged for DC3a. The C++ interface is wrapped by SWIG into Python.
7 02C.07.01.07 File System Services
7.1 Data Access Framework Replication Layer
The DAF replication layer moves data from site to site over the WAN, including providing for image
regeneration and caching.
7.1.1 Key Requirements
The replication layer must be able to reliably and rapidly move files from one site to another. It must be
fully automatable and monitorable, and it must scale to the billions of files that the DMS will contain. It
must be able to drive the retrieval of archived raw images, their processing into calibrated science
images (CSIs), the caching of the resulting images, and the cutting out of regions from those images for
access by remote users.
7.1.2 Baseline Design
The iRODS [9] rule-based trigger mechanism and its pluggable micro-services capability are well suited to
handling inter-site transfer tasks, and the iRODS feature set matches our requirements well. Accordingly,
it is the baseline for this layer.
7.1.3 Prototype Implementation
We tested iRODS in 2006 to assess its bandwidth efficiency and ability to sustain high transfer rates. The
test successfully transferred image data at the rate of 6 TB/day (more than 10% of the LSST
base-to-archive average rate). We also attempted to use iRODS to transfer image simulation results,
learning that improper configuration could cause performance and usability difficulties.
We are continuing to experiment with packages that could also meet our requirements, such as the
REDDnet [10] distributed storage facility, the Xrootd [11] scalable data access system, and other
wide-area parallel filesystems.
[9] http://www.irods.org/
[10] http://www.reddnet.org/
[11] http://xrootd.slac.stanford.edu/