LSST Data Management Middleware Design
LDM-152
7/25/2011
Large Synoptic Survey Telescope (LSST)
Data Management Middleware
Design
Kian-Tat Lim, Ray Plante, and Gregory Dubois-Felsmann
LDM-152
July 25, 2011
The contents of this document are subject to configuration control and may not be changed, altered, or their
provisions waived without prior approval of the LSST Change Control Board.
Change Record

Version: 1
Date: 7/25/2011
Description: Initial version based on pre-existing UML models and presentations
Owner name: Kian-Tat Lim
Table of Contents

Change Record ................................................................................ i
1  Executive Summary ....................................................................... 1
2  Introduction ............................................................................... 1
3  02C.06.02.01 Database and File Access Services ............................ 1
   3.1  Data Access Framework I/O Layer .......................................... 2
4  02C.07.01.01 Control and Management Services ............................. 3
   4.1  Event Subsystem .................................................................. 3
   4.2  Orchestration ...................................................................... 4
5  02C.07.01.02 Pipeline Construction Toolkit ................................... 5
   5.1  Policy Framework ................................................................ 5
   5.2  Pipeline Harness .................................................................. 7
6  02C.07.01.03 Pipeline Execution Services ...................................... 8
   6.1  Logging Subsystem .............................................................. 8
7  02C.07.01.06 System Administration and Operations Services .......... 9
   7.1  Data Management Control System .......................................... 9
8  02C.07.01.07 File System Services ............................................... 10
   8.1  Data Access Framework Replication Layer .............................. 10
The LSST Data Management Middleware Design
1 Executive Summary
The LSST middleware is designed to isolate scientific applications, including the Alert Production, Data
Release Production, Calibration Products Production, and Level 3 processing, from details of the
underlying hardware and system software. It enables flexible reuse of the same code in multiple
environments ranging from offline laptops to shared-memory multiprocessors to grid-accessed clusters,
with a common communication and logging model. It ensures that key scientific and deployment
parameters controlling execution can be easily modified without changing code but also with full
provenance to understand what environment and parameters were used to produce any dataset. It provides
flexible, high-performance, low-overhead persistence and retrieval of datasets with data repositories and
formats selected by external parameters rather than hard-coding. Middleware services enable efficient,
managed replication of data over both wide area networks and local area networks.
2 Introduction
This document describes the baseline design of the LSST middleware, including the following elements
of the Data Management (DM) Construction Work Breakdown Structure (WBS):
- 02C.06.02.01 Database and File Access Services
- 02C.07.01.01 Control and Management Services
- 02C.07.01.02 Pipeline Construction Toolkit
- 02C.07.01.03 Pipeline Execution Services
- 02C.07.01.06 System Administration and Operations Services
- 02C.07.01.07 File System Services
The LSST database design, which contributes to WBS element 02C.06.02.01, may be found in the
document entitled “Data Management Database Design” (LDM-135).
Common to all aspects of the middleware design is an emphasis on flexibility through the use of abstract,
pluggable interfaces controlled by managed, user-modifiable parameters. In addition, the substantial
computational and bandwidth requirements of the LSST Data Management System (DMS) force the
designs to be conscious of performance, scalability, and fault tolerance. In most cases, the middleware
does not require advances over the state of the art; instead, it requires abstraction to allow for future
technological change and aggregation of tools to provide the necessary features.
3 02C.06.02.01 Database and File Access Services
This WBS element contains the I/O layer of the Data Access Framework (DAF).
3.1 Data Access Framework I/O Layer
This layer provides access to local resources (within a data access center, for example) and low-performance access to remote resources. These resources may include images, non-image files, and databases. Bulk data transfers over the wide-area network (WAN) and high-performance access to remote resources are provided by the replication layer within 02C.07.01.07 File System Services.
3.1.1 Key Requirements
The DAF I/O layer must provide persistence and retrieval capabilities to application code. Persistence is
the mechanism by which application objects are written to files in some format or a database or a
combination of both; retrieval is the mechanism by which data in files or a database or a combination of
both is made available to application code in the form of an application object. Persistence and retrieval
must be low-overhead, allowing efficient use of available bandwidth. The interface to the I/O layer must
be usable by application developers. It is required to be flexible, allowing changes in file formats or even
whether a given object is stored in a file or the database to be selected at runtime in a controlled manner.
Image data must be able to be stored in standard FITS format, although the metadata for the image may
be in either FITS headers or database table entries.
3.1.2 Baseline Design
We designed the I/O layer to provide access to datasets. A dataset is a logical grouping of data that is
persisted or retrieved as a unit, typically corresponding to a single programming object or a collection of
objects. Dataset types are predefined. Datasets are identified by a unique identifier. Datasets may be
persisted into multiple formats.
A Formatter subclass is responsible for converting the in-memory version of an object to its persisted
form (or forms), represented by a Storage subclass, and vice versa. The Storage interface may be thin (e.g.
providing a file's pathname) or thick (e.g. providing an abstract database interface) depending on the
complexity of the persisted format; all Formatters using a Storage are required to understand its interface,
but no application code need do so. One Storage will represent the publish/subscribe interface used by the
camera data acquisition system to deliver image data. Formatters and Storages are looked up by name at
runtime, so they are fully pluggable. Formatters may make use of existing I/O libraries such as cfitsio, in
which case they are generally thin wrappers. Formatters are configured by Policies.
All persistence and retrieval is performed under the control of a Persistence object. This object is
responsible for interpreting the overall persistence Policy, managing the lookups and invocations of
Formatters and Storages, and ensuring that any transaction/rollback handling is done correctly.
3.1.3 Alternatives Considered
Use of a full-fledged object-relational mapping system for output to a database was considered but
determined to be too heavyweight and intrusive.
3.1.4 Prototype Implementation
A C++ implementation of the design was created for Data Challenge 2 (DC2) and has evolved since then.
Formatters for images and exposures, sources and objects, and PSFs have been created. Datasets are
identified by URLs. Storage classes include FITS1 files, Boost::serialization2 streams (native and XML),
and the MySQL3 database (via direct API calls or via an intermediate, higher-performance, bulk-loaded
tab-separated value file). The camera interface has not yet been prototyped.
This implementation has been extended in DC3 to include a Python-based version of the same design that
uses the C++ implementation internally. In the Python version, a Data Butler plays the role of the
Persistence object. It takes dataset identifiers that are composed of key/value pairs, with the ability to
infer missing values as long as those provided are unique. An internal Mapper class uses a Policy to
control the format and location for each dataset. A Python-only Storage class has been added to allow
persistence via the Python "pickle" mechanism.
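The Data Butler's inference of missing identifier values can be sketched as follows. The function and the sample identifiers are invented for illustration; in the prototype this role is played by the Mapper, driven by a Policy.

```python
# Hypothetical sketch of butler-style dataset identifier completion:
# identifiers are key/value pairs, and omitted keys are inferred as
# long as the provided keys match exactly one known dataset.

KNOWN_DATASETS = [  # stand-in for what a Mapper would enumerate
    {"visit": 101, "ccd": 3, "filter": "r"},
    {"visit": 101, "ccd": 4, "filter": "r"},
    {"visit": 205, "ccd": 3, "filter": "g"},
]

def complete_data_id(partial):
    """Return the unique full identifier matching the partial one."""
    matches = [d for d in KNOWN_DATASETS
               if all(d.get(k) == v for k, v in partial.items())]
    if len(matches) != 1:
        raise LookupError("identifier is ambiguous or unknown: %r" % partial)
    return matches[0]
```

Here `{"visit": 205}` suffices to locate a dataset, while `{"visit": 101}` is rejected as ambiguous because two CCDs match.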
4 02C.07.01.01 Control and Management Services
4.1 Event Subsystem
The event subsystem is used to communicate among components of the DM System, including between
pipelines in a production. A monitoring component of the subsystem can execute rules based on patterns
of events, enabling fault detection and recovery.
4.1.1 Key Requirements
The event subsystem must reliably transfer events from source to multiple destinations. There must be no
central point of failure. The subsystem must be scalable to handle high volumes of messages, up to tens of
thousands per second. It must interface to languages including Python and C++.
A monitoring component must be able to detect the absence of messages within a given time window and
the presence of messages (such as logged exceptions) defined by a pattern.
4.1.2 Baseline Design
The subsystem will be built as a wrapper over a reliable messaging system such as Apache ActiveMQ4.
Event subclasses and standardized metadata will be defined in C++ and wrapped using SWIG5 to make
them accessible from Python. Events will be published to a topic; multiple receivers may subscribe to that
topic to receive copies of the events.
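The topic semantics described above (every subscriber to a topic receives its own copy of each event) can be sketched with a minimal in-process broker. The real subsystem wraps an external messaging system such as ActiveMQ; this class and its method names are purely illustrative.

```python
# Minimal in-process sketch of topic-based publish/subscribe.
from collections import defaultdict

class EventBroker:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Each subscriber on the topic gets its own copy of the event,
        # so one receiver cannot mutate what another one sees.
        for callback in self._subscribers[topic]:
            callback(dict(event))
```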
1 http://fits.gsfc.nasa.gov/
2 http://www.boost.org/doc/libs/1_47_0/libs/serialization/doc/index.html
3 http://www.mysql.com/
4 http://activemq.apache.org
5 http://www.swig.org
The event monitor subscribes to topics that indicate faults or other system status. It can match templates
to events, including boolean expressions and time expressions applied to event data and metadata.
Observatory Control System (OCS) messages destined for the DM System will be translated into DM
Event Subsystem events by dedicated software and published to appropriate topics.
4.1.3 Prototype Implementation
An implementation of the event subsystem on Apache ActiveMQ was created for DC2 and has evolved
since then. Command, Log, Monitor, PipelineLog, and Status event types have been defined. Event
receivers include pipeline components, orchestration components, the event monitor, and a logger that
inserts entries into a database.
The event monitor has been prototyped in Java. The OCS message translator has not yet been prototyped.
4.2 Orchestration
The orchestration layer is responsible for deploying pipelines and Policies onto nodes, ensuring that their
input data is staged appropriately, distributing dataset identifiers to be processed, recording provenance,
and actually starting pipeline execution.
4.2.1 Key Requirements
The orchestration layer must be able to deploy pipelines and their associated configuration Policies onto
one or more nodes in a cluster. Different pipelines may be deployed to different, although possibly
overlapping, subsets of nodes. All four pipeline execution models (see section 5.2.1) must be supported.
Sufficient provenance information must be captured to ensure that datasets can be reproduced from their
inputs.
The orchestration layer at the Base Center works with the DM Control System (DMCS, see section 7.1) at
that Center to accept commands from the OCS to enter various system modes such as Nightly Observing
or Daytime Calibration. The DMCS invokes the orchestration layer to configure and deploy the Alert
Production pipelines accordingly. At the Archive Center, the orchestration layer manages execution of the
Data Release Production, including sequencing scans through the raw images in spatial and temporal
order.
Orchestration must detect failures, categorize them into permanent and possibly-transient, and restart
transiently-failed processing according to the appropriate fault tolerance strategy.
4.2.2 Baseline Design
The design for the orchestration layer is a pluggable, Policy-controlled framework. Plug-in modules are
used to configure and deploy pipelines on a variety of underlying process management technologies (such
as simple ssh6 or more complex Condor-G7 glide-ins); this pluggability is necessary during design and development, when hardware is typically borrowed rather than owned. Additional modules capture hardware, software,
and configuration provenance, including information about the execution nodes, the versions of all
software packages, and the values of all configuration parameters for both middleware and applications.
This layer monitors the availability of datasets and can trigger the execution of pipelines when their inputs
become available. It can hand out datasets to pipelines based on the history of execution and the
availability of locally-cached datasets to minimize data movement.
Faults are detected by the pipeline harness and event monitor timeouts. Orchestration then reprocesses
transiently-failed datasets on already-deployed pipelines or else deploys a new pipeline instance for the
reprocessing.
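The retry policy sketched below illustrates the failure handling described above: failures are classified as permanent or possibly-transient, and only transient ones are re-run, up to a bounded number of attempts. The function name, the classifier callback, and the attempt limit are assumptions made for the example.

```python
# Hedged sketch of transient-failure retry in an orchestration layer.

def run_with_retries(process, dataset, is_transient, max_attempts=3):
    """Run process(dataset), retrying on transient failures only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return process(dataset)
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts:
                raise  # permanent failure, or retries exhausted
```

In the real system the "retry" would mean handing the dataset back to an already-deployed pipeline, or deploying a new pipeline instance, rather than a simple in-process call.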
4.2.3 Prototype Implementation
A prototype implementation of the deployment framework was developed for DC3a. It was extended to
handle Condor-G, and data dependency features were added for DC3b. Full fault tolerance has not yet
been prototyped, although a limited application of a fault tolerance strategy has been demonstrated.
Provenance is recorded in files and, to a limited extent, in a database. The file-based provenance has been
demonstrated to be sufficient to regenerate datasets.
5 02C.07.01.02 Pipeline Construction Toolkit
5.1 Policy Framework
The Policy component of the Pipeline Framework is of key importance throughout the LSST middleware.
Policies are a mechanism to specify parameters for applications and middleware in a consistent, managed
way. The use of Policies facilitates runtime reconfiguration of the entire system while still ensuring
consistency and the maintenance of traceable provenance.
5.1.1 Key Requirements
Policies must be able to contain parameters of various types, including at least strings, booleans, integers,
and floating-point numbers. Ordered lists of each of these must also be supported. Each parameter must
have a name. A hierarchical organization of names is required so that all parameters associated with a
given component may be named and accessed as a group.
There must be a facility to specify legal and required parameters and their types and to use this
information to ensure that invalid parameters are detected before code attempts to use them. Default
6 http://openssh.com/
7 http://www.cs.wisc.edu/condor/condorg/
values for parameters must be able to be specified; it must also be possible to override those default
values, potentially multiple times (with the last override controlling).
Policies and their parameters must be stored in a user-modifiable form. It is preferable for this form to be
textual so that it is human-readable and modifiable using an ordinary text editor.
It must be possible to save sufficient information about a Policy to obtain the value of any of its
parameters as seen by the application code.
5.1.2 Baseline Design
The design follows straightforwardly from the requirements.
Policies are specified by a text file containing hierarchically-organized name/value pairs. A value may be
another Policy (referred to as a sub-Policy). A value may also be a list of values (all of the same type).
Policies may reference other Policies to set values for sub-Policies.
A Dictionary, which is also a Policy, specifies the legal parameter names, their types, minimum and
maximum lengths for list values, and whether a parameter is required. Since Dictionaries are Policies,
they may use Policy references to incorporate other dictionaries to validate sub-Policies.
Each piece of application code (routine or object) using a Policy will typically have an associated
Dictionary to validate the Policy parameters and provide default values. If a higher-level routine invokes a
lower-level Policy-using routine, it passes in the appropriate Policy, which may be a sub-Policy of the
higher-level routine's Policy. The higher-level routine may also provide defaults, adding to or overriding
the Dictionary defaults.
With text-file Policies, the complete parameter state of a given execution may be preserved by preserving
all the text files. In addition, the simple hierarchical syntax lends itself to storage in a database as a
key/value table with dotted-name keys, allowing queries of the parameters by name (including the use of
regular expressions) and value.
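The mapping from a hierarchical Policy to a dotted-name key/value table can be sketched directly. The function below treats nested dicts as sub-Policies; its name and the sample parameters are invented for the example.

```python
# Sketch of flattening a hierarchical Policy into dotted-name keys,
# suitable for storage in a key/value database table.

def flatten_policy(policy, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} pairs."""
    rows = {}
    for name, value in policy.items():
        key = name if not prefix else prefix + "." + name
        if isinstance(value, dict):  # a sub-Policy: recurse
            rows.update(flatten_policy(value, key))
        else:
            rows[key] = value
    return rows
```

Once flattened, parameters can be queried by name pattern and value with ordinary SQL.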
5.1.3 Prototype Implementation
An implementation of Policy using a simple "name: value" syntax with brace-delimited sub-Policies has
been in use since DC2. Hierarchical names are specified using dotted-path notation. XML syntax was
considered but determined to be too wordy and difficult to edit. The dotted-path notation does not
currently support referring to individual list elements.
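For concreteness, a small Policy in this style of syntax might look as follows. The parameter names and values here are invented for illustration, not taken from a real LSST pipeline, and details of the actual grammar (quoting, comments) may differ.

```text
nThreads: 4
doWriteCalexp: true
background: {
    algorithm: "NATURAL_SPLINE"
    binSize: 256
}
```

The brace-delimited block is a sub-Policy; the parameter `background.binSize` would be addressed with dotted-path notation.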
Dictionaries have been implemented with validation for fixed parameter names. Extending this validation
to variable parameter names (e.g. for parameters pertaining to pluggable measurement algorithms) has not
yet been implemented. Automatic merging of overrides and validation of the result is also currently
unimplemented; instead, application code must merge default values into an incoming Policy using an
API call.
Inter-Policy references are implemented using pathnames or references that can locate Policies with
respect to their containing software packages.
Provenance code can load a complete set of Policies into a set of database tables for querying. Loading of
simple lists of values is supported, but loading of lists of sub-Policies has not yet been implemented.
5.2 Pipeline Harness
A pipeline is a very common representation of astronomical processing. Datasets are processed by a series
of components in turn. Each component applies an algorithm to one or more input datasets, producing one
or more outputs that are handed to the next component. The pipeline harness provides the ability to create
these pipelines.
5.2.1 Key Requirements
The pipeline harness must allow components to be specified in Python. It must handle the transfer of
datasets from component to component. To ensure adequate performance for the Alert Production, such
data transfer must be possible in memory, not solely through disk files. Pipeline components must be able
to report errors and thereby prevent the execution of downstream components.
The pipeline harness must support execution in four modes:
- Single task (serial mode). One pipeline instance executes on one dataset. This mode is useful for development, testing, and debugging.
- Single task (parallel mode). Multiple linked pipeline instances execute on multiple datasets belonging to a single task while communicating amongst themselves and synchronizing when appropriate. This mode is required for real-time alert processing.
- Multiple tasks (batch mode). Multiple pipeline instances execute on one dataset each. Instances are independent of each other except that an instance may not be executed until all of its inputs are available. Instances may be executing different code to perform different tasks. This mode is required for some types of Data Release processing.
- Multiple tasks (data-sensitive mode). Multiple long-lived pipeline instances execute on multiple datasets, with dataset assignments to pipelines depending on past history, enabling repeatedly-used data to be kept in memory or at least on the node. Instances may again be executing different code for different tasks. This mode is required for some types of data-intensive Data Release processing.

5.2.2 Baseline Design
The pipeline harness comprises Pipeline, Slice, and Stage objects. A Pipeline is responsible for
starting, stopping, and synchronizing Slices. Each Slice is a linear, unbranched sequence of Stages; all
Slices run in parallel. The configuration of each Stage is controlled by a Policy. Each Stage wraps a single
algorithmic process. Stages may pass in-memory objects to downstream Stages.
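A single-Slice version of this design can be sketched as below. The Stage subclasses, the Clipboard keys, and the `run_slice` driver are invented for the example; the real harness adds Policies, inter-Slice communication, and synchronization.

```python
# Illustrative sketch: a Slice as a linear sequence of Stages passing
# in-memory data through a Clipboard of name/value pairs.

class Stage:
    def process(self, clipboard):
        raise NotImplementedError

class BackgroundSubtractStage(Stage):
    def process(self, clipboard):
        pixels = clipboard["pixels"]
        mean = sum(pixels) / len(pixels)
        clipboard["pixels"] = [p - mean for p in pixels]

class StatsStage(Stage):
    def process(self, clipboard):
        # Consumes the upstream Stage's output from the Clipboard.
        clipboard["max"] = max(clipboard["pixels"])

def run_slice(stages, clipboard):
    for stage in stages:
        stage.process(clipboard)  # an exception here halts later Stages
    return clipboard
```

An uncaught exception in any `process` call prevents downstream Stages from running, matching the error-propagation requirement above.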
5.2.3 Prototype Implementations
One implementation of the harness design has been developed in C++ and Python. Slices can
communicate with each other via a thread model, via MPI8, or not at all. The sequence of Stages is defined by a Policy that also contains the Stage configuration Policies. Stages receive data via an in-memory Clipboard that contains name/value pairs; they process this data and place their output (or
transformed input) onto the Clipboard for the next Stage. Items on the Clipboard may be transmitted from
one Slice to another using the inter-Slice communication mechanism; Slices are addressed by topological
labels.
Two Stages have been implemented to interface with the Data Access Framework I/O layer (see section
3.1) to persist and retrieve datasets. An additional stage can be used to send Events (see section 4.1) to
other pipelines or to the orchestration layer. Other stages interface with the orchestration layer (see
section 4.2) via Events to define the datasets to be operated on and report errors in the pipeline, which are
transmitted from the Stage to the Pipeline via a Python exception.
Another implementation has been developed in Python. This implementation is currently limited to the
single-task serial and multiple-task batch modes of operation with one Slice; it is primarily intended for
development and debugging purposes. It can be extended to use thread-based or MPI-based
communication in the future. Stages are implemented by Python classes; the sequence of Stages is
specified by a Python script rather than a Policy, adding more programmability. Stages pass data through
in-memory Python variables. Direct calls to the Data Access Framework are made to persist and retrieve
datasets, and errors are reported through normal Python exceptions.
6 02C.07.01.03 Pipeline Execution Services
6.1 Logging Subsystem
The logging subsystem is used by application and middleware code to record status and debugging
information.
6.1.1 Key Requirements
Log messages must be associated with component names organized hierarchically. Logging levels
controlling which messages are produced must be configurable on a per-component level. There must be
a way for messages that are not produced to not add overhead. Logs must be able to be written to local
disk files as well as sent via the event subsystem. Metadata about a component's context, such as a
description of the CCD being processed, must be able to be attached to a log message.
8 http://mpi-forum.org
6.1.2 Baseline Design
Log objects are created in a parent/child hierarchy and associated with dotted-path names; each such Log
and name has an importance threshold associated with it. Methods on the Log object are used to record
log messages. One such method uses the C++ varargs functionality to avoid formatting the message until
it has been determined if the importance meets the threshold. Log messages are contained within
LogRecords that have additional key/value contextual metadata.
Multiple LogDestination streams can be created and attached to Logs (and inherited in child Logs). Each
such stream has its own importance threshold. LogRecords may also be formatted in different ways
depending on the LogDestination. LogRecords may also be incorporated into Events (see section 4.1) and
transmitted on a topic.
Two sets of wrappers around the basic Log objects simplify logging start/stop timing messages and allow
debug messages to be compiled out.
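The key performance idea above (test the importance threshold before formatting the message) can be sketched as follows. The class, level constants, and record format are illustrative, not the LSST logging API.

```python
# Sketch of hierarchical Logs with per-Log thresholds and lazy
# formatting: suppressed messages are never formatted at all.

DEBUG, INFO, WARN = 0, 5, 10

class Log:
    def __init__(self, name, threshold=INFO, parent=None):
        self.name = name
        self.threshold = threshold
        # Children share the parent's destination (a list, here).
        self.records = [] if parent is None else parent.records

    def child(self, suffix, threshold=None):
        t = self.threshold if threshold is None else threshold
        return Log(self.name + "." + suffix, t, parent=self)

    def log(self, importance, fmt, *args):
        if importance >= self.threshold:   # check before formatting
            self.records.append((self.name, fmt % args))
```

A child Log can lower its threshold (e.g. to DEBUG for one component) without affecting its parent, giving the per-component level control the requirements call for.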
6.1.3 Prototype Implementation
A prototype implementation was created in C++ for DC2; the debugging and logging components of that
implementation were merged for DC3a. The C++ interface is wrapped by SWIG into Python.
7 02C.07.01.06 System Administration and Operations Services
7.1 Data Management Control System
The LSST DMS at each center will be monitored and controlled by a Data Management Control System
(DMCS).
7.1.1 Key Requirements
The DMCS at each site is responsible for initializing and running diagnostics on all equipment, including
computing nodes, disk storage, tape storage, and networking. It establishes and maintains connectivity
with the other sites including the Headquarters Site. It monitors the operation of all hardware and
integrates with the orchestration layer (see section 4.2) to monitor software execution. System status and
control functions will be available via a Web-enabled tool to the Headquarters Site and remote locations.
At the Base Center, the DMCS is responsible for interfacing with the OCS (as defined in ICD ...). It
accepts commands from the OCS to enter various modes, including observing, calibration, day,
maintenance, and shutdown. It then configures and invokes the orchestration layer and the replication
layer (see section 8.1) to enable the necessary processing and data movement for each mode, including
running the Alert Production and replicating raw images to the Archive Center, respectively.
At the Archive Center, the DMCS performs resource management for the compute cluster. Parts of the
cluster may be dedicated to certain activities while others operate in a shared mode. The major processing
activities under DMCS control, invoked using the orchestration layer, include the Alert Production
reprocessing (on dedicated hardware), the Calibration Products Production, and the Data Release
Production. The DMCS also initializes the replication layer to enable the archiving of raw images
received from the Base Site.
At each Data Access Center, the DMCS performs resource management for the level 3 data products
compute cluster. It also initializes the replication layer to enable the distribution of level 1 data products
received from the Base Center or the Archive Center.
7.1.2 Baseline Design
The DMCS will consist of an off-the-shelf cluster management system together with a custom pluggable
software framework. A Web-based control panel and an off-the-shelf monitoring system will also be
integrated. Plugins will include job management systems like Condor, mode transition scripts to interface
with the OCS and control panel, and hardware-specific initialization and configuration software.
8 02C.07.01.07 File System Services
8.1 Data Access Framework Replication Layer
The DAF replication layer moves data from site to site over the WAN, including providing for image
regeneration and caching.
8.1.1 Key Requirements
The replication layer must be able to reliably and rapidly move files from one site to another. It must be
fully automatable and monitorable, and it must scale to the billions of files that the DMS will contain. It
must be able to drive the retrieval of archived raw images, their processing into calibrated science images (CSIs), the caching of the resulting images, and the extraction of cutout regions from those images for access by remote users.
8.1.2 Baseline Design
The rule-based mechanisms within iRODS9 and its ability to use pluggable micro-services are ideal for
handling inter-site transfer tasks. The iRODS feature set matches well with our requirements.
Accordingly, it is the baseline for this layer.
9 http://www.irods.org/
8.1.3 Prototype Implementation
We tested iRODS in 2006 to assess its bandwidth efficiency and ability to sustain high transfer rates. The
test successfully transferred image data at the rate of 6 TB/day (more than 10% of the LSST base-to-archive average rate). We also attempted to use iRODS to transfer image simulation results, learning that
improper configuration could cause performance and usability difficulties.
We are continuing to experiment with packages that could also meet our requirements such as the
REDDnet10 distributed storage facility, the Xrootd11 scalable data access system, and other wide-area
parallel filesystems.
10 http://www.reddnet.org/
11 http://xrootd.slac.stanford.edu/