Automated Operation of the LSST
Data Management System
*** DRAFT ***
LDM-230
Kian-Tat Lim, Gregory Dubois-Felsmann, Mario Juric, Dick Shaw, Jacek Becla,
and the LSST Data Management team
23 May 2013
Table of Contents
1 Introduction
2 Alert Production
  2.1 Base DMCS and OCS Commandable Entities
    2.1.1 init command
    2.1.2 configure command
    2.1.3 enable command
    2.1.4 disable command
    2.1.5 release command
    2.1.6 stop command
    2.1.7 abort command
    2.1.8 reset command
    2.1.9 startIntegration event
    2.1.10 nextVisit event
  2.2 EFD replication
  2.3 Alert Production Hardware
    2.3.1 Replicator
    2.3.2 Distributor
    2.3.3 Worker
  2.4 Catch-Up Archiver
  2.5 Calibration image and engineering image modes
  2.6 Daytime DM operations mode
  2.7 Failure Modes
  2.8 Maintenance and Upgrades
3 Calibration Products Production
4 Data Release Production
  4.1 Overall Sequence
  4.2 Detailed Sequence
  4.3 Parallelization
  4.4 Input and Output
  4.5 Failure Modes
  4.6 Maintenance and Upgrades
5 Data Access Center
  5.1 Databases
  5.2 Image Storage
  5.3 Level 3 Storage and Compute
  5.4 Failure Modes
  5.5 Maintenance and Upgrades
Appendix: Abbreviations
1 Introduction
This document details the automated operations concept for the LSST Data Management System
(DMS). It describes how the various components of the application,1 middleware,2 and infrastructure3
layers of the DMS work together to enable generation, storage, and access for the Level 1, Level 2, and
Level 3 data products.4 It specifies how processing and data will flow from one component to another.
There are four major parts within the DMS: the Alert Production and its associated Archivers, the
Calibration Products Production, the Data Release Production (DRP), and the Data Access Center
(DAC), which also provides facilities for Level 3 science processing.
1 Data Management Applications Design, LDM-151
2 Data Management Middleware Design, LDM-152
3 Data Management Infrastructure Design, LDM-129
4 LSST Data Products Definition Document, LSE-TBD
These four parts are implemented across five major centers located at three sites: the Base Center and
Chilean Data Access Center (DAC) located at the AURA compound in La Serena, Chile; the Archive
Center and US DAC located at NCSA in Urbana-Champaign, Illinois, USA; and the French Center
located at CC-IN2P3 in Lyon, France. The DM system also communicates with the Camera and the
Observatory Control System located at the Summit Facility on Cerro Pachon, Chile.
Note that many of these operations rely on the LSST Observatory Network to transfer data and/or
control information. The operations specific to the network itself are not in the scope of this document;
they are covered in the LSST Network Operations and Management Plan (Document-11918). This
document presumes that the network is operating normally except where specifically called out.
2 Alert Production
The Alert Production's primary responsibilities are:
1. To archive all images from the Camera, including science images, wavefront sensor images,
calibration frames, and engineering images, to tape archives at both the Base and Archive
Centers (these are also replicated separately to the French Center),
2. To process these images to generate Level 1 data products, especially alerts indicating that
something has changed on the sky and orbits for Solar System objects, and
3. To provide image quality feedback to the Observatory Control System (OCS).
The science images include both crosstalk-corrected images that are used for immediate Level 1
processing and raw, uncorrected images that are permanently stored.
The Alert Production can be described from a “top-down” perspective, starting with the “commandable
entities”, which are software devices that the OCS can send commands to and receive status messages,
events, and telemetry from. It can also be described from a “bottom-up” perspective starting with the
physical machines used. Here, we start with the top-down view, going into more detail on the machines
and their operations afterwards.
For context, here are the basic functions of some of the Data Management (DM) infrastructure
components (see Figure 1):
1. “Replicator” computers at the Base that receive images from the Camera and associated
telemetry, transfer them to local storage, and send them over the wide-area network (WAN) to
the distributor machines at the Archive.
2. A network outage buffer at the Base that retains a copy of each image in non-volatile storage for
a limited time in case of WAN failure.
3. Tape archives at the Base and Archive that retain permanent copies of each image and other data
products.
4. Shared disk storage for inputs and Level 1 data products at the Chilean and US DACs.
5. “Distributor” computers at the Archive that receive images and telemetry from the replicator
machines and transfer them to local storage and the worker machines.
6. “Worker” computers at the Archive that perform the Alert Production computations.
7. Base and Archive DM Control Systems (DMCSs) running on one or more computers at each
location that control and monitor all processing.
8. A DM Event Services Broker running on one or more computers at the Archive that mediates all
DM Event Services messaging traffic.
9. A Calibration database at the US DAC that keeps information necessary to calibrate images.
10. Engineering and Facilities Database (EFD) replicas at the Chilean and US DACs and the French
Center that store all observatory commands and telemetry.
11. The Level 1 database at the Chilean and US DACs that stores the Level 1 catalog data products.
12. The Level 2 database at the US DAC that stores measurements of astronomical Objects.
13. An Alert Production control database at the Base that maintains records of all data transfer and
processing and is used by the Base DMCS.
Figure 1: Alert Production Hardware
2.1 Base DMCS and OCS Commandable Entities
The Alert Production hardware is divided into four commandable entities from the perspective of the
OCS:
1. Archiver: responsible for archiving images in real time.
2. Catch-Up Archiver: responsible for archiving images that did not get captured in real time due to
an outage of some part of the DM system.
3. EFD Replicator: responsible for replicating the EFD from the Summit to the Chilean DAC, the
US DAC, and the French Center.
4. Alert Production Cluster: responsible for generating Level 1 data products.
Each commandable entity can be commanded by the OCS to configure, enable, or disable
itself, along with obeying other generic OCS commands such as init, release, stop, and abort.
Each commandable entity publishes events and telemetry to the OCS for use by the observatory
operations staff.
All these commandable entities are implemented in the Base DMCS. They all run on a single machine,
which is the only one that communicates directly with the OCS. If it fails, as detected by heartbeat
monitoring, it is powered down and a spare machine is enabled at the same IP address, possibly
missing one or more visits.
The Base DMCS communicates with the OCS via the Data Distribution Service (DDS), through which
it receives commands according to a well-defined asynchronous command protocol5 and sends
command result messages, status updates, events, and telemetry. It should be noted that the
commandable entities do their processing while in the IDLE state from the perspective of the command
protocol.
The Base DMCS will be booted before the start of each night's observing to ensure that the system is in
a clean configuration. When the Base DMCS cold boots, it performs a self-test sequence
to verify that it can communicate with the DM Event Services Broker (for DM-internal
communications) and the OCS (via DDS). After the self test sequence, the commandable entities start
up, in no particular defined configuration, and publish the OFFLINE state to the OCS.
The Base DMCS uses the Orchestration Manager (currently baselined to be implemented using
HTCondor6) to start jobs on the replicators, distributors, and workers. The Orchestration Manager may
run on the Base DMCS host or another machine.
The typical sequence of OCS commands after a cold boot will be init, configure, and enable
for each commandable entity.
5 Interface Control Document: LSST Observatory Control System Communication Architecture and Protocol, LSE-70;
Interface Control Document: OCS-Data Management Software Communication Interface, LSE-72; and Interface Support
Document: System Dictionary and Telemetry Streams, LSE-74.
6 http://research.cs.wisc.edu/htcondor/
2.1.1 init command
This instructs the OCS-visible commandable entity controlled by the Base DMCS to move from an
OFFLINE state to a normal commandable IDLE state. Successful completion requires that the Base
DMCS ensure that OCS global control is not locked out by DM engineering (e.g. software
installation, diagnostic tests, etc.).
2.1.2 configure command
This tells one of the OCS-visible commandable entities controlled by the Base DMCS to establish or
change its configuration. The configuration includes the set of computers to be used, the software to
be executed on them, and parameters used to control that software. There will be several standard
configurations used during operations (although each configuration will change with time); each
such configuration can be thought of as a mode of the corresponding DM commandable entities.
Some modes may apply to multiple commandable entities at the same time. Changing modes (by
reconfiguring the commandable entities) is expected to take from seconds to possibly a few minutes;
it is intended that mode changes may occur at any time and multiple times during a night.
Besides normal science observing mode, available configurations will include raw calibration image
and engineering image modes for the Archiver and Alert Production Cluster in which there are no
visits and different data products are generated. Another mode for the Alert Production Cluster will
be daytime DM operations (disconnected from the camera), in which the Alert Production Cluster
will be used to perform solar system object orbit-fitting and various daily maintenance and update
tasks and the Archiver is disabled or offline.
First, the Base DMCS verifies the command format and accepts the command. Then it checks that
the configuration is legal and consistent and that various prerequisites are met. When the check is
complete, the commandable entity is disabled (see the disable command in section 2.1.4 ), the
configuration is installed, and success is returned to the OCS. If the configuration is illegal or cannot
be installed properly, a command error (non-fatal) with failure reason is sent instead.
All of the commandable entities' configurations include the version of the software to be used. This
version must have already been installed on the participating machines. The presence of the
necessary software versions is checked by the Base DMCS in the Alert Production database (as
maintained by system management tools).
The Archiver's configuration prerequisite is that sufficient replicator/distributor pairs are available.
The Catch-Up Archiver's configuration prerequisite is that sufficient catch-up-dedicated
replicator/distributor pairs are available.
The Alert Production Cluster's prerequisites are that sufficient workers are available.
The EFD Replicator's prerequisite is that communication with the US DAC EFD replica is possible.
At the end of a configure command, the commandable entity is always disabled.
2.1.3 enable command
This command enables the commandable entity to run and process events and data. An enable
command is rejected if no configuration has been selected by a prior configure command to the
commandable entity.
Enabling the Archiver causes the Base DMCS to subscribe to the “startIntegration” event.
Enabling the Catch-Up Archiver allows it to scan for unarchived images to be handled and enables
the Orchestration Manager to schedule image archive jobs.
Enabling the Alert Production Cluster causes the Base DMCS to subscribe to the “nextVisit”
event in normal science mode; another event may be subscribed to in calibration or engineering
mode.
Enabling the EFD Replicator causes the Base DMCS to enable the US DAC and French Center EFD
replicas to be slaves to the Chilean DAC EFD replica.
2.1.4 disable command
This command disables the commandable entity from running and from processing new events and data.
Disabling the Archiver causes it to unsubscribe from the “startIntegration” event. It does
not terminate any replicator jobs already executing.
Disabling the Catch-Up Archiver stops it from scanning for unarchived images and tells the
Orchestration Manager to stop scheduling any new image archive jobs.
Disabling the Alert Production Cluster causes it to unsubscribe from the “nextVisit” event. It
does not terminate any worker jobs already executing. In particular, the processing for the current
visit (not just exposure) will normally complete.
Disabling the EFD Replicator causes the Base DMCS to disable the slave operation of the US DAC
and French Center EFD replicas.
2.1.5 release command
This is the equivalent of a disable command, but the commandable entity goes to the OFFLINE
state.
2.1.6 stop command
If issued during a configure command, this command causes the commandable entity to go into
the no configuration state.
If issued during any other command, this command is ignored.
2.1.7 abort command
If issued during a configure command, this command causes the commandable entity to go into
the ERROR state with no configuration.
If issued at any other time, this command does nothing except change the commandable entity to the
ERROR state. In particular, an abort received during enable will leave the system enabled and
taking data, but in the ERROR state from the command-processing standpoint. Note that stopping
the processing of any commandable entity is handled by the disable command, not the abort
command.
2.1.8 reset command
This command performs the equivalent of the disable command and leaves the commandable
entity in the IDLE state with no configuration.
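The command behavior described in sections 2.1.1 through 2.1.8 amounts to a small state machine per commandable entity. The following is a minimal sketch of that state machine, assuming illustrative class and method names (only the OFFLINE, IDLE, and ERROR states are taken from the text, and "enabled" is kept as a separate flag because processing happens while IDLE from the command-protocol perspective); it is not the actual Base DMCS implementation.

    from enum import Enum, auto


    class CmdState(Enum):
        """Command-protocol states named in the text."""
        OFFLINE = auto()
        IDLE = auto()
        ERROR = auto()


    class CommandableEntity:
        """Illustrative summary of sections 2.1.1-2.1.8; not the real Base DMCS code."""

        def __init__(self):
            self.state = CmdState.OFFLINE
            self.config = None       # selected configuration, if any
            self.enabled = False     # whether events and data are being processed

        def init(self):
            # OFFLINE -> IDLE, provided DM engineering has not locked out OCS control.
            if self.state is CmdState.OFFLINE:
                self.state = CmdState.IDLE

        def configure(self, config, is_valid):
            # Check the configuration, disable the entity, then install it;
            # the entity is always disabled at the end of a configure command.
            if not is_valid(config):
                raise ValueError("illegal or inconsistent configuration")
            self.disable()
            self.config = config

        def enable(self):
            # Rejected if no configuration has been selected.
            if self.config is None:
                raise RuntimeError("enable rejected: no configuration selected")
            self.enabled = True

        def disable(self):
            # Stop reacting to new events; jobs already running are left to finish.
            self.enabled = False

        def release(self):
            # Equivalent to disable, but the entity goes OFFLINE.
            self.disable()
            self.state = CmdState.OFFLINE

        def stop(self, during_configure=False):
            # Only meaningful while a configure command is in progress.
            if during_configure:
                self.config = None

        def abort(self, during_configure=False):
            # Leaves the entity in the ERROR state; processing is stopped via disable, not abort.
            if during_configure:
                self.config = None
            self.state = CmdState.ERROR

        def reset(self):
            # Equivalent to disable; IDLE with no configuration.
            self.disable()
            self.config = None
            self.state = CmdState.IDLE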
In addition to the above commands, the Base DMCS subscribes to and responds to the following events
published through the OCS DDS:
2.1.9 startIntegration event
Upon receipt of a startIntegration event, if the Archiver has been enabled, the Base DMCS
launches replicator jobs. One job is launched for each science raft (21) and one more job is launched
to handle wavefront sensor images. The middleware will preferentially allocate these jobs to the
pool of fully-operational replicators, falling back to the pool of local-only replicators if more than
two jobs are assigned per fully-operational replicator. (See section 2.3.1 below for a more complete
description of the replicator pools.)
If a replicator machine fails, the Orchestration Manager will automatically reschedule its job on
another replicator machine (or a Catch-Up Archiver replicator).
The Base DMCS will track the submission, execution, and results of all replicator jobs using
Orchestration Manager facilities and the Alert Production control database.
2.1.10 nextVisit event
Upon receipt of a nextVisit event, if the Alert Production Cluster has been enabled, the Base
DMCS launches worker jobs. One job is launched for each CCD (189) and four more jobs are
launched for the wavefront sensors. These jobs are sent to the Orchestration Manager for
distribution to the worker machines.
If a worker machine fails, the Orchestration Manager will automatically reschedule its job(s) on
another worker machine (at lower priority, so that it can be suspended or terminated if the machine
is needed to handle a current visit).
The Base DMCS will track the submission, execution, and results of all worker jobs using
Orchestration Manager facilities and the Alert Production control database.
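As a concrete illustration of the two event handlers above, the sketch below expands one startIntegration event into 21 + 1 replicator jobs and one nextVisit event into 189 + 4 worker jobs. The function names, event fields, and submit_job callable are hypothetical stand-ins for the Orchestration Manager (HTCondor) submission interface.

    NUM_SCIENCE_RAFTS = 21   # one replicator job per science raft
    NUM_SCIENCE_CCDS = 189   # one worker job per science CCD
    NUM_WAVEFRONT_CCDS = 4   # wavefront sensors on the four corner rafts


    def on_start_integration(event, submit_job):
        """Launch one replicator job per science raft plus one wavefront-sensor job."""
        for raft_id in range(NUM_SCIENCE_RAFTS):
            submit_job(kind="replicator", visit=event["visit_id"],
                       exposure=event["exposure_seq"], raft=raft_id)
        submit_job(kind="replicator-wavefront", visit=event["visit_id"],
                   exposure=event["exposure_seq"])


    def on_next_visit(event, submit_job):
        """Launch one worker job per science CCD plus four wavefront-sensor jobs."""
        for ccd_id in range(NUM_SCIENCE_CCDS):
            submit_job(kind="worker", visit=event["visit_id"],
                       boresight=event["boresight"], filter=event["filter"],
                       ccd=ccd_id)
        for wf_id in range(NUM_WAVEFRONT_CCDS):
            submit_job(kind="worker-wavefront", visit=event["visit_id"],
                       wavefront=wf_id)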
2.2 EFD replication
Not included in the Alert Production per se but closely tied to it is replication of the Engineering and
Facility Database (EFD) from the Summit to the Chilean DAC and the Chilean DAC to the US DAC
and French Center.
The replication is implemented by standard replication mechanisms for the selected database
management system used to implement the EFD. The latency for replication from the Summit to the
Chilean DAC is anticipated to be on the order of milliseconds in typical operation, although latencies of up to one visit
time are acceptable. The latency for the replication from the Chilean DAC to the US DAC is to be as
short as possible, constrained by the available bandwidth from Chile to the US, but no longer than 24
hours (except when a network outage occurs). The typical case for Chile-to-US replication is expected
to be seconds or less.
The Alert Production computations will require telemetry stored in the EFD. The design does not rely
on replication for this information, however. At the Base, the local Chilean DAC EFD replica is
queried for some information, but the OCS telemetry stream is also monitored for more recent changes
than are reflected in the results of the query. This essential data is then sent along with the image data
to the Archive for processing. If the replication proves to have sufficiently low latency and to be
sufficiently reliable, it will be easy to switch to an alternate mode in which the US DAC EFD replica is
queried for the information of interest.
2.3 Alert Production Hardware
We now describe the detailed operations performed by each Alert Production infrastructure component.
The sequence of operations for a typical visit is shown in Figure 2.
All DM hardware is monitored by DM system administration tools, which publish results via the
Archive DM Control System. Each machine verifies its software installation on boot (e.g. via hash or
checksum).
Figure 2: Visit Sequence Diagram
2.3.1 Replicator
The replicator's function is to receive raw and crosstalk-corrected images from the Camera Data
System (CDS), transfer them to local storage, and send them over the network to the distributors at the
Archive Center.
There are two pools of replicators maintained: one “fully-operational” pool and one “local-only” pool
of machines that are unable to connect to their associated distributors. (In addition, the Catch-Up
Archiver maintains a separate pool of replicator machines; see section 2.4 .)
When a replicator boots, it establishes a connection with a single, pre-configured distributor (to avoid
complex N-to-N connectivity). It also checks its connection with the network outage buffer, the Base
raw image cache, and the tape archive. When all connections have tested successfully, the replicator
registers itself with the Orchestration Manager in the fully-operational pool. If a connection to the
distributor cannot be made, perhaps because the distributor is down or because the network is not
operational, it registers itself in the local-only pool.
Replicators execute replicator jobs. These are of two types: science sensor jobs and wavefront sensor
jobs. Both types of jobs perform essentially the same tasks, just with different data. Science sensor jobs
deal with the 21 science rafts, each composed of 9 CCDs or sensors. Wavefront sensor jobs deal with
the four wavefront sensors located on the four corner rafts.
First, the job sends the visit id, exposure sequence number within the visit, and raft id (for science
sensor jobs) that it received from the Base DMCS to the replicator's connected distributor. It queries the
Base Engineering and Facility Database replica for information needed to process the image.
Subscriptions to the CCS startReadout event and OCS telemetry topics are made; the latter topics
are monitored for updates to key values, including a flag indicating whether the system is taking
science data. When the startReadout event occurs, the image id information in the event is used to
request retrieval of the crosstalk-corrected exposure for the raft using the CDS client interface7,
blocking until it is available. When the CDS delivers the image, its integrity is verified using a hash or
checksum, and the image and associated telemetry is sent over the network to the distributor,
compressing it if configured. Simultaneously, the image is written to the network outage buffer and the
raw image cache using the Data Access Client Framework. The latter two transfers are retried if
necessary (up to a configured number of retries). All images that are written are tagged with the
Archiver mode. After the crosstalk-corrected image has been sent, the raw exposure is retrieved. That
image is then sent over the network to the distributor and simultaneously written to the network outage
buffer and the tape archive. All successful (and unsuccessful) image transmissions over the network are
recorded to the Alert Production database. (Successful writes to the tape archive could also be recorded
in the database for convenience, although that poses the possibility of disagreement between the
database and the tape archive.)
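A minimal sketch of the integrity-check-and-retry behavior described above follows. The checksum algorithm, retry limit, and function names are assumptions for illustration; in the design these writes go through the Data Access Client Framework.

    import hashlib


    def verify(data: bytes, expected_digest: str) -> bool:
        """Check image integrity against the digest supplied with the data (algorithm assumed)."""
        return hashlib.sha256(data).hexdigest() == expected_digest


    def write_with_retries(write_fn, data, max_retries=3):
        """Retry a local write (outage buffer, image cache, tape) up to a configured limit."""
        for attempt in range(1, max_retries + 1):
            try:
                write_fn(data)
                return True
            except OSError:
                if attempt == max_retries:
                    return False   # recorded as a failure in the Alert Production database
        return False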
(In some calibration or engineering modes, there may only be raw image data, not crosstalk-corrected
image data; the replicator job configuration will provide for this.)
7 Interface Control Document: Data Acquisition Interface between Data Management and Camera, LSE-68.
If data cannot be sent to the distributor, or if disconnection from the distributor is detected by heartbeat
ping at any other time, the replicator unregisters from the fully-operational pool and registers in the
local-only pool. Similarly, if the connection is re-established in the future, the replicator unregisters
from the local-only pool and re-registers in the fully-operational pool.
Writing to the tape archive system is necessarily done in time order. The tape archive itself uses its
built-in disk caching capability to reorganize writes to the tapes in a spatially localized manner,
maximizing the ability to read back data for a single area of sky without changing tapes.
Replicators are primarily constrained by their output bandwidth, not by the number of cores. Each
replicator job is assigned to one machine; replicators normally execute only one job at a time. The pool
of replicators (and thus distributors, since they are paired) must therefore be at least 21 + 1 machines,
including one for each science raft plus one for the wavefront sensors; 25 is suggested as a minimum to
provide hot spares for possible failures.
2.3.2 Distributor
The distributor's function is to receive raw and crosstalk-corrected images from the replicator, transfer
them to local storage, and repackage them for the Alert Production Cluster workers.
When a distributor boots, it checks its connection with the network, the Archive raw image cache, and
the tape archive. When all connections have tested successfully, the distributor waits for a connection
from its associated replicator.
Upon receipt of a visit id, exposure sequence number, and raft id from the replicator, the distributor
publishes them along with its network address to the Archive DMCS. Workers can connect to the
distributor to request a CCD-sized crosstalk-corrected image.
When a distributor receives a crosstalk-corrected image and associated telemetry from the replicator, it
verifies its integrity using a hash or checksum, writes it to the raw image cache using the Data Access
Client Framework, decompresses it if necessary, separates it into individual CCD-sized portions, and
sends those portions to the appropriate connected workers. When the distributor receives a raw image,
it writes it to the tape system. All images written are tagged with the Archiver mode.
There is one distributor for each replicator.
2.3.3 Worker
The worker's function is to generate Level 1 data products from the images.
When a worker boots, it checks its connection with the network, its local scratch disk, the master
calibration image storage, the calibration database, the template image storage, the calibrated and
difference image cache, the Level 1 database, and the local alert distribution point.
A worker job, which is written using the Pipeline Construction Toolkit, is started with a visit id, the
number of exposures to be taken, a boresight pointing, a filter id, and a CCD id. The job begins by
13
computing a spatial region that covers the expected area of the CCD plus a margin. It then retrieves the
template image (by filter and airmass), Objects (from the last Data Release), DIAObjects, past
DIASources, and SSObjects that overlap that region using the Data Access Client Framework. It also
retrieves the master calibration images appropriate for that CCD and filter. Note that we have the time
from the nextVisit event to the completion of the first exposure of the visit, which is a minimum of
15 seconds, to start the worker job and perform this retrieval.
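The prefetch step just described might be sketched as follows. The region type, margin value, catalog names, and the fetch callable are illustrative placeholders for Data Access Client Framework retrievals.

    from dataclasses import dataclass


    @dataclass
    class Region:
        """Axis-aligned sky region in degrees; a placeholder for a real sky-region type."""
        ra_min: float
        ra_max: float
        dec_min: float
        dec_max: float

        def grown_by(self, margin_deg: float) -> "Region":
            return Region(self.ra_min - margin_deg, self.ra_max + margin_deg,
                          self.dec_min - margin_deg, self.dec_max + margin_deg)


    def prefetch_inputs(ccd_footprint: Region, filter_name: str, fetch):
        """Retrieve templates, catalogs, and master calibrations overlapping the padded footprint."""
        region = ccd_footprint.grown_by(0.05)   # margin value is a placeholder
        return {
            "template": fetch("template", region=region, filter=filter_name),
            "objects": fetch("Object", region=region),        # from the last Data Release
            "dia_objects": fetch("DIAObject", region=region),
            "dia_sources": fetch("DIASource", region=region),
            "ss_objects": fetch("SSObject", region=region),
            "master_calibrations": fetch("calibration", filter=filter_name),
        }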
The job contacts the Archive DMCS to determine the appropriate distributor for the first image for the
visit and raft. This is a blocking call. When the distributor is known, the image is requested from it,
also via a blocking call. After that image, and associated telemetry, has been retrieved and its integrity
verified via hash or checksum, instrument signature removal may be performed, if configured.
Succeeding images are requested in the same way, again by contacting the Archive DMCS and then
the distributor. When the second image of a pair has been received, along with its associated telemetry,
the job performs the Alert Production processing to generate DIASources, update DIAObjects, and issue Alerts.
The Alert Production processing includes elements from the Image Processing Pipelines, Association
Pipelines, Alert Generation Pipeline, Moving Object Pipelines, and Difference Imaging Pipeline. This
includes instrument signature removal (ISR); CCD assembly from constituent amplifiers; cosmic ray
removal and visit image combination; image calibration (WCS, PSF, and background determination);
image differencing with the template; detection and measurement on the difference image; forced
photometry on the calibrated exposure at the positions of the difference image detections; spatial
association of DIASources with SSObjects (at positions interpolated using pre-computed coefficients
and the exact midpoint of the exposure) and DIAObjects; creation of new DIAObjects for any
unassociated DIASources; science data quality analysis (SDQA) on all data products; and generation of
Alerts for all relevant DIASources.
DIASources, DIAObjects, and SSObjects are updated (append-only) in the Level 1 database. Alerts are
sent to the local alert distribution point. The calibrated and difference images are written to their
respective caches. All images written are tagged with the Alert Production Cluster mode. The Data
Access Client Framework is used for all of this output.
Information from the image calibration and SDQA, including the WCS and information about the PSF,
is sent via the DM Event Services to the Base DMCS, which then publishes it via DDS as telemetry.
If the algorithms require communication of data between CCD jobs, either to determine global,
focal-plane-wide values or to retrieve certain data from neighboring CCDs, the DM Inter-Process Messaging
Services are used. These services may be implemented using two technologies, transparent to
application code:
1. The jobs may communicate via the DM Event Services.
2. The jobs may be submitted as an HTCondor MPI universe job and then may communicate via
MPI.
In addition, the worker jobs themselves are likely to (non-transparently) use thread-level parallelism to
achieve sufficient performance while processing the CCD.
Since the worker jobs are expected to take longer than the inter-visit time to run, two “strings” of
worker machines are needed so that one string is available for the current visit while the other is
processing the last visit. These strings are implemented as a double-sized pool of worker machines.
There need to be at least 193 workers per string, or 386 total workers. 400 workers are recommended to
deal with failures, slow processing, or other issues. Each worker executes on a set of cores on one
machine, typically 16 (one for each amplifier within the CCD). Since we are anticipating at least 20
cores per processor and two processors per machine for the pre-commissioning nodes, each machine
would have two workers (plus 8 extra cores for I/O and ancillary tasks). We thus require at least 200
worker machines. While a pool of dedicated Alert Production workers will be available, additional
machines from the Data Release Production cluster may also be used if necessary.
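The sizing argument above reduces to simple arithmetic, restated here with the numbers given in this section (189 science CCDs plus 4 wavefront sensors per string, two strings, 16 cores per worker, and two processors of at least 20 cores per machine):

    workers_per_string = 189 + 4          # one per science CCD plus the wavefront sensors
    strings = 2                           # current visit + previous visit
    min_workers = workers_per_string * strings          # 386; 400 recommended with margin

    cores_per_machine = 2 * 20            # two processors, at least 20 cores each
    cores_per_worker = 16                 # one core per amplifier in a CCD
    workers_per_machine = 2               # 2 * 16 = 32 cores used per machine
    spare_cores = cores_per_machine - workers_per_machine * cores_per_worker   # 8 for I/O etc.

    min_machines = 400 // workers_per_machine            # 200 worker machines
    print(min_workers, spare_cores, min_machines)        # 386 8 200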
If a worker job fails for a non-application reason (i.e. a failure that is expected to be transient and
non-replicable), the job is restarted automatically by the Orchestration Manager on a spare machine. A
restarted job may need to obtain its data from the raw image cache rather than from a distributor.
As the Level 1 data products are generated at the Archive Center, they are replicated to the US DAC
and the Chilean DAC (over the WAN) via DM File System Services and native replication for the
Level 1 database.
2.4 Catch-Up Archiver
The Catch-Up Archiver transfers images from the camera that were not retrieved due to an error or
outage. It also transfers images from the network outage buffer to the Archive Center.
The Catch-Up Archiver has its own replicators and distributors. These nodes communicate similarly to
the replicators and distributors of the Archiver commandable entity.
The Base DMCS scans the Camera buffer for images that have not been archived to tape (or
transmitted over the network). Each of those images triggers a replicator job. The oldest images will be
submitted first. The Base DMCS also scans the network outage buffer for images that were not
transmitted (as recorded in the Alert Production database). Those images also trigger a different
replicator job that retrieves its data from the buffer instead of the camera.
Images handled by the Catch-Up Archiver are not processed by the normal Alert Production Cluster.
The Base DMCS may be configured to submit worker jobs to a separate pool of workers for catch-up
processing of these images.
2.5 Calibration image and engineering image modes
When the DM Archiver and Alert Production Cluster are configured in these modes, there are no visits.
The startExposure event is used to trigger both replicator jobs and worker jobs (although another
event could be used to trigger the workers). Worker processing only performs ISR (often just a subset),
CCD assembly, PSF determination (if appropriate), and a subset of SDQA, as configured for the mode
selected.
2.6 Daytime DM operations mode
In this mode, the Alert Production Cluster is used to perform SSObject detection and orbit fitting
(DayMOPS) and other maintenance tasks, including updating DIAObject and DIASource caches and
projecting SSObject orbits for the next night. The Archiver may be enabled while the Alert Production
Cluster is in this mode, but no processing of any images will occur and the distributors will never
receive requests from the workers. The Catch-Up Archiver may be enabled.
The Base DMCS will submit jobs to the Orchestration Manager as necessary to perform the daytime
tasks.
2.7 Failure Modes
In the event of a failure of the Summit-to-Base network link or Base power and the consequent loss of
DM functionality, the Summit has sufficient analysis capability to be able to proceed with observations
independently, writing images to the CDS buffer. The Catch-Up Archiver will then be used to retrieve
and archive these images when connectivity is restored. No alerts are produced, and no feedback
telemetry from DM goes to the Camera or Telescope, of course.
In the event of a total failure of the Base-to-Archive network link, the replicators will detect loss of
connection to the distributors, register themselves into the local-only pool, and write to the local tape
system and the network outage buffer. The Network Operations team will be notified to investigate and
resolve the issue. The Catch-Up Archiver is again used to retrieve and transmit these images to the
Archive when connectivity is restored. Again, no alerts are produced, and no feedback telemetry goes
from DM to the Camera or Telescope. (If desired, spare hardware at the Base such as the
commissioning compute cluster could be assigned to a worker pool to do a limited amount of
processing to provide feedback telemetry and even some alerts, but this is not part of the baseline.)
In both network failure cases, if the outage is a “black swan” that extends for longer than has been
anticipated in the buffer sizes, media shipping will be used as a backup image transfer channel. Images
and associated telemetry from an EFD replica will be copied onto a disk array (possibly solid state disk)
at the Summit or Base, as appropriate. The array will then be shipped to the Base or Archive,
respectively (and then shipped back once the data has been extracted). Multiple arrays will be required
to handle expected shipping and data transfer times.
In the event of a partial failure (e.g. a slowdown) of the Base-to-Archive network link, the replicator
jobs will detect that they are not completing in the expected amount of time. As they detect this, the
replicator machines will re-register themselves in the local-only pool. If sufficient pairs do so, the
Network Operations team will be notified to investigate and resolve the issue. After random time
intervals, as long as heartbeat messages from their paired distributors continue to be received, the
replicators will re-register themselves in the fully-operational pool so as to enable automatic recovery.
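The automatic-recovery behavior in this paragraph can be sketched as a simple monitoring loop; the pool names come from section 2.3.1, while the timing constants and callables are illustrative assumptions.

    import random
    import time


    def monitor_link(is_slow, heartbeat_ok, register, unregister,
                     min_backoff=60, max_backoff=600):
        """Move a replicator between pools based on observed link health (illustrative only)."""
        pool = "fully-operational"
        while True:
            if pool == "fully-operational" and is_slow():
                unregister("fully-operational")
                register("local-only")
                pool = "local-only"
            elif pool == "local-only" and heartbeat_ok():
                # Wait a random interval before retrying so that the replicators do not
                # all re-register (and possibly re-fail) at the same moment.
                time.sleep(random.uniform(min_backoff, max_backoff))
                if heartbeat_ok():
                    unregister("local-only")
                    register("fully-operational")
                    pool = "fully-operational"
            time.sleep(5)   # polling interval is a placeholder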
For a more comprehensive discussion of network failures and network operations, refer to the LSST
Observatory Network Design (LSE-78) and the LSST Network Operations and Management Plan
(Document-11918).
If a replicator, distributor, or worker dies, a spare will be used automatically by the Orchestration
Manager. If the Base or Archive DMCS, the DM Event Services Broker, or the Orchestration Manager
itself dies, a spare will be brought online. Since these machines maintain little state, a replacement
should be available rapidly without missing many visits.
The network outage buffer is designed to be single-fault-tolerant. If the tape system or shared disk
become unavailable due to faults, the Catch-Up Archiver can be used with the network outage buffer
when they return.
If an EFD replica fails, queries can be directed to the next master up the chain (US DAC to Chilean
DAC, Chilean DAC to Summit) until a new slave can be brought online and synchronized.
If the calibration, Level 1 catalog, or Alert Production control database fails, a hot spare replica will be
reconfigured to be the master.
If the application software fails on a given sensor (or if the sensor itself does not produce data or
produces invalid data), the Alert Production algorithms will be designed to continue processing in that
sensor's absence. Job failures of this type will be communicated to the Orchestration Manager, and the
failed jobs will not be rescheduled.
If the Alert Production workers get behind, the Orchestration Manager will begin to schedule worker
jobs on spare worker hardware. If so configured, it may also schedule jobs on the Data Release general
compute pool. If enough unexecuted jobs pile up because processing is too slow, as determined by
monitoring the Orchestration Manager's queue length, the Base DMCS will kill the oldest unexecuted
jobs to get below threshold. In addition, the Base DMCS configuration will allow sampling of visits for
worst-case scenarios, in which only a fraction of visits actually spawn worker jobs.
2.8 Maintenance and Upgrades
New Alert Production software will be deployed during daytime maintenance periods. Full integration
tests of the new configuration on both a dedicated integration cluster and the production hardware will
be performed before the software is certified to go live for science observing. Each class of machine
(e.g. replicator, distributor, worker, DMCS) will be uniform in terms of software, from the operating
system through the application code. Cluster configuration management software like Chef or Puppet
will be used to enable and ensure this.
The Alert Production compute load does not increase significantly with time. (Only moving object
prediction and association get noticeably harder.) As a result, new hardware will be deployed primarily
to replace failed components and at specified hardware refresh intervals to avoid obsolescence. Full
integration tests of the production software on the new hardware will be performed before science
observing, with fallback to the old hardware in case of difficulty. Since new hardware is expected to
have at least the same performance as old hardware, heterogeneity of hardware within a machine class
will be permitted. This simplifies the upgrade process and avoids the need to change out many
machines at the same time.
3 Calibration Products Production
The Calibration Products Production's primary responsibility is to produce the master calibration
images and calibration database needed to perform instrument signature removal in the Alert
Production and Data Release Production. This includes computation of the crosstalk correction matrix,
which is then delivered to the Camera DAQ. It also has a separate mode for use before the Data
Release Production that computes more detailed per-exposure calibration information based on EFD
telemetry and auxiliary instrumentation (such as the auxiliary telescope spectrograph). It runs
periodically at the Archive as needed depending on the measured stability of the Camera.
In its main mode, the production obtains recent raw calibration images and associated telemetry from
the raw image cache, including bias frames, dark frames (if necessary), flat frames, and fringe frames
(if necessary) using the Data Access Client Framework. Although these images need to be processed
sequentially (so that biases can be removed from flat frames, for example), these images can generally
be processed on a per-CCD (per-sensor) basis, allowing division into 189 (plus 4 for wavefront sensors)
separate jobs. The Archive DMCS submits these jobs to the Orchestration Manager for execution on a
portion of the general Archive compute pool. Each job writes its resulting master calibration images to
the shared disk image storage at the US DAC using the Data Access Client Framework and writes other
information to the calibration database at the US DAC. These master calibration images and database
records are then replicated to the Chilean DAC and the French Center. It is not expected that
inter-process communication (i.e. inter-sensor data movement) will be necessary to produce suitable master
calibration images at the ISR level, though the architecture permits it.
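The per-sensor partitioning described above amounts to generating 189 + 4 independent jobs, each of which reduces its calibration frames in sequence. A minimal sketch, with hypothetical job descriptions and frame types, follows.

    SCIENCE_CCDS = 189
    WAVEFRONT_CCDS = 4
    FRAME_TYPES = ["bias", "dark", "flat", "fringe"]   # processed sequentially within a job


    def calibration_jobs():
        """One master-calibration job per sensor; frames are reduced in order within each job."""
        for ccd in range(SCIENCE_CCDS):
            yield {"sensor": f"science-{ccd:03d}", "frame_types": FRAME_TYPES}
        for wf in range(WAVEFRONT_CCDS):
            yield {"sensor": f"wavefront-{wf}", "frame_types": FRAME_TYPES}


    jobs = list(calibration_jobs())
    assert len(jobs) == SCIENCE_CCDS + WAVEFRONT_CCDS   # 193 independent jobs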
Crosstalk correction matrix computations will initially proceed on a per-CCD basis as well, but they will
require inter-process communication. This will be provided by the Inter-Process Messaging Services.
In its pre-DRP mode, separate jobs will analyze the telemetry in the EFD, including auxiliary telescope
spectra, to determine detailed calibration models. These models include the system bandpass function
for every visit. This information will be written to the calibration database at the US DAC and then
replicated to the Chilean DAC and the French Center. Note that new versions of this information for
every exposure will be calculated each time; old versions will be maintained. These jobs will be
partitioned by time period, allowing parallelism for this operation.
4 Data Release Production
The Data Release Production's primary responsibility is to produce the Level 2 data products for each
Data Release, typically on an annual basis although the first data release will process the first six
months' worth of data.
The Data Release Production operates autonomously and is not under the control of the Observatory
Control System. It is managed by the Archive DMCS, which submits jobs to the Orchestration
Manager for execution on the general Archive compute pool.
The Data Release Production is handled by the following infrastructure components located at the
Archive Center at NCSA, Illinois, USA and at the French Center at CC-IN2P3, Lyon, France:
1. Archive DM Control System
2. Tape archive
3. Shared scratch disk
4. Compute nodes
5. DM Event Services Broker
6. Shared disk for Level 2 data products at the US DAC
7. Level 2 database at the US DAC
8. Data Release Production (control) database
The Archive Center and the French Center are configured identically to minimize software and data
porting difficulties. Each Center controls its own resources; aside from raw data transfer, neither is a
master or a slave to the other. The two Centers communicate via a limited series of data transfer
operations; their Event Brokers do not communicate with each other. Each Center receives all of the
raw data; each processes half of the data, divided spatially, during the Data Release Production.
All DM hardware is monitored by DM system administration tools, which publish results via the
Archive DM Control System. Each machine verifies its software installation on boot (e.g. via hash or
checksum).
4.1 Overall Sequence
Many of the Data Release Production algorithms are expected to involve computations across the full
set of available images, at least in one region of the sky and possibly across the entire survey area. It is
impractical to perform these computations in an incremental fashion. Therefore a “freeze date” must be
chosen which delineates the latest image to be included in the DRP processing.
The complete set of raw images is replicated to the French Center as they arrive and are written to tape.
This replication is separate from and independent of the Alert Production.
After the freeze date is selected, the Calibration Products Production is run in pre-DRP mode, which
recalculates all of the master calibration images and the calibration database to be used for all the
exposures up to that date.
Next, a region of the sky (about 5-10% of the total survey area) is processed through the entire DRP,
treating it as if that were the entire survey. The results of this processing are analyzed and verified to
ensure that the software is performing properly.
Finally, after any software fixes or configuration changes resulting from the single-region analysis, the
entire sky is processed. This involves transferring intermediate and final data products between the
Archive Center and the French Center.
When the complete set of Level 2 data products has been generated, it is transferred to the Chilean
DAC (and any other non-project stand-alone DACs that provide the necessary bandwidth resources).
For the Chilean DAC, this transfer nominally occurs by writing the data products to disk and shipping
the disk to Chile, although an alternative path via high-speed network is being considered.
4.2 Detailed Sequence
The DRP computation can be considered to have several major segments:
1. Image processing
2. Global astrometric and photometric calibration
3. Coaddition, template generation, and difference imaging
4. MOPS
5. Object characterization
6. Global photometric calibration
The initial processing of raw images proceeds in spatial order as tapes are read in (and verified).
Instrument signature removal, CCD assembly, cosmic ray removal, and image calibration all occur,
giving temporary calibrated exposures. Single-frame detection and measurement algorithms are applied.
This results in a catalog of single-frame Sources. The Archive and the French Centers operate on
separate spatial areas, exchanging Source catalogs afterwards.
After all images have been processed, a global astrometric and photometric calibration is performed at
the Archive Center. This process uses the information from the Sources of designated calibration
objects to refine the relative positioning and compute the gray extinction map for each exposure. The
results are used to align images for coaddition and object characterization and to calibrate the
photometric measurements of every Source and, later, Object and ForcedSource. The astrometric and
photometric calibration parameters are transferred to the French Center.
The images are then reprocessed by spatial region, warped, and co-added (using outlier rejection to
avoid moving objects) to form patches of deep coadds and (shallower) templates.
The templates are used for image differencing, detection, and measurement, resulting in DIASources.
These are matched against known SSObjects. Any remaining DIASources are grouped into
DIAObjects.
Single-DIASource DIAObjects are processed by MOPS at the Archive Center to find moving objects,
adding to the list of SSObjects (and removing the DIAObjects).
The remaining DIAObjects are force-photometered on all difference images for which they did not
already have DIASources.
The per-filter coadds are used to generate a chi-squared coadd for deep detection, with masking of
DIAObjects, resulting in a catalog of CoaddSources. The catalogs of CoaddSources, Sources, and
DIAObjects are associated and used to create a master catalog of Objects. This Object catalog is passed
to object characterization, resulting in optimized measurements and a model of each object. The models
are used to perform forced photometry on each calibrated exposure, generating ForcedSources.
During the above processing, the Archive and French Centers operate on separate spatial areas. They
then exchange results (coadds, templates, Sources, Objects, DIASources, DIAObjects, and
ForcedSources), merge the two sets of results, and verify that the merged results are identical.
The Level 2 catalogs are ingested into a temporary Level 2 database at each Center as portions are
generated, and the metadata for the Level 2 image products is also ingested into the database. A final
ingest step loads any remaining SDQA metrics into the database at each Center.
SDQA is performed continuously at each Center at each step as the Level 2 data products are generated.
Metrics from SDQA may be used in succeeding steps (e.g. to avoid low-quality images during
coaddition). Additional SDQA (automated and manual) is performed after the data products are
complete.
The Level 2 data products are sent to the Chilean DAC and installed there. They are copied from Data
Release Production scratch space and the Data Release Production database to the US DAC.
The Level 2 database and images are then released simultaneously at the US and Chilean DACs.
4.3 Parallelization
In order to accomplish the heavy computational load required by the Data Release Production,
parallelization across large numbers of cores is required. Most pipelines are parallelizable over obvious
data units such as images, sky patches, or Objects. MOPS is parallelizable over lunation time periods.
These data units will generate thousands to millions or even billions of independent tasks, which will
be grouped into jobs of appropriate length, on the order of single-digit hours. In particular, the object
characterization may be done on all Objects within a sky patch to minimize I/O of image pixels. These
jobs will be submitted to the Orchestration Manager.
The global astrometric and photometric calibrations involve solving extremely large but sparse matrix
algebra problems. Algorithms for doing these computations are parallelizable but require message
passing as opposed to being independent tasks. These will be written using MPI as wrapped within the
Inter-Process Messaging Services.
The entire DRP computation can be characterized as a static directed acyclic graph (DAG) with data
dependencies as edges. For example, even though a patch of coadd may only end up using a subset of
the overlapping calibrated exposures (due to quality cuts), it must be considered to be dependent on all
of them. (If the coadd requires global astrometric calibration, this will actually include every calibrated
exposure, not just the ones that overlap the patch.) This DAG will be handled implicitly in the coding
of the scripts for performing the major DRP segments. The boundaries between major segments will be
under human operator control, as the progression from one segment to the next is expected to involve
review and acceptance of SDQA results. It is possible for the entire DAG to be handled automatically if
sufficient confidence is established in the processing and algorithms.
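One way such a dependency DAG could be expressed and walked in topological order is sketched below; the segment names are taken loosely from section 4.2, but the structure is a generic illustration rather than the Orchestration Manager interface.

    from graphlib import TopologicalSorter   # Python 3.9+

    # Each key lists the segments it depends on (illustrative segment names and edges).
    dag = {
        "single_frame": [],
        "global_calibration": ["single_frame"],
        "coaddition": ["single_frame", "global_calibration"],
        "difference_imaging": ["coaddition"],
        "mops": ["difference_imaging"],
        "object_characterization": ["coaddition", "difference_imaging"],
        "forced_photometry": ["object_characterization", "global_calibration"],
    }

    for segment in TopologicalSorter(dag).static_order():
        # In practice each segment expands into many independent jobs, and the
        # transition between segments is gated by human review of SDQA results.
        print("ready to run:", segment)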
The start, status, and result information (including timing) for each job will be tracked by the
Orchestration Manager and reported to the Archive DMCS for overall progress monitoring.
4.4 Input and Output
As jobs built using the Pipeline Construction Toolkit execute, they retrieve files from and persist files
to tape, shared scratch disk, and the in-progress Level 2 data products storage using the Data Access
Client Framework. They may also query the Data Release Production database using that framework.
Jobs do not write directly to the database; instead, for efficiency and avoidance of transaction locking,
they write files that are ingested into the database at a later time.
All data products are backed up to tape as they are written to the data products storage.
At the time of the data release, the in-progress data products are made available to the DAC while the
oldest data release (except DR1, which is always kept) is removed.
4.5 Failure Modes
Since the Data Release Production is composed of many restartable jobs, hardware failures are
typically handled by rescheduling on another node from the general compute pool. The shared disk
systems are designed to be at least two-fault-tolerant (RAID 6).
If the application software fails on a given data element, which is expected to be a rare occurrence due
to the planned implementation of automatic adaptive parameter configuration, the production can be
instructed to continue without those results or a manual execution of the failed job with new
configuration parameters can be performed. Such parameters are recorded for provenance purposes.
4.6 Maintenance and Upgrades
New Data Release Production software will be frozen at or before the data “freeze date”. While the
software will have been tested on the dedicated integration test cluster, the initial 5-10% region
processing also serves as a final verification pass on the production hardware. Each class of machine
(e.g. replicator, distributor, worker, DMCS) will be uniform in terms of software, from the operating
system through the application code, and the software at the Archive Center and French Center will be
identical. Cluster configuration management software like Chef or Puppet will be used to enable and
ensure this.
New hardware will be deployed throughout the course of the survey to add capacity, to replace failed
components, and at specified hardware refresh intervals to avoid obsolescence. New compute machines
will be tested as part of the development and integration clusters before deployment in the operational
compute cluster. Compute cluster machines targeted for removal will be drained of jobs, de-registered
from the general compute pool, and then shut down. Disks will be handled similarly, with logical
volumes expanded across new disks and contracted away from disks to be removed. In addition, there
is downtime between the completion of one DR and the beginning of processing for the next in which
major upgrades, especially to central services like the local area network, can be performed.
5 Data Access Center
The Data Access Center is composed of the following components:
1. Level 1 database
2. Engineering and Facilities Database replica
3. Daily Level 1 database snapshot
4. Query Services database for Level 2 catalogs
5. Calibration database
6. Raw image cache
7. Calibrated image cache
8. Difference image cache
9. Coadd and template image storage
10. Raw and master calibration image storage
11. Image regeneration service
12. Image cutout service
13. Level 3 database
14. Level 3 file storage
15. Level 3 compute cluster
Note that the tape archive system is considered part of the Archive Center and not part of the US Data
Access Center.
5.1 Databases
The Level 1 database will be updated in an append-only fashion in real time during observing as
DIASources and DIAObjects are identified. It is intended to be queried for individual objects and small
cone searches, not for large statistical queries.
The Engineering and Facilities Database replica will also be updated continuously in (near) real time.
At the beginning of the daytime DM operations mode processing, the Level 1 database will be
snapshotted to a static replica. This replica may be used for statistical queries and analysis.
The Level 2 catalogs will be stored in a large-scale, parallel, distributed database to provide sufficient
performance for not only full-table-scan queries but also near-neighbor queries.
The calibration database will be updated, in append-only fashion, at the conclusion of each execution
of the Calibration Products Production.
All databases provide SQL interfaces (possibly with extensions or limitations) through standard APIs
(ODBC, mysqlapi, Python DB-API). In addition, an ADQL adapter layer may be provided, as well as
adapters to produce FITS table or VOTable results. These adapters will be manifested as Web services.
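As an illustration of access through the Python DB-API, the sketch below performs a small cone search against the Level 1 DIAObject table. The database driver, table name, and column names (ra and decl in degrees) are assumptions for illustration only, not a definitive interface.

"""Illustrative Python DB-API cone search; schema and driver are assumed."""

import math

import MySQLdb  # any DB-API 2.0 driver could be substituted


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h)))


def cone_search(conn, ra_deg, dec_deg, radius_deg):
    """Return DIAObject IDs within a small cone: a bounding-box pre-filter
    in SQL followed by an exact angular-distance check in Python."""
    dec_pad = radius_deg
    ra_pad = radius_deg / max(math.cos(math.radians(dec_deg)), 1e-6)

    cur = conn.cursor()
    cur.execute(
        "SELECT diaObjectId, ra, decl FROM DIAObject "
        "WHERE ra BETWEEN %s AND %s AND decl BETWEEN %s AND %s",
        (ra_deg - ra_pad, ra_deg + ra_pad, dec_deg - dec_pad, dec_deg + dec_pad),
    )
    return [obj_id for obj_id, ra, decl in cur.fetchall()
            if angular_separation(ra_deg, dec_deg, ra, decl) <= radius_deg]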
Bulk downloads of database information, including databases from older Data Releases, will be
provided in at least the same form as the tape backups of each database, either through separate Web
services or through standard file transfer protocols (e.g. rsync, Globus Online).
5.2 Image Storage
Caches are maintained for raw, calibrated, and difference images, and the full set of raw images is
accessible from the tape system at the Archive and French Centers.
Recent coadd, template, and raw and master calibration images are stored on disk. Historical versions
are accessible from tape.
An image regeneration Web service is provided to produce calibrated and difference images on demand.
This service accesses the raw, master calibration, and template images from DAC disk or from tape as
required; latency is necessarily greater when tape is involved.
The primary access to the image storage and regeneration service occurs through the image cutout Web
service. This takes a request with appropriate parameters such as image type, date, spatial region on the
sky, sensor identification, etc. and retrieves the appropriate image(s), trimming and mosaicking them as
necessary. It can also be used to retrieve stacks of images (“postage stamps”) of a particular object each
time it has been observed. The image cutout service has space dedicated to it in each of the raw,
calibrated, and difference image caches so that multiple requests for the same area can be handled
rapidly.
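The sketch below shows how a client might call the image cutout Web service over HTTP. The endpoint URL and parameter names are hypothetical; the actual request interface is not specified in this document.

"""Hypothetical client call to the image cutout Web service; the endpoint and
parameter names are illustrative only."""

import urllib.parse
import urllib.request


def fetch_cutout(base_url, image_type, ra_deg, dec_deg, width_arcsec,
                 height_arcsec, mjd=None, out_path="cutout.fits"):
    """Request a trimmed/mosaicked cutout around a sky position."""
    params = {
        "type": image_type,   # e.g. "raw", "calibrated", "difference", "coadd"
        "ra": ra_deg,
        "dec": dec_deg,
        "width": width_arcsec,
        "height": height_arcsec,
    }
    if mjd is not None:
        params["mjd"] = mjd   # select the epoch for per-visit image types

    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp, open(out_path, "wb") as out:
        out.write(resp.read())   # the service is assumed to return a FITS image
    return out_path


# Example (hypothetical endpoint): a 30" x 30" calibrated-image cutout.
# fetch_cutout("https://dac.example.org/cutout", "calibrated", 150.1, 2.2, 30, 30)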
Additional bulk download interfaces, e.g. to Education and Public Outreach, will be provided through
separate Web services and standard file transfer protocols.
5.3 Level 3 Storage and Compute
DAC users may be allocated database and file storage space for use by their own, possibly proprietary,
computations. At least part of the database space will be distributed on the same nodes as the Level 2
query services database to facilitate joins with the large Object, Source, and ForcedSource catalogs.
Transfer of data to and from Level 3 file storage will be via standard file transfer protocols.
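As an illustration of the intended benefit of this co-location, the sketch below joins a hypothetical user-owned Level 3 table against the Level 2 Object catalog through the Python DB-API; the table and column names are assumptions for illustration.

"""Hypothetical Level 3 join against the co-located Level 2 Object catalog."""

import MySQLdb  # any DB-API 2.0 driver could be substituted


def match_user_sample(conn, user_table="user_jdoe.my_candidates"):
    """Join a user-supplied candidate list against the Object table.

    The user table and column names are illustrative; identifiers cannot be
    passed as DB-API parameters, so the table name is interpolated directly
    (acceptable only because this is a sketch).
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT c.candidateId, o.objectId, o.ra, o.decl "
        "FROM " + user_table + " AS c "
        "JOIN Object AS o ON o.objectId = c.objectId"
    )
    return cur.fetchall()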
A compute cluster is provided for Level 3 usage. An Orchestration Manager instance will control
access to these nodes by pipelines written using the Pipeline Construction Toolkit that use the Data
Access Client Framework for access to Level 1 and Level 2 data products. Level 3 pipelines may also
use non-LSST-provided libraries, but only if that code can be installed with ordinary user privileges.
Allocations of space and compute resources to users will be performed according to project policy and
enforced by resource management tools as part of the storage infrastructure and Orchestration Manager.
5.4 Failure Modes
The smaller databases will be replicated to slave backups that can be reconfigured as masters in the
event of a failure.
The large Query Services database is designed to be at least single-fault-tolerant⁸, as is the shared disk
used by the image caches and storage.
The image regeneration and cutout services will be replicated across multiple nodes in a standard
load-balanced Web service configuration.
5.5 Maintenance and Upgrades
When a new data release is to be published, data is copied from the Archive Center to the Data Access
Center (by local network, wide-area network, or disk shipping). The data is incorporated into the
databases and image storage but not yet made accessible to external users. The new release is then
tested to verify completeness, consistency, and accessibility. Finally, external user access is
enabled. Access to the third-oldest release (except the first, which is always retained) is then
disabled, and its space is reclaimed.
Level 1 database maintenance, including modifications to its schema, will occur on the Archive Center
copy during the day with replication to the DAC copies during a daily maintenance period before the
start of nighttime observing. Level 2 databases in new data releases need not have the same schema as
in previous data releases. Further details on database schema evolution are in the LSST Database
Design document (LDM-135).
When new versions of the DAC software services are to be deployed, they will be brought up on the
same hosts as the operational services, but using different network ports. If the new versions require
different internal data formats, additional reserved space on the storage media (tape, shared disk, local
disk) will be used to hold the transformed data. After the new services are tested, the old and new
versions will be swapped. Once the new version has been proved in operation, the old service will be
disabled and any space used will be reclaimed.
As for the Data Release Production, new hardware will be deployed throughout the course of the
survey to add capacity, to replace failed components, and at specified hardware refresh intervals to
avoid obsolescence. Hardware for small databases will be brought up as slaves and then, during a brief
shutdown, converted to masters. Hardware for the Query Services database will be brought up as
additional replicas of current data and then added to the database cluster. Query Services database nodes
to be retired can simply be shut down and removed. Web services hardware can be deployed and retired as
needed. Image storage will be handled using logical volumes, expanding them across new disks and
contracting them away from disks to be removed.
⁸ LSST Database Design, LDM-135.
6 Appendix: Abbreviations
CC-IN2P3 = IN2P3 Computing Center
CCD = charge-coupled device (sensor)
CDS = Camera Data System (also known as DAQ for data acquisition)
DAG = directed acyclic graph
DIA = difference imaging analysis
DIAObject = DIA (variable) object
DIASource = DIA source (measurement of DIAObject)
DM = LSST Data Management
DMCS = DM Control System
DMS = Data Management System
DR = Data Release
EFD = Engineering and Facilities Database
IN2P3 = Institut national de physique nucléaire et de physique des particules
ISR = instrument signature removal
LSST = Large Synoptic Survey Telescope
MOPS = moving object processing system
NCSA = National Center for Supercomputing Applications
OCS = Observatory Control System
PSF = point-spread function
SSObject = solar system object
WCS = world coordinate system