LSST Compute Sizing Explanation LDM-140
7/12/2011
Change Record

Version  Date       Description       Owner name
1        1/30/2006  Initial version   KT Lim, Chris Smith, Tim Axelrod, Gregory Dubois-Felsmann, Mike Freemon
2        7/12/2011  Complete rewrite  Kian-Tat Lim
Table of Contents

Change Record
1. Introduction
2. Definitions
   2.1. Tcyc = trillion clock cycles
   2.2. TFLOPS = trillion floating point operations per second
   2.3. GB/core = gigabytes per core
3. Inputs
4. Calibrated Science Image Generation
   4.1. Crosstalk Removal
   4.2. Instrument Signature Removal and CCD Assembly
   4.3. Cosmic Ray Split
   4.4. Image Characterization
   4.5. Subtotals
5. Alert Production
   5.1. Difference Imaging
   5.2. Fake Object Insertion
   5.3. DIASource Detection and Measurement
   5.4. NightMOPS
   5.5. Alert Generation
   5.6. SDQA
6. Data Release Production
   6.1. Sequencing
   6.2. Source Detection and Measurement
   6.3. Co-addition
   6.4. Object Characterization and ForcedSource Measurement
   6.5. Astrometric Calibration
   6.6. Photometric Calibration
   6.7. Difference Imaging and SDQA
7. MOPS
8. Calibration Products Production
   8.1. Monthly
   8.2. Annual
9. On-Demand (Image Query)
10. L3 Community
11. Cutout Service
12. EPO Service
13. Memory
1. Introduction

This document describes the assumptions and calculations behind the estimates of CPU resources required for the LSST Data Management (DM) system in production. These resources include floating point operations per second (FLOPS), random-access memory
(RAM) per core, and memory bandwidth. Spares, operational overhead, contingency factors, and the commissioning period prior to production operations are not included here; they are instead covered in Document-6284.
CPU resources are allocated to the Archive Site and the Base Site and are computed per survey year.
CPU requirements are calculated for the three LSST productions: the nightly Alert Production that generates near-real-time alerts for varying and moving objects as well as (in a separate mode at the Archive Site) updates the up-to-date catalog; the annual (except the first year) Data Release (DR) Production that reprocesses all survey data into catalogs and co-adds of various kinds; and the Calibration Products Production that runs on a monthly and annual basis to produce master calibration images and the Calibration Database. In addition, separate calculations are done for smaller, but distinctive, portions of the overall DM system, including the Moving Object Prediction System (MOPS), computing to satisfy image queries (but not database queries, which are covered in Document-1779 with explanations in Document-1989), “Level 3” (L3) computational resources for the community, and generation of products for Education and Public Outreach (EPO).
The corresponding detailed spreadsheet can be found in Document-2116.
2. Definitions

2.1. Tcyc = trillion clock cycles
Current and future computing amounts are estimated from prototype implementations by measuring compute times and multiplying by the CPU per-core clock speed, producing a number of clock cycles. This is acknowledged to be a very crude estimate, but it is likely to be a conservative one, as work done per clock cycle is more likely to increase than decrease. Note that this trillion is 10^12, not 2^40.
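As a minimal sketch of this conversion (the run time and clock speed below are illustrative placeholders, not measurements from Document-2116):

```python
# Sketch of the Tcyc estimate: measured wall time times per-core clock speed.
# The inputs below are illustrative placeholders, not values from the spreadsheet.

def tcyc_from_timing(wall_time_sec: float, clock_ghz: float) -> float:
    """Return trillions of clock cycles (10^12, not 2^40) for one measured run."""
    cycles = wall_time_sec * clock_ghz * 1e9   # seconds * cycles per second
    return cycles / 1e12                       # express in Tcyc

# Example: a hypothetical 30 sec prototype run on a 2.8 GHz core.
print(tcyc_from_timing(30.0, 2.8))             # -> ~0.084 Tcyc
```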
2.2. TFLOPS = trillion floating point operations per second
Note that we are making the assumption that 1 Tcyc of processing in 1 sec is approximately 2.6 TFLOPS, based on the mean efficiency of TOP500 supercomputers executing the LINPACK benchmark in June 2011 (http://top500.org/lists/2011/06). Our code is likely to be much less efficient, so this is a conservative estimate of the TFLOPS required. On the other hand, if future architectures do not balance increases in floating point speed with increases in integer computation and I/O, our efficiency may go down. This trillion is also 10^12, not 2^40.
2.3. GB/core = gigabytes per core

This should officially be GiB/core (gibibytes/core) to reflect the fact that it is 2^30, not 10^9, bytes of
RAM per core, but the familiar label is used instead, even though all calculations use the power of 2. Per-core numbers are used assuming that tasks can be parallelized well across cores, which is true of most current algorithms (exceptions are noted below).
3. Inputs

Most inputs to this calculation are taken from the science requirements and system specifications (Document LSE-81).
The storage spreadsheet (Document-1779) provides the number of galaxies and stars observed by the telescope at the end of each survey year.
Other inputs include:
• Timings for components of the Alert Production
• Various algorithm design assumptions
• Prototype implementation characteristics, including clock speeds for the CPUs on which they were run and optimization factors reflecting projected code improvements
4. Calibrated Science Image Generation

We are not permanently storing calibrated science images (CSIs) on disk or tape, so they must be regenerated on demand. Since this step is common to many processes, its estimates are detailed on their own worksheet.
4.1. Crosstalk Removal
Crosstalk Removal has not been prototyped. The Tcyc value used was derived from “SM xtalk” running in 30 sec on a 2.8 GHz CPU on an 8192 by 8192 pixel image.
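For illustration, such a measurement might be turned into a per-pixel cost and scaled to a single CCD roughly as follows; the 4096 by 4096 sensor size is an assumption here, not a value taken from the spreadsheet:

```python
# Scale the "SM xtalk" measurement (30 sec at 2.8 GHz on an 8192 x 8192 image)
# to a per-pixel cost, then to a single CCD.  The 4096 x 4096 CCD size is an
# illustrative assumption.

measured_sec, clock_ghz = 30.0, 2.8
measured_pixels = 8192 * 8192

tcyc_measured = measured_sec * clock_ghz * 1e9 / 1e12      # ~0.084 Tcyc
cycles_per_pixel = tcyc_measured * 1e12 / measured_pixels  # ~1250 cycles/pixel

ccd_pixels = 4096 * 4096                                   # assumed sensor size
tcyc_per_ccd = cycles_per_pixel * ccd_pixels / 1e12
print(round(tcyc_per_ccd, 3))                              # ~0.021 Tcyc per CCD
```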
4.2. Instrument Signature Removal and CCD Assembly
Tcyc values were derived by measuring the timings for the ISR algorithms during Data Challenge
(DC) 3b Performance Test (PT) 1.1 and multiplying by the clock speed of the processor used.
4.3. Cosmic Ray Split
This step detects and removes cosmic rays and combines the two images from each visit to achieve the desired single-image depth.
For the purpose of these computation estimates, the most complex proposed algorithm was used. The two images in each visit are processed separately through background subtraction and an initial stage of cosmic ray removal. These timings are based on DC3b PT1.1. The images are then subtracted (using a simple pixel-by-pixel algorithm, assumed to be similar to dark subtraction) and further cosmic rays are detected on the difference image (assumed to be similar to detecting image characterization sources). These are masked out as the two images are co-added to form the final visit image (assumed to be similar to flat subtraction, although obviously with opposite sign).
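A sketch of how the composite per-visit cost is assembled from these pieces; all Tcyc values below are placeholders standing in for the PT1.1 timings:

```python
# Cosmic ray split cost per visit, assembled from per-snap and per-visit pieces.
# All Tcyc values below are illustrative placeholders, not PT1.1 measurements.

per_snap = {
    "background_subtraction": 0.02,   # Tcyc per snap (placeholder)
    "initial_cr_removal":     0.03,   # Tcyc per snap (placeholder)
}
per_visit = {
    "snap_subtraction": 0.01,         # like dark subtraction (placeholder)
    "cr_detection":     0.04,         # like image characterization detection
    "snap_coaddition":  0.01,         # like flat subtraction, opposite sign
}

snaps_per_visit = 2
total_tcyc = snaps_per_visit * sum(per_snap.values()) + sum(per_visit.values())
print(total_tcyc)   # Tcyc per visit for the cosmic ray split step
```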
4.4. Image Characterization
Bright sources are detected and measured. The measurements are used to determine the PSF, the aperture correction, an accurate WCS, and the photometric zeropoint for the CCD.
Tcyc values were derived by measuring the timings for these algorithms during DC3b PT1.1 and multiplying by the clock speed of the processor used.
4.5. Subtotals
Subtotals are derived for ISR (labeled as Calibration Preparation Total), all single-snap processing except crosstalk correction (Single Snap Post-Crosstalk Total), and all calibrated science image processing through cosmic ray split handling (TOTAL Xtalk/ISR/CR only).
5. Alert Production

The Alert Production at the Base Site is required to send alerts within 1 minute of the end of a visit. At the Archive Site, it has 24 hours to reprocess a night's data and use it to update the up-to-date catalog. This production is composed of several pipelines: CSI Generation, Difference Imaging, DIASource Detection and Measurement, NightMOPS, Source Association, and Alert Generation. In addition, a Fake Object Insertion step is proposed to enable completeness and science data quality analysis (SDQA). Timings for these pipelines are taken from DC3b PT1.1 except as described below.
5.1. Difference Imaging
The timing for this pipeline was taken from DC3a.
5.2. Fake Object Insertion
The timing for this is based on “SDSS fakes” taking 30 sec on a 2.8 GHz CPU on a 2048 by 1489 pixel image. It is assumed that only half of the CCDs will have fakes added.
5.3. DIASource Detection and Measurement
It is assumed that detecting and measuring on a difference image requires the same amount of time as detecting and measuring on a calibrated science image.
5.4. NightMOPS
NightMOPS calculates, for each visit, the projected position of each known moving object that is potentially visible in the field of view. The algorithm for this is likely to be a simple polynomial computation based on parameters computed during the day, but we take a more conservative approach and assume that every moving object (not just those visible) will have a full, accurate orbit determination calculation applied.
The Tcyc for this algorithm is based on a prototype calculation that scales with the number of objects and the number of days that the orbit must be projected from a known epoch (taken to be an average of 0.5, assuming some precomputation during the day). The time available for this calculation starts when the pointing event becomes available from the Observatory Control System (OCS); results must be delivered by the time the Source Association step is ready to run.
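A sketch of the assumed scaling; the per-object cycle count is a placeholder rather than the prototype's measured value, while the 0.5-day mean projection interval is taken from the text above:

```python
# NightMOPS cost per visit: a full orbit projection for every known moving
# object, scaled by the mean number of days from the last precomputed epoch.
# CYCLES_PER_OBJECT_DAY is an illustrative placeholder, not the prototype value.

CYCLES_PER_OBJECT_DAY = 5.0e4   # placeholder: cycles per object per day projected

def nightmops_tcyc(n_known_objects: int, mean_days_from_epoch: float = 0.5) -> float:
    """Tcyc per visit to project all known moving objects to the visit epoch."""
    return n_known_objects * mean_days_from_epoch * CYCLES_PER_OBJECT_DAY / 1e12

# Example: a hypothetical catalog of 10 million known moving objects.
print(nightmops_tcyc(10_000_000))   # Tcyc per visit
```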
5.5. Alert Generation
This pipeline uses the measured properties of the DIASource and the history of its associated
Object (determined by Source Association) to categorize the DIASource, filter it out if uninteresting, and otherwise generate an appropriate alert. No prototype has been developed for this algorithm; instead, a certain amount of time (9 sec) in the Alert Production sequence has been reserved for it and is therefore unavailable for other computing.
5.6. SDQA
The existing automated SDQA processing time is used to project Tcyc per visit. While this code is in Python and hence can be sped up considerably, additional QA tests will undoubtedly be added.
6. Data Release Production

The Data Release Production reprocesses all survey data to generate Object, Source, and ForcedSource catalogs, templates for difference imaging, and deep co-adds. It executes once per survey year except for the first, when two data releases are performed.
6.1. Sequencing
It is assumed that after each year's (6 months' for the first DR) data has been collected, the input data for the DR is “frozen” and production begins. The complete production is allowed to take up to one year (6 months for each of the first two DRs) before delivering its results. This includes SDQA during the production to catch any problems early, while they can still be fixed, in addition to SDQA after the production to provide quality flags and assessments of the data products. The shortened 6-month time period for DR2, which is processing a year's worth of data, increases the TFLOPS rate required for its computations by a factor of 2.
6.2. Source Detection and Measurement
These were measured during DC3b PT1.1.
6.3. Co-addition
The Tcyc for co-addition is assumed to be the same as that for difference imaging. Instead of warping an image to a template and subtracting, co-addition involves warping the image to the co-add and adding.
The deep per-band and pan-chromatic chi-square co-adds include every visit (with each visit participating in 2 co-adds). For the per-band, per-airmass templates, we assume that only a sample (10) of the best visits will be used, and that the sample can be identified based on image metadata, without per-pixel computations. Any outlier rejection that occurs during the template co-addition is assumed to be negligible. This makes the template computation dependent on the number of output pixels. Templates are generated over two-year intervals.
6.4. Object Characterization and ForcedSource Measurement
This pipeline is assumed to use a full forward modeling algorithm based on individual observations. A prototype for this step has been built and scales linearly with the number of Sources. We assume that this algorithm will only be applied to a subset of Objects where it is required. The remaining Objects will be measured on the co-adds or through less computationally-intensive algorithms.
The current prototype is not optimized, and the developer believes that it can be sped up by “a factor of a few” (assumed to be 2X) by being more cache-aware. An effective PSF order of 6 instead of 10 would speed the code up by a factor of 16. Smaller average postage stamps (20x20 instead of 30x30) could add a factor of 2, and fewer iterations in the optimizer would also add “a substantial factor” (also assumed to be 2X). The algorithm is highly suitable for GPU processing.
Combining these, an overall optimization factor of 120 is reasonable.
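The individual estimates combine multiplicatively; rounding the product down slightly gives the quoted overall factor:

```python
# Combine the individual Multifit optimization estimates multiplicatively.
speedups = {
    "cache-aware rewrite":           2,    # "a factor of a few", taken as 2X
    "PSF order 6 instead of 10":     16,
    "20x20 instead of 30x30 stamps": 2,
    "fewer optimizer iterations":    2,    # "a substantial factor", taken as 2X
}

product = 1
for factor in speedups.values():
    product *= factor

print(product)   # 128, quoted conservatively as an overall factor of ~120
```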
The input to this pipeline involves detecting and measuring Objects on co-adds. We assume that each deep per-filter co-add and the pan-chromatic co-add will need to be measured and that the measurement scales by the number of pixels, which are four times denser than CSI pixels.
Deblending will occur while measuring on the co-add. This process is expected to add approximately 10% to the measurement time, based on experience from SDSS.
6.5. Astrometric Calibration
This is assumed to work on patches of size four times the field of view, solving a matrix in each patch. An O(N^2) dependency is assumed, with N determined by the number of Objects with detectable Sources per patch and the number of astrometric calibration parameters computed.
6.6. Photometric Calibration
A prototype solver was built that calibrated 1 million stars in 1.2 million patches covering the sky for 2 years of simulated survey. This took an hour to run on a desktop machine. Since the production number of stars will be about 100 per patch, or 120 times as many, and we have 5 times as many years, we scale this by half of the square of the ratio of production observations to prototype observations (half because of the symmetry of the matrix).
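A worked version of this scaling, expressed in hours of the prototype desktop's throughput:

```python
# Scale the photometric calibration prototype (1 hour for 1 million stars over
# 2 years of simulated survey) to production: ~100 stars per patch across the
# 1.2 million patches (120x the stars) and 5x as many years.  Cost scales as
# half the square of the observation ratio because of the matrix symmetry.

prototype_hours = 1.0
obs_ratio = 120 * 5                  # production / prototype observations
scale = 0.5 * obs_ratio ** 2         # half of the square of the ratio

production_hours = prototype_hours * scale
print(production_hours)              # 180000.0 hours of the prototype desktop
```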
6.7. Difference Imaging and SDQA
These are assumed to be the same as in the Alert Production.
7. MOPS

The moving object prediction system connects DIASources into tracklets (within a single night) and tracks (over several nights), eventually computing orbital elements. While this process could involve a potentially exponential search through all possible combinations of DIASources, setting reasonable limits on velocities and accelerations can reduce the search space considerably.
Since the algorithm analyzes patches of sky and windows in time, it is expected to scale linearly in both spatial and temporal extent. It should also be possible to parallelize over such patches and windows. In addition, a spatially-bounded search algorithm should be distributable relatively efficiently across multiple nodes.
A prototype implementation of the algorithm was developed and timed while running on a 15 degree by 15 degree patch of sky using DIASources derived from a full-density solar system model over a one year period. The scaling in the number of DIASources is expected to be cubic, based on tests with the prototype. This quantity will be increased over the prototype since the production version will not be operating solely on DIASources from moving objects; noise and transient non-moving objects may also be included. The most conservative estimate is that moving object DIASources will make up only 16% of all MOPS-processed DIASources, corresponding to 20,000 “noise” DIASources per visit. The C++ OrbFit prototype code is not particularly optimal; the FORTRAN implementation is at least 5 times faster, so a similar optimization factor has been applied.
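A sketch of how these factors combine into a scale factor relative to the prototype timing (the prototype's own Tcyc is left out; only the relative factor is computed):

```python
# Scale the MOPS prototype timing to production.  Runtime is taken to scale
# cubically with the DIASource density; only 16% of production DIASources are
# real moving objects, and the FORTRAN OrbFit code is ~5x faster than the C++
# prototype used for the measurement.

moving_fraction = 0.16                   # moving-object share of all DIASources
density_ratio = 1.0 / moving_fraction    # production density vs. the moving-only prototype
cubic_growth = density_ratio ** 3        # ~244x from the cubic DIASource scaling

optimization = 5.0                       # FORTRAN OrbFit vs. C++ prototype
net_scale = cubic_growth / optimization

print(round(net_scale, 1))               # ~48.8x the prototype's per-area, per-year cost
```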
8. Calibration Products Production

8.1. Monthly
Every month, the previous month's raw calibration images are assumed to be processed through the full ISR pipeline. (In actuality, calibration images used earlier in the ISR pipeline require less processing than those used later.) The resulting processed images are then co-added to form the master calibration images used during the month's Alert Production. Processing of Engineering and Facilities Database information into appropriate Calibration Database information is also assumed to occur at this time but at negligible processing cost.
8.2. Annual
Before each Data Release Production, every month's master calibration images are regenerated using the latest software. Improved calibration data is written to the Calibration Database.
9. On-Demand (Image Query)

The computation required here is to regenerate CSIs from raw image data. Only CSIs that are not already in the image cache are generated. The number of CCDs required is assumed to be 1 for so-called low-volume queries and 189 (a full field-of-view) for high-volume queries. The CSIs are then cached for future use and passed to the cutout service.
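A sketch of how the daily regeneration load might be assembled; the query volumes, cache hit rate, and per-CCD cost below are illustrative placeholders, not spreadsheet inputs:

```python
# Daily CPU load for on-demand CSI regeneration.  Query volumes, the cache hit
# rate, and the per-CCD cost below are illustrative placeholders.

TCYC_PER_CCD = 0.5          # placeholder: Tcyc to regenerate one CSI CCD
CACHE_HIT_RATE = 0.5        # placeholder: fraction already in the image cache

low_volume_queries_per_day = 10_000   # placeholder; 1 CCD each
high_volume_queries_per_day = 100     # placeholder; 189 CCDs (full field of view)

ccds_requested = low_volume_queries_per_day * 1 + high_volume_queries_per_day * 189
ccds_regenerated = ccds_requested * (1.0 - CACHE_HIT_RATE)

daily_tcyc = ccds_regenerated * TCYC_PER_CCD
print(daily_tcyc)            # Tcyc per day of on-demand CSI regeneration
```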
10. L3 Community

10% of the Data Release TFLOPS is allocated to L3 Community Service Level 1 computing needs, unless a larger minimum is specified in the Community Access White Paper. Similarly, 2% of the Data Release Production TFLOPS is allocated to L3 Service Level 2 computing needs, unless a larger minimum is specified. SL3 and SL4 needs are taken directly from the white paper.
11. Cutout Service

This computation extracts pixels from already-generated CSIs, templates, or co-adds. It is assumed that this will involve selecting subareas from images (at near-zero computational cost), warping them to a common pixel grid (using a convolution), and compositing the results.
12. EPO Service

This computation combines co-add pixels from 6 different filters and perhaps a pan-chromatic co-add into an RGB JPEG-compressed image mosaic for EPO. A plausible estimate is chosen for the number of cycles needed to generate a result pixel from the input co-add pixels.
13. Memory

Memory bandwidth is estimated for all components based on a simulation of cache accesses while running the DC3b CSI generation and single-frame source measurement pipeline on one CCD. The simulation determined the number of cache misses per instruction executed; this was then conservatively translated into a maximum number of bytes retrieved from memory per FLOP. Future algorithms are expected to have better cache behavior than the prototype.
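A sketch of how such a simulation result translates into a bandwidth requirement; the miss rate, cache-line size, and instruction mix below are placeholders rather than the DC3b simulation outputs:

```python
# Translate simulated cache behavior into a memory bandwidth requirement.
# The miss rate, cache-line size, and instructions-per-FLOP values are
# placeholders, not the DC3b simulation outputs.

misses_per_instruction = 0.002   # placeholder from a cache simulation
cache_line_bytes = 64            # typical cache-line size (assumption)
instructions_per_flop = 2.0      # placeholder instruction mix

bytes_per_flop = misses_per_instruction * cache_line_bytes * instructions_per_flop

def bandwidth_gb_per_s(tflops: float) -> float:
    """Memory bandwidth (GB/s) needed to sustain the given TFLOPS."""
    return tflops * 1e12 * bytes_per_flop / 1e9

print(bandwidth_gb_per_s(1.0))   # GB/s per sustained TFLOPS with these assumptions
```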
For the Alert Production, we conservatively assume that cores are devoted to CCDs, not amplifiers. Each core requires the master calibrations for its CCD to be in memory. Since filters will be changed continually during the night, we assume that the flat and fringe master calibration images for all filters must be accessible. On the other hand, templates need to be loaded dynamically since they are position-dependent as well as filter-dependent. We thus include memory for two template images, one in use and one being pre-loaded based on the next telescope pointing. Space is also allocated for the two visit images and one temporary image.
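A sketch of the per-core accounting described above; the image dimensions, byte width, and filter count are assumptions for illustration only:

```python
# Alert Production memory per core (one core per CCD).  Pixel counts, bytes
# per pixel, and the number of filters are illustrative assumptions.

ccd_pixels = 4096 * 4096               # assumed sensor size
bytes_per_pixel = 4                    # assumed 32-bit image planes
n_filters = 6

gib = 2 ** 30
image_gib = ccd_pixels * bytes_per_pixel / gib

per_core_gib = (
    2 * n_filters * image_gib          # flat and fringe masters for all filters
    + 2 * image_gib                    # one template in use, one pre-loading
    + 2 * image_gib                    # the two visit (snap) images
    + 1 * image_gib                    # temporary working image
)
print(round(per_core_gib, 3))          # ~1.06 GiB per core with these assumptions
```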
The Data Release Production has three large memory-consuming pipelines: Calibrated Science
Image Generation/Single Frame Measurement, Co-addition, and Object Characterization. The memory consumption per core is taken to be the maximum of the three.
The CSI+SFM pipeline potentially requires the same memory as the Alert Production, except that only one filter's worth of master calibrations is needed at any time and no templates are used.
Co-addition may use two template-sized images (the current co-add and the result), as well as a temporary image.
The Multifit algorithm for Object Characterization works on stacks of “postage stamps” around objects, fitting a number of variables to the time-dependent pixel values. The depth of a stack will be greater than average in the special “deep drilling” fields that receive more visits. We allow sufficient memory for every pixel in every visit to have the estimated number of variables associated with it.
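A sketch of this allowance; the stamp size, stack depth, variable count, and byte width are illustrative assumptions:

```python
# Memory allowance for one Multifit stack: every pixel of every visit's postage
# stamp carries the estimated number of fit variables.  All numbers below are
# illustrative assumptions, not the spreadsheet inputs.

stamp_pixels = 30 * 30          # assumed average postage-stamp size
visits_in_stack = 1000          # assumed stack depth (deeper in deep-drilling fields)
variables_per_pixel = 10        # assumed number of associated variables
bytes_per_value = 4             # assumed single precision

stack_bytes = stamp_pixels * visits_in_stack * variables_per_pixel * bytes_per_value
print(stack_bytes / 2 ** 20)    # ~34 MiB per object stack with these assumptions
```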
MOPS memory usage is primarily set by the number of tracks (multi-night linkages of DIASources) that need to be kept in memory. It should be possible to distribute these across multiple nodes (perhaps up to 100), but the aggregate memory is substantial. The quantity is based on the number of tracks produced in the prototype implementation for a full-density solar system model, multiplied by 2 to reflect invalid tracks due to “noise” sources, and adjusted for the total sky area.