Instrument Science Report ACS-97-02

Data Compression for ACS

M. Stiavelli and R.L. White

November 1997 - DRAFT Version 1.0

ABSTRACT

The algorithm for on-board, on-the-fly compression of ACS data is briefly reviewed and its benefits are discussed. On the basis of this discussion we recommend a compression strategy and briefly outline a plan to establish the optimal compression factor. Once the planned tests are completed we will recommend an implementation strategy for compression during SMOV and the following cycles.

1. Introduction

A single ACS WFC image is 6.6 times larger than a full WFPC2 frame, owing to its larger number of pixels (4096 x 4096 rather than 1600 x 1600). Note that WFPC2 uses only 12 bits per pixel rather than the full 16 bits of a two-byte word, so that in terms of actual information content the ratio rises to 8.7. The 32 MByte buffer memory of ACS (34 MBytes including the "epsilon" memory) can accommodate just a single uncompressed WFC frame (or, alternatively, 16 uncompressed HRC or SBC images). The sheer size of these frames makes it worthwhile to explore the costs and benefits of on-board compression, as suggested by the ACS IDT. Note that there is currently no plan to compress the much smaller HRC and SBC images.

Although one might debate the relative advantages and disadvantages of lossy compression algorithms, the path followed by the ACS IDT was that of lossless compression, i.e., a compression scheme that is entirely reversible so that no data are lost. Such lossless compression is, for example, regularly carried out in the HST Archive in a way that is completely transparent to users.

From information theory we know that it is impossible to compress a generic image without loss of information, because the number of bits is, by definition, a measure of the information contained in the image. All lossless compression algorithms are therefore simply ways of reducing the size of some particular images at the expense of increasing the size of others, so that a suitably averaged size is unchanged. A good compression algorithm is very effective on the most commonly encountered data sets but is unable to compress rarely encountered ones. Thus, the effectiveness of a particular compression algorithm depends on the properties of the class of data sets to which it is applied; e.g., UNIX compress or gzip are very effective at compressing ASCII files but are much less effective on binary files. In Section 2 we briefly summarize the basic ideas behind the compression algorithm envisaged for ACS.

In order to be useful on HST, a compression algorithm must have a minimum guaranteed compression factor, so that software and memory allocations can be made under the assumption that each frame will never exceed some given size. For the reason discussed above, this cannot be guaranteed for a generic data set. However, astronomical images are not generic data sets but are characterized by some common properties, e.g., nearby pixels are correlated. In Section 3 we discuss how the proposed compression algorithm can provide useful minimum compression factors. Our plan to obtain a robust estimate of the minimum safe compression factor is given in Section 4. Our recommendations are summarized in Section 5. We expect to revise these recommendations after TV.
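The buffer and frame-size figures quoted above follow directly from the detector formats. The short Python sketch below reproduces them, assuming nominal formats (a 4096 x 4096, 16-bit WFC mosaic; a 1600 x 1600, 12-bit WFPC2 mosaic; 1024 x 1024, 16-bit HRC/SBC frames); it is illustrative arithmetic only, not flight or ground-system code.

    # Illustrative arithmetic only; all detector formats are assumed nominal values.
    WFC_PIX   = 4096 * 4096          # WFC pixels
    WFPC2_PIX = 1600 * 1600          # full WFPC2 mosaic pixels
    HRC_PIX   = 1024 * 1024          # HRC or SBC pixels

    pixel_ratio = WFC_PIX / WFPC2_PIX                # ~6.6, ratio in pixel counts
    info_ratio  = (WFC_PIX * 16) / (WFPC2_PIX * 12)  # ~8.7, ratio in actual bits

    buffer_bytes = 32 * 2**20                        # 32 MByte ACS buffer
    wfc_bytes    = WFC_PIX * 2                       # one WFC frame fills it exactly
    hrc_frames   = buffer_bytes // (HRC_PIX * 2)     # -> 16 uncompressed HRC/SBC frames

    print(round(pixel_ratio, 1), round(info_ratio, 1), hrc_frames)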
2. The Rice and the White pair algorithms

Since, as we have seen, any lossless compression algorithm must expand some data sets in order to compress others, it is reasonable to adopt a method best suited to the typical astronomical image. A common characteristic of astronomical (and many non-astronomical) images is the presence of correlations between neighbouring pixels. Such correlations are due to the intrinsic nature of the sources and to the fact that point spread functions tend to be wider than one pixel, particularly for ACS, which is better sampled than, e.g., WFPC2. Another common property of astronomical images is their relatively low filling factor, i.e., many pixels contain only sky and detector noise. The latter clearly does not hold when imaging bright nearby galaxies, but it does apply to distant galaxies and to stellar clusters in our own Galaxy.

The Rice algorithm, used in some space applications outside the Hubble project, takes advantage of these properties by subdividing an image into small blocks (typically 16 pixels each) and encoding, for each block or row, a starting value (in general the value of the first pixel in the block) and the difference between the value in each pixel and the starting value. These differences have their least significant bits dominated by noise and thus highly variable, while the remaining bits vary very little. The Rice algorithm leaves the highly variable bits uncompressed and compresses only the most significant, slowly varying ones. The reason the algorithm works is that the differences between neighbouring pixels are mostly clustered around zero: the probability of a difference ∆y falls off roughly as exp(-|∆y|/h), with a typical scale h that varies from image to image. It is this scale h that determines the separation between the uncompressed and the compressed bits. Note that cosmic ray hits are hard to compress in such a scheme, since they produce sharp changes in pixel value between neighbouring pixels.

Unfortunately this algorithm, even though very fast, was not fast enough for the 386 CPU installed in ACS. As a consequence, one of us (R. White) developed a new, faster algorithm loosely based on the same ideas but compressing pixel pairs. The inclusion of flags every 8 pixel pairs allows one to keep track of whether each particular pair has been compressed. Thus, pixel pairs that are hard to compress can be left uncompressed and flagged as such. This feature makes the new algorithm somewhat more robust against the occasional cosmic ray hit. On the other hand, the new algorithm cannot compress as much under the most favourable circumstances: the theoretical limit for the White pair algorithm approaches a factor of 4, while in practice the implementation considered for ACS, with a block size of 16 pixels, has a theoretical maximum compression rate of 32/9 ≈ 3.56. Benchmarks have shown that this algorithm allows compression at a rate of 1.2 × 10^5 pixels/s.
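To make the pair idea concrete, the sketch below implements a much-simplified pair-differencing coder with one flag byte per 8 pairs. It is only an illustration of the principle under stated assumptions (16-bit unsigned pixels, 4-bit signed differences per pixel); the bit-level details of the actual on-board algorithm are not reproduced here. The best-case bookkeeping, however, is consistent with the figures above: 8 one-byte pairs plus one flag byte per 16-pixel (32-byte) block gives 32/9 ≈ 3.56.

    # Simplified, illustrative pair-differencing scheme in the spirit of the
    # White pair algorithm; NOT the flight code.  A 16-pixel block is handled
    # as 8 pairs preceded by one flag byte.  A pair is "compressed" to a single
    # byte holding two 4-bit signed differences from the preceding pixel; pairs
    # whose differences do not fit are stored uncompressed (4 bytes) and left
    # unflagged.  Pixels are assumed to be 16-bit unsigned values.

    def compress_block(pixels, prev=0):
        """pixels: 16 ints in [0, 65535]; returns flag byte + packed pairs."""
        assert len(pixels) == 16
        flags, body = 0, bytearray()
        for i in range(8):
            a, b = pixels[2 * i], pixels[2 * i + 1]
            d1, d2 = a - prev, b - a
            if -8 <= d1 <= 7 and -8 <= d2 <= 7:
                flags |= 1 << i                              # pair compressed
                body.append(((d1 & 0xF) << 4) | (d2 & 0xF))  # two 4-bit diffs
            else:
                body += a.to_bytes(2, "big") + b.to_bytes(2, "big")
            prev = b
        return bytearray([flags]) + body

    def decompress_block(data, prev=0):
        """Invert compress_block; returns the 16 pixel values."""
        flags, pos, pixels = data[0], 1, []
        for i in range(8):
            if flags & (1 << i):                             # two 4-bit diffs
                hi, lo = data[pos] >> 4, data[pos] & 0xF
                pos += 1
                d1 = hi - 16 if hi > 7 else hi               # sign-extend
                d2 = lo - 16 if lo > 7 else lo
                a, b = prev + d1, prev + d1 + d2
            else:                                            # uncompressed pair
                a = int.from_bytes(data[pos:pos + 2], "big")
                b = int.from_bytes(data[pos + 2:pos + 4], "big")
                pos += 4
            pixels += [a, b]
            prev = b
        return pixels

    # A mostly smooth block with one cosmic-ray-like spike: 7 pairs compress,
    # 1 pair is stored verbatim, giving 1 + 7 + 4 = 12 bytes instead of 32.
    block = [3100, 3102, 3101, 3105, 3103, 3104, 7800, 3106,
             3107, 3105, 3106, 3108, 3107, 3109, 3110, 3111]
    packed = compress_block(block, prev=3100)
    assert decompress_block(packed, prev=3100) == block      # fully reversible
    print(len(packed))                                       # -> 12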
3. Guaranteed Compression Factors

The White pair algorithm achieves compression factors that depend largely on the noise level of the frames. For internal frames (darks and biases) the noise properties are rather well known and we should expect high compression factors, certainly exceeding 2 and probably exceeding 3. In general, we should expect images taken with A-to-D gains of 2 or higher to be easier to compress than those taken at gain=1: a higher gain samples the image noise more coarsely and thus performs, for all practical purposes, a lossy compression in hardware. Similarly, we should expect long exposures and exposures with a high mean signal level to be more difficult to compress; the former have more pixels affected by cosmic rays, while the latter have a higher (Poisson) noise level and thus larger variations from pixel to pixel. It is likely that typical minimum guaranteed compression factors of about 2 or higher will be achieved.

In order to obtain robust estimates of the compression factors that can be achieved, it will be necessary to carry out an extensive set of experiments. So far one of us (R. White) has run a number of tests on real WFPC2 images, in particular:

• a dense star cluster
• a big, field-filling elliptical galaxy
• a deep HDF exposure
• a long, low-S/N UV exposure
• several short exposures

These tests always showed compression factors exceeding 2. More tests will be carried out at Ball. However, already at this stage we can see that an optimal strategy would involve the definition of a table specifying a guaranteed compression level as a function of gain setting, filter, and exposure time. Should such an optimal strategy prove unfeasible, the usefulness of on-the-fly data compression will depend on whether detailed simulations show that our expectation of minimum guaranteed compression factors exceeding 2 can indeed be met.

One concern about the actual implementation has to do with the number of amplifiers used simultaneously for the WFC readout. The planned scheme is to compress the output of one amplifier on the fly and to compress the other three in the buffer; compressing one fourth of the data on the fly frees enough space to start the compression of the remaining outputs. In the case of two-amplifier operation the CPU load for on-the-fly compression would probably be too high (about 97 per cent), and one would then probably be unable to compress the data for lack of buffer space. Subarrays are always compressed on the fly since they are read out with a single amplifier.

It is worth noting that in the current implementation of the compression algorithm the minimum guaranteed compression is also the maximum, since buffer space is allocated according to the minimum guaranteed compression rate and any unused space (in case of better compression) is padded. If the guaranteed compression is not achieved on a single 2048-pixel data segment, some data are lost and compression continues on the following segments. Clearly such losses, if rare, can be acceptable for calibration images such as biases or earth flats, but they would be unacceptable for science data, which must therefore use a conservative compression factor. According to current plans the compression factor is an Engineering-only Phase II parameter.

4. A Strategy for Deriving the Compression Factor

The previous discussion makes it clear that deriving a firm minimum compression factor is very important. For this reason we have sketched a test plan aimed at obtaining and verifying such a quantity. In order to carry out the tests on images that are as realistic as possible, we suggest the following:

• the tests should be carried out on artificial images simulating real ACS images as closely as possible;
• the artificial images should be constructed from WFC dark frames and should include, in addition to the astronomical objects and background, source and background Poisson noise, read-out noise, and dark current;
• given the higher cosmic ray rate observed on orbit, additional cosmic rays will have to be added to the artificial images;
• the procedures to produce and test the artificial images will have to be collected in suitable scripts, so as to make it easy to repeat them if and when improved dark frames (e.g. from thermal vac and, later, SMOV) become available (a minimal sketch of such a script is given at the end of this section);
• test images will be produced to simulate the whole range of HST targets (galaxies, globular clusters, planetary nebulae, deep exposures, clusters of galaxies, biases, darks, internal and earth flat fields, etc.);
• tests will probably need to be repeated on real data during SMOV, e.g., by observing a galactic globular cluster and a nearby elliptical with ACS.

If the minimum compression factor estimated from the synthetic image tests is found to be significantly in excess of 2, one could baseline the use of compression already for Cycle 9, after it has been verified in SMOV. Should the compression factor be found, unexpectedly, to be close to or lower than 2, we should revise the HST DRM for Cycle 9 (which assumes a compression factor of 2) and perhaps consider delaying the beginning of routine compression to a later cycle.
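A possible starting point for such a script is sketched below in Python/NumPy. All numerical parameters (sky level, read noise, gain, dark rate, cosmic ray rate and charge) are placeholders chosen only for illustration and would have to be replaced by measured WFC values; likewise, the object scene and dark frame would come from a simulator and from real dark frames rather than the toy inputs shown here.

    import numpy as np

    def make_test_image(objects_e_per_s, dark_e_per_s, exptime=1000.0,
                        sky_e_per_s=0.05, read_noise_e=4.0, gain=2.0,
                        cr_frac_per_ks=0.015, cr_charge_e=2000.0, rng=None):
        """Build one synthetic frame in DN for compression tests.

        objects_e_per_s : 2-D array, astronomical scene in e-/s (from a simulator)
        dark_e_per_s    : 2-D array, dark current in e-/s (from real WFC darks)
        All other defaults are illustrative placeholders, not measured ACS values.
        """
        rng = np.random.default_rng() if rng is None else rng
        shape = objects_e_per_s.shape

        # Poisson noise on source + sky background + dark current (electrons)
        mean_e = (objects_e_per_s + sky_e_per_s + dark_e_per_s) * exptime
        frame_e = rng.poisson(mean_e).astype(np.float64)

        # Extra cosmic rays, crudely modelled as single-pixel charge deposits
        n_cr = rng.poisson(cr_frac_per_ks * (exptime / 1000.0) * frame_e.size)
        ys = rng.integers(0, shape[0], n_cr)
        xs = rng.integers(0, shape[1], n_cr)
        frame_e[ys, xs] += rng.exponential(cr_charge_e, n_cr)

        # Read noise, then conversion to DN at the chosen A-to-D gain
        frame_e += rng.normal(0.0, read_noise_e, shape)
        return np.clip(frame_e / gain, 0, 65535).astype(np.uint16)

    # Toy inputs standing in for a simulated scene and a measured dark frame.
    scene = np.zeros((512, 512)); scene[200:210, 300:310] = 5.0   # fake source, e-/s
    dark  = np.full((512, 512), 0.002)                            # placeholder e-/s
    frame = make_test_image(scene, dark, exptime=1000.0)

Each frame produced in this way would then be run through the pair compressor to record the achieved compression factor for the corresponding target class and instrument setting.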
5. Recommendations

Assuming that the synthetic image tests confirm that compression factors in excess of 2 can be obtained, our tentative recommendations, in order of priority, can be summarized as follows:

1. Adopt an exposure-dependent minimum compression rate. A table would need to be implemented in software to identify the compression rate; probably three compression settings (low, medium, high) would be sufficient. Depending on the structure of the available and planned software, this solution may prove too expensive.
2. Adopt a special, gain-dependent minimum compression factor for calibration observations (darks, biases, flats) and a standard, gain-dependent minimum compression rate for all other observations.
3. Adopt a gain-dependent minimum compression factor for all images.
4. Adopt a constant minimum compression rate.

We expect to be able to formulate final recommendations after TV (when the final flight CCD parameters will be determined). We believe that compression is essential for proper parallel operation of ACS (see, e.g., ISR ACS-97-01) and, therefore, that it should be abandoned only if serious problems with the algorithm or its implementation are uncovered.

6. Acknowledgements

Many thanks to Chris Blades for useful discussions.

7. References

Advanced Camera for Surveys (ACS) Science Operations Requirements Document - Part B (Op-01), version 10/27/97, Sections 2.1.4, 3.1, and 3.5.

HST Reference Mission for Cycle 9 and Ground System Requirements, ACS ISR-97-01.

Flight Software CDR