Note on the VAO Portal: Image Access Deliverables

The PEP calls for: "A new VAO search portal will be deployed within the framework of the VAO Web site. The first portal release will provide the science user with basic search and retrieval capability through a gateway to VAO-enabled applications and services ... Finally, the portal will support fast access to image data collections over wide areas." This note refers to the last part of that deliverable.

Use Cases

There are four use cases that have been given to me by scientists. The first two have been distributed in e-mails.

1. I am planning an observing proposal to perform a near-infrared brightness-limited survey of [type omitted on request] stars in the 30 x 30 degree area of the Taurus Dark Cloud. I propose to use Spitzer to get images, extract sources, and then do ground-based spectroscopic follow-up in the visible and near-IR. I want to know what observations have been made of this area, specifically:
- what images have been taken from 1 to 10 microns, and with what instruments?
- which parts of the cloud have been covered?
- what spectroscopic observations exist of [type] stars in the Taurus Cloud between wavelengths of 3000 A and 5 microns?
"I was disappointed that after many years of development, there is no NVO service that even comes close to this type of capability."

2. I want to study the rate of massive star formation in the plane of the Galaxy. I want to discover what image data sets are available in 20 regions of the plane that contain massive molecular cloud complexes, or which will act as control fields, in the wavelength range 3000 A to 1 mm. Each of these regions is about 20 deg x 3 deg in size. Then I want to extract fluxes from them with my algorithm. But all these images will have different pixel scales, etc.
I want to have the capability to compute images with the same sampling, pixel scale, etc., and then download them.

3. I am studying the dust in our Galaxy to develop a modern model of Galactic extinction, essentially updating the model used in the Schlegel, Finkbeiner & Davis (1998) maps. To start, I have about 50 regions of the sky, at |b| > 7 deg, that I want to look at; they are generally 5-10 deg on a side. I would like to examine these regions over many wavelengths to characterize the diversity of dust emission. I would like to start by grabbing all the IR and microwave data, plus the Schlegel data, for all these regions. I would like to be able to visualize all these data and download them. Now, all these data are in different projections, and the microwave data are in HEALPix. I have some code that will put everything in the same projection, but it's typical astronomer code. A service that could do all these reprojections for me would be great.

4. I am interested in finding new high proper motion stars that are candidate brown dwarfs. To do this, I would like to be able to cross-match SDSS and 2MASS (and WISE too, when it is released), find candidates that meet particular color and other criteria, and then look at all the visual and near-IR images available for them in chronological order. And tell me the times of the measurements too. Now, the interesting part of this comes in looking at the images. Some of them are big, like the DPOSS plates, and some are tiny, like the 2MASS images. The little ones may well miss high PM candidates. I know you guys have a mosaic engine. Can you stitch the smaller images together so that I can see images of uniform size across many bands? The images don't all have to have the same sampling for this exercise – the scale is what matters.

Requirements

This is just a very quick summary of the requirements implied by the above; it is not intended as a formal statement of requirements.

1. Input a single source or a list of sources. Source lists may run to several hundred entries, but there is no reason why someone won't have lists of thousands.
2. Select attributes of images for searches:
   a. Wavelength range
   b. Named specific data sets
   c. Time ranges
3. Input by target name or position; the former implies requirements on name resolution; the latter implies we need to be able to handle all common coordinate systems.
4. Must support spatial areas of 30 deg on a side. I am inclined to think that the search area we should support is the spatial size of the Galactic Plane – say 360 x 10 = 3600 sq deg.
5. Searches need to be performed and results returned in real time. Need to figure out the scale – how many images can be found in these searches?
6. Must be able to package and download data.
7. Return tables of sources and their metadata.
8. Ability to sort and filter images by metadata.
9. Must have uniform access to many data sets – this includes all the "biggies" as far as I can tell.
10. Overlay image footprints on images.
11. Visualize images – the usual suspects of panning and zooming.
12. Capability to reproject images onto a common set of projections or scales.
13. Capability to create mosaics (or do cutouts of existing mosaics).

Design Notes

1. Input via a web form is easiest.
2. Many of the components are in place:
   a. Need fast inventories of the type supported by the inventory service – need to understand the limits of its scalability.
   b. Table filtering – if done in the browser, how does it scale? Can it handle the scale of results here?
   c. Various visualizers and footprint services.
   d. Reprojection and mosaicking codes.
3. The requirements imply uniform access to data – providers need to be willing to expose metadata.

An Image Access Interface for the LSST – A "Proof-of-Concept" Prototype for a VAO Image Access Service

A service has been mounted at IPAC to support image access for the LSST; it responds to specifications provided by LSST as part of LSST Data Challenge 3b.
It offers access to many different image data sets from a simple web interface: users enter a position and radius, and are returned a list of images by project, image metadata, and of course the images themselves. Users can visualize images with DS9. Check it out at https://osiris.ipac.caltech.edu/cgi-bin/LSST/nph-lsst (Username: LSST; Password: Big-Sky). Intended for evaluation by VAO only; do not distribute. The JPEG below summarizes the features (from the LSST Handbook).

Here are some design notes compiled by John Good, who led the development effort:

General Background

The basic function of the image inventory is to find images overlapping a location or region as quickly as possible. Since the objects being searched are extended and the search is multi-dimensional, this is best done with an indexing scheme optimized for the problem (e.g. an R-Tree). Note that at this level all images are essentially the same: bounded regions of the sky. Multiple image sets, and indeed other data like spectral slit outlines, can be combined into a single index, greatly speeding up cross-dataset queries.

In general, there is very little efficiency to be gained by trying to optimize spatial searching with relational constraints on other image attributes. Given both kinds of constraints, it is better to do one, then the other. In real-world applications, so long as the spatial constraint is actually useful (i.e. not something like "everything not in the LMC"), it is almost always predominant, and the relational constraints can be applied as a second pass. We do this using in-line SQLite processing.

Salient Details

I'll give these in bullet-like form; I think in most cases the rationale is obvious:

Input to the process is a set of simple ASCII tables (VOTable, IPAC ASCII, etc.).
These can contain any relational information the supplier desires, though they must contain coverage information, either in the form of complete WCS parameters or the "four corners" (RA and Dec J2000) coordinates of a bounding box. Usually this bounding box is a good enough representation of the image/slit that no further filtering is needed, though obviously any sort of fancier footprint post-processing could be applied.

In addition to the spatial information, we require that at least one column in each input table be a URL linking to the actual data file. There can in fact be several URLs if desired (e.g. to a JPEG of the image, to a proposal abstract, etc.).

These tables are converted to a fixed-length record format to speed extraction of subsets of records later, though they remain in ASCII form for readability. FITS tables could have been used, but wouldn't improve efficiency anywhere.

Any collection of tables can be used together to create a single set of index files.

For the R-Tree, we need the flexibility of in-memory addressing, but these structures can take up many GBytes, and we would rather not have to keep this infrastructure running as daemon processes, nor pay the startup penalty this would imply. Instead, the index files are memory-mapped so they "load" instantly, relying on OS paging for efficient memory management. This gives us the best of both worlds (memory and files), since we could, if we desired, put the data truly into resident memory by setting up a RAM disk.

In order to allow for essentially unlimited scaling, we have in addition implemented a multi-threaded, multi-server layer on top of the basic R-Tree service. While a single index could in principle be Terabytes in size, it is more convenient to break the data into pieces and parallelize the processing. Besides keeping index file sizes reasonable, this allows us to separate dynamic sets from static ones, etc.
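The fixed-length-record and memory-mapping ideas above can be sketched in a few lines of Python. This is an illustration only, not the actual IPAC implementation; the record width and field layout are invented for the example:

```python
import mmap

# Fixed record width in bytes, including the trailing newline (invented for this sketch).
REC_LEN = 64

# Write a tiny table of fixed-length ASCII records, space-padded to REC_LEN.
records = [
    "img001  123.4567  -45.6789  http://example.org/img001.fits",
    "img002  123.9876  -45.1234  http://example.org/img002.fits",
]
with open("index.tbl", "w") as f:
    for rec in records:
        f.write(rec.ljust(REC_LEN - 1) + "\n")

def fetch_record(path, i, rec_len=REC_LEN):
    """Jump straight to record i via the memory map -- no parsing or scanning,
    which is what makes subset extraction fast as the index identifies matches."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return mm[i * rec_len:(i + 1) * rec_len].decode("ascii").rstrip()
        finally:
            mm.close()

print(fetch_record("index.tbl", 1))  # second record, fetched in O(1)
```

Because the mapping is demand-paged by the OS, nothing is "loaded" until a record is actually touched, which is the property that lets a multi-GByte index start serving instantly without a resident daemon.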
There are four main searches:

1. Get a list of image sets matching the spatial constraint, with counts of the matching images in each set.
2. Get the subset of image records for a given set for the above region. This is where the fixed-length records come into play, as they allow us to seek/read records as the index search identifies them.
3. If the user has a table of locations, find how many of them have matching data anywhere in the index (not very useful if the index covers many input tables, but very useful if it is built from a single source).
4. For a user table and a specific indexed dataset, make a list of the user records where there are matches.

There are probably other searches that would be useful; these are the ones that have had real application so far.

The LSST interface augments the above with GUI functionality to do the obvious post-processing. Given a region, it presents the results of the first search above as a small table. Selecting one of the datasets retrieves and shows its records using the second query. This tabular display allows smooth paging/scrolling for large numbers of records, and can also sort and apply SQL filtering to any set of columns. The table display also presents URLs as links, so data can be retrieved directly.

Missing from this prototype is a multi-file retrieval mechanism, though the table itself supports multiple record selection. The two principal omissions are the bulk download capability and program-friendly access to the relational filtering, both of which would be easy to implement.

At the moment, ingest is handled manually; what we really need is a simple harvesting mechanism (locations and schedules from which data can be retrieved, and a mechanism for identifying "new" records in datasets that are growing).
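The spatial-first, relational-second strategy described in the General Background can be sketched as below. The linear bounding-box scan here is only a stand-in for the real R-Tree search, and all table names, columns, and values are invented for the example; the second pass uses in-memory SQLite, as the notes describe:

```python
import sqlite3

# Toy image metadata: (id, ra_min, ra_max, dec_min, dec_max, dataset, wavelength_um).
# All names and values are invented for this sketch.
images = [
    ("a1", 10.0, 12.0, -1.0, 1.0, "2MASS", 1.25),
    ("a2", 11.0, 13.0,  0.0, 2.0, "2MASS", 2.17),
    ("a3", 50.0, 52.0, -1.0, 1.0, "SDSS",  0.48),
]

def spatial_pass(ra_min, ra_max, dec_min, dec_max):
    """Stand-in for the R-Tree search: return records whose bounding box
    overlaps the query box. A real index answers this without a full scan."""
    return [rec for rec in images
            if rec[1] < ra_max and rec[2] > ra_min
            and rec[3] < dec_max and rec[4] > dec_min]

def relational_pass(candidates, where, params=()):
    """Second pass: load the spatial candidates into an in-memory SQLite
    table and apply the user's relational constraints as ordinary SQL."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE img (id TEXT, ra1 REAL, ra2 REAL, "
               "dec1 REAL, dec2 REAL, dataset TEXT, wavelength REAL)")
    db.executemany("INSERT INTO img VALUES (?,?,?,?,?,?,?)", candidates)
    rows = db.execute("SELECT id FROM img WHERE " + where, params).fetchall()
    db.close()
    return [r[0] for r in rows]

# Spatial constraint first, then relational constraints on the survivors.
hits = relational_pass(spatial_pass(10.5, 11.5, -0.5, 0.5),
                       "dataset = ? AND wavelength > ?", ("2MASS", 2.0))
print(hits)
```

Because the spatial constraint is usually the predominant one, the candidate set handed to SQLite is small, so the relational pass stays cheap even with arbitrary SQL filters.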