- VAO Product Development

Note on the VAO Portal: Image Access
Deliverables
The PEP calls for …
“A new VAO search portal will be deployed within the framework of the VAO Web
site. The first portal release will provide the science user with basic search and
retrieval capability through a gateway to VAO-enabled applications and services ….
Finally, the portal will support fast access to image data collections over wide
areas.”
This note refers to the last part of the deliverable.
Use Cases
There are four use cases that have been given to me by scientists. The first two have
been distributed in e-mails.
1. I am planning an observing proposal to perform a near-infrared
brightness-limited survey of [type omitted on request] stars in the 30 x 30
degree area of the Taurus Dark Cloud. I propose to use Spitzer to get images,
extract sources, and then do ground-based spectroscopic follow-up in the
visible and near-IR. I want to know what observations have been made of
this area, specifically:
- what images have been measured from 1 to 10 microns and by what
instruments?
- which parts of the cloud have been covered?
- what spectroscopic observations exist of [type] stars in the Taurus Cloud
between wavelengths of 3000 A and 5 microns?
"I was disappointed that after many years of development, there is no NVO
service that even comes close to this type of capability"
2. I want to study the rate of massive star formation in the plane of the
Galaxy. I want to find out what image data are available in 20 regions of the
plane that contain massive molecular cloud complexes, or which will act as
control fields, in the wavelength range 3000 A to 1 mm. Each of these
regions is about 20 deg x 3 deg in size. I want to discover what image data
sets are available in those areas in the wavelength range 3000 A to 1 mm.
Then I want to extract fluxes from them with my algorithm. But all these
images will have different pixel scales, etc. I want to have the capability to
compute images with the same sampling, pixel scale etc., and then download
them.
3. I am studying the dust in our Galaxy to develop a modern model of Galactic
Extinction, essentially updating the model used in the Schlegel, Finkbeiner &
Davis (1998) maps. To start, I have about 50 regions of the sky, at |b|>7 deg,
that I want to look at; they are generally 5-10 deg on a side. I would like to
examine these regions over many wavelengths to characterize the diversity
of dust emission. I would like to start by grabbing all the IR and microwave
data in the infrared, plus the Schlegel data, for all these regions. I would like
to be able to visualize all these data and download them. Now all these data
are in different projections and the microwave data are in Healpix. I have
some code that will put everything in the same projection, but it’s typical
astronomer code. A service that could do all these reprojections for me
would be great.
4. I am interested in finding new high proper motion stars that are candidate
brown dwarfs. To do this, I would like to be able to cross-match SDSS and
2MASS (and WISE too when it is released), find candidates that meet
particular color criteria and other criteria, and then look at all the visual and
near-IR images available for them in chrono order. And tell me the times of
the measurements too. Now the interesting part of this comes in looking at
the images. Some of them are big, like the DPOSS plates, and some are tiny,
like the 2MASS images. The little ones may well miss high PM candidates. I
know you guys have a mosaic engine. Can you stitch the smaller images
together so that I can see images of uniform size across many bands? The
images don’t all have to have the same sampling for this exercise – the scale
is what matters.
Requirements
This is just a very quick summary of the requirements implied by the above; this is
not intended as a formal statement of requirements.
1. Input a single source or a list of sources. Source lists may hold several hundred
sources, but there is no reason why someone won't have lists of thousands.
2. Select attributes of images for searches
a. Wavelength range
b. Name specific data sets
c. Time ranges
3. Inputs by target name or position; former implies requirements on name
resolution; latter implies we need to be able to handle all common coord
systems.
4. Must support spatial areas of 30 deg on a side. I am inclined to think that the
search area we should support should be the spatial size of the Galactic Plane
– say 360 x 10 = 3600 deg sq
5. Searches need to be done and returned in real time. Need to figure out scale:
how many images can be found in these searches?
6. Must be able to package and download data.
7. Return tables of sources and their metadata
8. Ability to sort and filter images by metadata.
9. Must have uniform access to lots of data sets – includes all the “biggies” as far
as I can tell.
10. Overlay image footprints on images.
11. Visualize images – usual suspects of panning, zooming.
12. Ability to reproject images on to a common set of projections or scales
13. Capability to reproject images
14. Capability to create mosaics (or do cutouts of existing mosaics).
Design Notes
1. Input via a web form easiest.
2. Many of the components are in place:
a. Need fast inventories of the type supported by the inventory service –
need to understand limits of its scalability.
b. Table filtering – if done in the browser, how does it scale? Can it handle
the scale of results here?
c. Various visualizers and footprint services
d. Reprojection and mosaicking codes.
3. Requirements imply uniform access to data – providers need to be willing to
expose metadata.
An Image Access Interface for the LSST – A “Proof-Of-Concept”
Prototype for a VAO Image Access Service
A service has been mounted at IPAC to support image access for the LSST, and
responds to specifications provided by LSST as part of the LSST Data Challenge 3b.
It offers access to many different image data sets from a simple web interface, where
users enter a position and radius, and are returned a list of images by project, image
metadata, and of course images. Users can visualize images with DS9.
Check it out at https://osiris.ipac.caltech.edu/cgi-bin/LSST/nph-lsst
Username: LSST; Password: Big-Sky. Intended for evaluation by VAO only; do not
distribute.
The JPEG below summarizes the features (from the LSST Handbook).
Here are some design notes compiled by John Good, who led the development effort:
General Background
The basic function of the image inventory is to find images overlapping a location
or region as quickly as possible. Since the objects being searched are extended
and the search is multi-dimensional, this is best done with an indexing scheme
optimized for the problem (e.g. an R-Tree).
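As a rough illustration of the idea, here is a minimal sketch of a bulk-packed bounding-box tree written in the spirit of an R-Tree. It is not the IPAC implementation: the names Node, build and search, the fanout value, and the choice to sort on the minimum RA before packing are all assumptions made for this example. Boxes are (ra_min, ra_max, dec_min, dec_max) in degrees, and RA wraparound is ignored.

```python
class Node:
    """A tree node: leaves carry an item, internal nodes carry children."""
    def __init__(self, box, children=None, item=None):
        self.box = box
        self.children = children or []
        self.item = item

def union(boxes):
    """Smallest box enclosing all the given boxes."""
    return (min(b[0] for b in boxes), max(b[1] for b in boxes),
            min(b[2] for b in boxes), max(b[3] for b in boxes))

def overlaps(a, b):
    """True if two (ra_min, ra_max, dec_min, dec_max) boxes intersect."""
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]

def build(items, fanout=4):
    """Pack a non-empty list of (box, payload) pairs into a tree, bottom-up."""
    ordered = sorted(items, key=lambda it: it[0][0])   # sort on ra_min
    nodes = [Node(box, item=payload) for box, payload in ordered]
    while len(nodes) > 1:
        groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
        nodes = [Node(union([n.box for n in g]), children=g) for g in groups]
    return nodes[0]

def search(node, query):
    """Return payloads of all leaves whose box overlaps the query box."""
    if not overlaps(node.box, query):
        return []            # prune the whole subtree
    if not node.children:
        return [node.item]   # leaf
    found = []
    for child in node.children:
        found.extend(search(child, query))
    return found

# Hypothetical image outlines, labelled a to e.
items = [((10, 11, 0, 1), "a"), ((10.5, 11.5, 0.5, 1.5), "b"),
         ((200, 201, -30, -29), "c"), ((50, 51, 20, 21), "d"),
         ((300, 301, 5, 6), "e")]
tree = build(items, fanout=2)
found = sorted(search(tree, (10.6, 10.7, 0.6, 0.7)))
```

The point of the hierarchy is that a query discards whole subtrees whose enclosing box misses the search region, so the cost grows with the number of matches rather than with the size of the index.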
Note that at this level all images are essentially the same: bounded regions of the
sky. Multiple image sets, and indeed other data like spectral slit outlines, can
be combined into a single index, greatly speeding up cross-dataset queries.
In general, there is very little efficiency to be gained by trying to optimize
spatial searching with relational constraints on other image attributes. Given
both kinds of constraints, it is better to do one, then the other. In real
world applications, so long as the spatial constraint is actually useful
(e.g. not things like "everything not in the LMC") it is almost always predominant
and the relational constraints can be applied as a second pass. We do this using
in-line SQLite processing.
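A minimal sketch of this two-pass approach follows, using Python's standard sqlite3 module. The record layout, column names and URLs are made up for illustration, and a linear scan stands in for the R-Tree lookup.

```python
import sqlite3

# Hypothetical image records:
# (dataset, ra_min, ra_max, dec_min, dec_max, band, url)
IMAGES = [
    ("2MASS", 64.0, 64.2, 25.0, 25.2, "J", "http://example.org/a.fits"),
    ("2MASS", 64.1, 64.3, 25.1, 25.3, "K", "http://example.org/b.fits"),
    ("SDSS", 120.0, 120.2, -5.0, -4.8, "r", "http://example.org/c.fits"),
]

def spatial_pass(images, ra, dec):
    """First pass: keep images whose bounding box contains the point.
    A linear scan stands in for the index; RA wraparound is ignored."""
    return [im for im in images
            if im[1] <= ra <= im[2] and im[3] <= dec <= im[4]]

def relational_pass(candidates, where_clause, params=()):
    """Second pass: load the spatial matches into an in-memory SQLite
    table and apply the relational constraint there."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE img (dataset TEXT, ra_min REAL, ra_max REAL, "
                "dec_min REAL, dec_max REAL, band TEXT, url TEXT)")
    con.executemany("INSERT INTO img VALUES (?,?,?,?,?,?,?)", candidates)
    rows = con.execute(
        "SELECT dataset, band, url FROM img WHERE " + where_clause,
        params).fetchall()
    con.close()
    return rows

hits = spatial_pass(IMAGES, 64.15, 25.15)          # spatial constraint first
k_band = relational_pass(hits, "band = ?", ("K",))  # then the relational one
```

Because the spatial pass usually cuts the candidate list down to a small number of records, the SQLite table is cheap to build on each request.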
Salient Details
I'll give these in bullet-like form. I think in most cases the rationale is obvious:
Input to the process is a set of simple ASCII tables (VOTable, IPAC ASCII, etc.).
These can contain any relational information the supplier desires, though they
must contain coverage information either in the form of complete WCS parameters
or the "four corners" (RA and Dec J2000) coordinates of a bounding box. Usually
this bounding box is a good enough representation of the image/slit that no
further filtering is needed though obviously any sort of fancy footprint
post processing could be applied.
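As an illustration only, the sketch below derives an RA/Dec bounding box from four corners and tests a point against it. The function names are hypothetical, and the handling of images that straddle RA = 0 (reporting ra_max above 360) is an assumption of this example; images covering a celestial pole are not handled.

```python
def bounding_box(corners):
    """corners: four (ra, dec) pairs in degrees (J2000).
    Returns (ra_min, ra_max, dec_min, dec_max). If the image straddles
    RA = 0, the box is shifted so that ra_max exceeds 360."""
    ras = [ra % 360.0 for ra, _ in corners]
    decs = [dec for _, dec in corners]
    if max(ras) - min(ras) > 180.0:
        # Corners lie on both sides of RA = 0: unwrap the small values.
        ras = [ra + 360.0 if ra < 180.0 else ra for ra in ras]
    return min(ras), max(ras), min(decs), max(decs)

def contains(box, ra, dec):
    """True if the point (degrees) falls inside the bounding box."""
    ra_min, ra_max, dec_min, dec_max = box
    ra = ra % 360.0
    if ra < ra_min:
        ra += 360.0   # the point may sit on the unwrapped side of the box
    return ra_min <= ra <= ra_max and dec_min <= dec <= dec_max

# A one-degree image centred on RA = 0, Dec = 0.
box = bounding_box([(359.5, -0.5), (0.5, -0.5), (0.5, 0.5), (359.5, 0.5)])
inside = contains(box, 0.2, 0.0)
outside = contains(box, 10.0, 0.0)
```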
In addition to the spatial information, we require that at least one column in
each input table be a URL linking to the actual data file. There can in fact
be several URLs if desired (e.g. to a JPEG of the image, to a proposal abstract,
etc.)
These tables are converted to a fixed-length record format to speed extraction
of subsets of records later, though still in ASCII form for readability. FITS
tables could have been used but wouldn't improve efficiency anywhere.
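The benefit of fixed-length records can be sketched as follows: because every record occupies the same number of bytes, record N starts at byte N times the record length and can be read with a single seek. The field widths and column choice here are assumptions for the example, not the actual format used by the service.

```python
import io

def pack(dataset, ra, dec, url):
    """Format one record as a fixed-width ASCII line.
    Widths: dataset 10, ra 12, dec 12, one space, url 40, newline."""
    line = "{:<10.10}{:>12.6f}{:>12.6f} {:<40.40}\n".format(
        dataset, ra, dec, url)
    return line.encode("ascii")

RECORD_LEN = len(pack("", 0.0, 0.0, ""))   # same for every record

def read_record(fh, index):
    """Seek directly to record `index` and parse its fields."""
    fh.seek(index * RECORD_LEN)
    raw = fh.read(RECORD_LEN).decode("ascii")
    return (raw[0:10].strip(), float(raw[10:22]),
            float(raw[22:34]), raw[35:75].strip())

# An in-memory buffer stands in for the record file on disk.
store = io.BytesIO()
for rec in [("2MASS", 64.1, 25.1, "http://example.org/a.fits"),
            ("SDSS", 120.5, -4.9, "http://example.org/b.fits"),
            ("WISE", 359.25, 89.0, "http://example.org/c.fits")]:
    store.write(pack(*rec))

third = read_record(store, 2)   # no need to scan the first two records
```

The records stay human-readable, and the index only needs to store a record number for each image rather than a byte offset and length.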
Any collection of tables can be used together to create a single set of index
files. For the R-Tree, we need the flexibility of in-memory addressing but
these structures can take up many GBytes and we would rather not have to
keep this infrastructure running as daemon processes nor pay the startup
penalty this would imply. Instead, the index files have been memory-mapped
so they "load" instantly and rely on the OS paging for efficient memory
management. This gives us the best of both worlds (memory and files) since
we could, if we desired, put the data truly into resident memory by setting
up a RAM disk.
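The memory-mapping idea can be sketched with Python's standard mmap and struct modules. The binary layout (four little-endian doubles per image) and the file name are assumptions for illustration, and a flat scan replaces the R-Tree traversal to keep the example short.

```python
import mmap
import os
import struct
import tempfile

# One index entry: ra_min, ra_max, dec_min, dec_max as little-endian doubles.
BOX = struct.Struct("<4d")

def write_index(path, boxes):
    """Write the bounding boxes to a flat binary index file."""
    with open(path, "wb") as fh:
        for box in boxes:
            fh.write(BOX.pack(*box))

def search_index(path, ra, dec):
    """Map the file into memory and return entry numbers containing the point.
    Nothing is read up front; the OS pages data in as it is touched."""
    with open(path, "rb") as fh, \
            mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        hits = []
        for i in range(len(mm) // BOX.size):
            ra_min, ra_max, dec_min, dec_max = BOX.unpack_from(mm, i * BOX.size)
            if ra_min <= ra <= ra_max and dec_min <= dec <= dec_max:
                hits.append(i)
        return hits

with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "images.idx")
    write_index(index_path, [(10, 11, 0, 1),
                             (10.5, 11.5, 0.5, 1.5),
                             (200, 201, -30, -29)])
    matches = search_index(index_path, 10.7, 0.7)
```

Each request opens and maps the file afresh, so no daemon has to stay resident; repeated requests are fast because the OS keeps recently used pages cached.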
In order to allow for essentially unlimited scaling, we have in addition
implemented a multi-threaded multi-server layer on top of the basic R-Tree
service. While a single index could in principle be Terabytes in size it is
more convenient to break the data into pieces and parallelize the processing.
Besides keeping index file sizes reasonable, this allows us to separate
dynamic from static sets, etc.
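A minimal sketch of the fan-out-and-merge pattern is shown below. Here each "shard" is just an in-memory list and threads stand in for the separate servers; the shard contents and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Each shard holds part of the index:
# (dataset, ra_min, ra_max, dec_min, dec_max)
SHARDS = [
    [("2MASS", 10, 11, 0, 1), ("2MASS", 10.5, 11.5, 0.5, 1.5)],
    [("SDSS", 10.6, 10.8, 0.6, 0.8), ("SDSS", 200, 201, -30, -29)],
]

def search_shard(shard, ra, dec):
    """Count matching images per dataset within one shard."""
    counts = Counter()
    for dataset, ra_min, ra_max, dec_min, dec_max in shard:
        if ra_min <= ra <= ra_max and dec_min <= dec <= dec_max:
            counts[dataset] += 1
    return counts

def search_all(shards, ra, dec, workers=4):
    """Query every shard in parallel and merge the per-dataset counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for counts in pool.map(lambda s: search_shard(s, ra, dec), shards):
            total.update(counts)
    return total

result = search_all(SHARDS, 10.7, 0.7)
```

Since the per-shard results are simple counts or record lists, merging them is cheap, which is what makes splitting a very large index into pieces practical.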
There are four main searches:
1. Get a list of image sets matching the spatial constraint, with counts of the
matching images in each set.
2. Get the subset of image records for a given set for the above
region. This is where the fixed-length records above come into
play as they allow us to seek/read records as the index search
identifies them.
3. If the user has a table of locations, find how many of them
had matching data anywhere in the index (not too useful if the
index covers many input tables but very useful if it is built
from a single source).
4. For a user table and a specific indexed dataset, make a list of
the user records where there are matches.
There are probably other searches that would be useful; those are the ones
that have had real application so far.
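The two table-driven searches can be sketched as follows, with a small in-memory list standing in for the index. The index entries, the user table and the function names are all hypothetical.

```python
# Index entries: (dataset, ra_min, ra_max, dec_min, dec_max)
INDEX = [
    ("2MASS", 10, 11, 0, 1),
    ("SDSS", 10.6, 10.8, 0.6, 0.8),
    ("SDSS", 200, 201, -30, -29),
]

# User table rows: (name, ra, dec)
USER_TABLE = [("src1", 10.7, 0.7), ("src2", 10.2, 0.2), ("src3", 50.0, 50.0)]

def covers(entry, ra, dec):
    """True if the index entry's bounding box contains the point."""
    _, ra_min, ra_max, dec_min, dec_max = entry
    return ra_min <= ra <= ra_max and dec_min <= dec <= dec_max

def count_matched(index, table):
    """How many user locations have matching data anywhere in the index."""
    return sum(1 for _, ra, dec in table
               if any(covers(e, ra, dec) for e in index))

def matched_rows(index, table, dataset):
    """The user records that have at least one match in a given dataset."""
    return [row for row in table
            if any(e[0] == dataset and covers(e, row[1], row[2])
                   for e in index)]

n_matched = count_matched(INDEX, USER_TABLE)
sdss_rows = matched_rows(INDEX, USER_TABLE, "SDSS")
```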
The LSST interface augments the above with GUI functionality to do obvious
post-processing:

Given a region, present the results of the first search above as a
little table.

Selecting one of the datasets, retrieve and show the records using the
second query. This tabular display allows for smooth paging/scrolling
for large numbers of records and can also sort and apply SQL filtering
to any set of columns.

The table display also presents URLs as links, so data can be directly
retrieved. Missing from this prototype is a multi-file retrieval
mechanism, though the table itself supports multiple record selection.
The two principal omissions are the bulk download capability and program-friendly
access to the relational filtering, both of which would be easy to implement.
At the moment, ingest is handled manually; what we really need is a simple
harvesting mechanism (locations and schedules where data can be retrieved; a
mechanism for identifying "new" records in datasets that are growing).