VDFS Science Archives  Wide Field Astronomy Unit, IfA, University of Edinburgh

Status:
WFCAM Science Archive has been successfully
deployed and has been in operation for four years
http://surveys.roe.ac.uk/wsa
VISTA Science Archive prototype(s!) up and running
over the past year
http://surveys.roe.ac.uk/vsa
Comprehensive design documentation:
http://www.roe.ac.uk/~nch/wfcam
Detailed SW documentation (incl. source code):
http://www.roe.ac.uk/~rsc/wsa
This talk: overview and some interface issues
General archiving problems
Apart from the data volume problem, we have
• > 90% of time spent chasing < 10% of data
• Specification-creep
• “Schema evolution”
• Reprocessing and recalibration issues (versioning etc)
• Proprietary rights
• Integration into the (fledgling) VO
• Quality control issues
• The unforeseen …?
⇒ these present general operational challenges
Dealing with screwy data
For WFCAM, > 99.9% of data delivered by the DFS is
fine; but how do we deal with the data when things
go wrong?
• Robust coding
- clean exception handling
- persistent and verbose error logging
• Transactional support in the DBMS
- all-or-nothing “atomic” batches
- bespoke logging and UIDs for data modification
events
⇒ Data modifications are easy to roll backwards
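A minimal sketch of an all-or-nothing batch with a UID per modification event, using Python's built-in sqlite3 as a stand-in for the archive DBMS. The table names (Detection, ModificationLog), the column names and both functions are hypothetical illustrations, not the WSA/VSA implementation.

```python
import sqlite3
import uuid

def ingest_batch(conn, rows):
    """Insert all rows plus one log entry atomically; return the event UID."""
    event_uid = uuid.uuid4().hex
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute(
                "INSERT INTO ModificationLog (eventUID, nRows) VALUES (?, ?)",
                (event_uid, len(rows)),
            )
            conn.executemany(
                "INSERT INTO Detection (detID, ra, dec, eventUID) VALUES (?, ?, ?, ?)",
                [(det_id, ra, dec, event_uid) for det_id, ra, dec in rows],
            )
    except sqlite3.Error as exc:
        # Verbose error logging would go here; nothing from the batch remains.
        print(f"batch {event_uid} rolled back: {exc}")
        return None
    return event_uid

def roll_back_event(conn, event_uid):
    """Undo a previously committed batch by its event UID."""
    with conn:
        conn.execute("DELETE FROM Detection WHERE eventUID = ?", (event_uid,))
        conn.execute("DELETE FROM ModificationLog WHERE eventUID = ?", (event_uid,))

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ModificationLog (eventUID TEXT PRIMARY KEY, nRows INTEGER);
CREATE TABLE Detection (
    detID INTEGER PRIMARY KEY, ra REAL NOT NULL, dec REAL NOT NULL, eventUID TEXT);
""")

good = ingest_batch(conn, [(1, 10.0, -5.0), (2, 10.1, -5.1)])
bad = ingest_batch(conn, [(3, 10.2, -5.2), (3, 10.3, -5.3)])  # duplicate key fails
```

Because every row carries the UID of the event that wrote it, a committed batch can later be removed with a single call such as roll_back_event(conn, good).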
Dealing with specification creep
• Relational DBMS is the key to archive flexibility
- SQL interface for applications and users
• Adhere to standard self-describing formats for IO
- FITS
- VOTable
Dealing with schema evolution
For example, in its simplest form:
“Please can table X have an additional attribute Y …”
• SQL is not a dynamic programming language
but is fundamental to the archive design
⇒ Data ingress/egress applications are
schema driven:
• Parsing of static SQL scripts
• Set of tags to drive specific operations
• Automated schema validation
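A minimal sketch of a schema-driven application: the column list is read from a static SQL script, with comment tags carrying extra metadata. The tag syntax used here (--/D description, --/U unit, --/N default), the example table and the function names are assumptions for illustration only.

```python
import re

SCHEMA_SQL = """
CREATE TABLE Detection (
    detID   BIGINT NOT NULL,  --/D unique detection identifier
    ra      FLOAT  NOT NULL,  --/D right ascension --/U degrees
    dec     FLOAT  NOT NULL,  --/D declination --/U degrees
    jMag    REAL   NOT NULL   --/D J-band magnitude --/U mag --/N -9.9999
);
"""

def parse_schema(sql):
    """Return one dict per column: name, type, and any tags found on the line."""
    columns = []
    for line in sql.splitlines():
        definition, *tags = line.split("--/")
        match = re.match(r"\s*(\w+)\s+(\w+)", definition)
        if match is None or match.group(1).upper() in {"CREATE", "PRIMARY"}:
            continue  # not a column definition
        column = {"name": match.group(1), "type": match.group(2)}
        for tag in tags:
            key, _, value = tag.strip().partition(" ")
            column[key] = value.strip()
        columns.append(column)
    return columns

def validate(columns, row):
    """Check that a data row supplies exactly the attributes the schema declares."""
    return sorted(row) == sorted(c["name"] for c in columns)

columns = parse_schema(SCHEMA_SQL)
ok = validate(columns, {"detID": 1, "ra": 10.0, "dec": -5.0, "jMag": 15.2})
```

Under this scheme, adding attribute Y to table X means adding one line to the SQL script; the ingest and egress code picks up the change without modification.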
Dealing with versioning
RDBMS is used to track all metadata, including those
data pertaining to (re)processing and (re)calibration
• Processing versions and history
• Calibration coefficients
• Deprecation flag
⇒ The (relational) data model explicitly allows for
all of the above
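A minimal sketch of how a data model can carry processing version, calibration coefficient and deprecation flag side by side, again with sqlite3 as a stand-in. The table layout, the column names and the reason code are illustrative assumptions rather than the actual archive schema.

```python
import sqlite3

DEPRECATED_REPROCESSED = 1  # hypothetical reason code; 0 means "not deprecated"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Multiframe (
    multiframeID INTEGER PRIMARY KEY,
    fileName     TEXT NOT NULL,
    procVersion  TEXT NOT NULL,              -- processing version of this frame
    zeroPoint    REAL NOT NULL,              -- calibration coefficient
    deprecated   INTEGER NOT NULL DEFAULT 0  -- 0 = current, >0 = reason code
);
""")

def supersede(conn, old_id, file_name, proc_version, zero_point):
    """Add a reprocessed frame and deprecate the one it replaces, atomically."""
    with conn:
        cur = conn.execute(
            "INSERT INTO Multiframe (fileName, procVersion, zeroPoint) VALUES (?, ?, ?)",
            (file_name, proc_version, zero_point),
        )
        conn.execute(
            "UPDATE Multiframe SET deprecated = ? WHERE multiframeID = ?",
            (DEPRECATED_REPROCESSED, old_id),
        )
    return cur.lastrowid

with conn:
    first = conn.execute(
        "INSERT INTO Multiframe (fileName, procVersion, zeroPoint) VALUES (?, ?, ?)",
        ("frame_0001.fit", "v1.0", 22.80),
    ).lastrowid
second = supersede(conn, first, "frame_0001.fit", "v1.1", 22.83)

# Science queries see only current rows; the full history stays queryable.
current = conn.execute(
    "SELECT multiframeID, procVersion FROM Multiframe WHERE deprecated = 0"
).fetchall()
history = conn.execute(
    "SELECT procVersion, deprecated FROM Multiframe ORDER BY multiframeID"
).fetchall()
```

Nothing is deleted on reprocessing: old rows are flagged, so earlier results remain reproducible.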
Dealing with proprietary rights
• ESO (+ a few others) have 18 month proprietary
rights on WFCAM/UKIDSS data; survey teams have
12 months proprietary rights on VISTA survey data
• To gain access to data in the science archives
you need to be registered (non-proprietary data can
be accessed by anyone of course)
• Users are organised into “community” groups &
vetting of new users is delegated to the community
contacts
• AstroGrid community software has been
deployed to aid integration into the VO
Dealing with the VO
The UK (and international) VO projects are developing
at the same time as the VDFS SAs are being deployed
• Close contact maintained between development
teams
• VDFS was an “early adopter” of early releases of
AstroGrid software
• Standards are being adhered to as they become
sufficiently mature
- VOTable
- Authorisation/authentication (e.g. “communities”)
Dealing with Quality Control
Inevitably, survey consortia do not seriously consider
final science data quality control until a couple of
months prior to the first public releases.
Quality Control:
• is open-ended
• is intensively interactive
• requires great flexibility in data exploration
⇒ The WSA/VSA relational model and user interface
cope well with these requirements.
Dealing with …?
During the WSA/VSA deployment phases there have
been several cases of unforeseen issues arising.
The archive system is designed for
• flexibility
• scalability
• ease of maintenance
⇒ VDFS has reacted positively and effectively in
those situations.
Database design
A normalised relational design:
• logical arrangement of data
• data model is easily communicated
to users
• end users exposed to full flexibility
via SQL
User Interface
• Simple access modes for novice users
• Query builders to educate
users in the power of SQL
• Free-form SQL interface
exposes the flexibility of
the relational model to the
end user
• Comprehensive online documentation
- Schema browsers
- SQL Cookbook
- Release notes
http://surveys.roe.ac.uk/wsa
Browsing images:
More sophisticated usage mode: a science query
Fill in the SQL web form …
… results page presented in seconds …
… visualise at the push of a button ⇒ BD cluster candidates
• Facilities also provided to allow users to upload small lists of objects for crossmatching
(5x10^5 rows in ~10^3 sec)
• The real power of the survey datasets comes from cross‐querying in multi‐wavelength space
• Require implementations that scale to 10^9
rows (or more!)
Large catalogue databases at WFAU:
http://surveys.roe.ac.uk/ssa
20 (1.5) TB images (catalogues) 6.4x10^9 detections merged to 1.9x10^9 rows
http://surveys.roe.ac.uk/wsa
50 (1.3) TB (Data Release 7) 2.4x10^9 detections merged to 7.7x10^8 rows
http://surveys.roe.ac.uk/vsa
>100 (>10) TB; >10^10 rows
…
& local copies of 2MASS (4.7x10^8 rows); SDSS DRs (4.5x10^8 rows at DR7); …
Typical science usage involves merged multi-colour source list(s)
Consider two lists, each has ~N sources.
• sequential scan of one list is a linear O(N) process
• naïve pairing is quadratic: O(N^2) – not recommended!
• (binary) searching a sorted list is a logarithmic O(log N) process
• pairing (sequential scan of first and searching the second for each) is then log-linear: O(N log N)
… requires one list to be sorted:
• (quick)sorting one list is also an O(N log N) process
Even with efficient implementation, must pre-join lists
(generally cannot be done “on the fly”, or on demand, at the largest scale)
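A minimal one-dimensional sketch of the log-linear pairing described above: sort one list once, then scan the other and binary-search for each entry. Real sky positions need two coordinates; the function names and the 1-D tolerance are simplifying assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

def pair_sorted(first, second, tol):
    """Return (i, j) index pairs with |first[i] - second[j]| <= tol."""
    # Sort the second list once: O(N log N).
    order = sorted(range(len(second)), key=lambda j: second[j])
    keys = [second[j] for j in order]
    pairs = []
    for i, x in enumerate(first):          # sequential scan: O(N)
        lo = bisect_left(keys, x - tol)    # binary search: O(log N)
        hi = bisect_right(keys, x + tol)
        pairs.extend((i, order[k]) for k in range(lo, hi))
    return pairs

def pair_naive(first, second, tol):
    """Reference O(N^2) pairing, for checking small cases only."""
    return [(i, j) for i, x in enumerate(first)
            for j, y in enumerate(second) if abs(x - y) <= tol]

pairs = pair_sorted([0.0, 1.0, 5.0], [5.2, 0.1, 3.0], tol=0.5)
```

For 10^9 rows the difference is roughly 3x10^10 comparisons against 10^18, which is why the naive form is ruled out.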
Careful: lists may have
- different passband
- different epoch
- different resolution
- different positional accuracy
- different cataloguing SW
…
Between multi-passes within a survey: fields or frame sets
Single passband detections associated into merged sources
- Handshake pairing to minimise spurious associations
- Fields delimit the sizes of the master & slave lists being searched
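A minimal sketch of handshake pairing, read here as mutual nearest-neighbour matching: a pair is kept only if each detection is the other's nearest within the pairing radius. That reading, the brute-force search (acceptable only because a field keeps both lists small), the small-angle separation and all names are assumptions for illustration.

```python
import math

def separation_arcsec(p, q):
    """Small-angle separation between two (ra, dec) positions given in degrees."""
    dra = (p[0] - q[0]) * math.cos(math.radians(0.5 * (p[1] + q[1])))
    ddec = p[1] - q[1]
    return 3600.0 * math.hypot(dra, ddec)

def nearest(p, candidates, radius):
    """Index of the candidate closest to p within radius (arcsec), or None."""
    best, best_sep = None, radius
    for j, q in enumerate(candidates):
        sep = separation_arcsec(p, q)
        if sep <= best_sep:
            best, best_sep = j, sep
    return best

def handshake_pairs(master, slave, radius=1.0):
    """Keep (i, j) only when master[i] and slave[j] choose each other."""
    pairs = []
    for i, p in enumerate(master):
        j = nearest(p, slave, radius)
        if j is not None and nearest(slave[j], master, radius) == i:
            pairs.append((i, j))
    return pairs

# Two master detections compete for one slave detection; only the closer wins.
master = [(10.0, 0.0), (10.0002, 0.0)]
slave = [(10.00019, 0.0)]
pairs = handshake_pairs(master, slave, radius=1.0)
```

The sketch ignores RA wrap-around at 0/360 degrees, which a production version would need to handle.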
Generalised source association (Szalay & Gray): the neighbour table
Enables
- environment of each source to be examined
- unpaired detections to be investigated
-…
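A minimal sketch of how a neighbour table supports both uses listed above, with sqlite3 standing in for the archive DBMS. The table and column names (Source, SourceNeighbours, masterObjID, slaveObjID, distanceMins) and the three example sources are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Source (sourceID INTEGER PRIMARY KEY, ra REAL, dec REAL);
CREATE TABLE SourceNeighbours (
    masterObjID  INTEGER,
    slaveObjID   INTEGER,
    distanceMins REAL,      -- separation in arcminutes
    PRIMARY KEY (masterObjID, slaveObjID));
INSERT INTO Source VALUES (1, 10.0, 0.0), (2, 10.001, 0.0), (3, 50.0, 20.0);
INSERT INTO SourceNeighbours VALUES (1, 2, 0.06), (2, 1, 0.06);
""")

# Environment of each source: how many neighbours lie within the table radius.
environment = conn.execute("""
    SELECT s.sourceID, COUNT(n.slaveObjID)
    FROM Source AS s
    LEFT JOIN SourceNeighbours AS n ON n.masterObjID = s.sourceID
    GROUP BY s.sourceID
    ORDER BY s.sourceID
""").fetchall()

# Unpaired sources: those with no entry in the neighbour table.
unpaired = [source_id for source_id, count in environment if count == 0]
```

The expensive spatial search is done once when the table is built; user queries then reduce to ordinary joins.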
Between source lists from different surveys: cross-neighbour table
Example usage: optical/infrared galaxy catalogue
(bright red ellipticals; fainter, bluer
star forming and/or spirals; still
fainter, bluer dwarfs)
Another example: proper motions
between 2MASS & UKIDSS:
Further use of the neighbour table: merged source list / single passband neighbours (e.g. “synoptic” survey)
- See Cross et al., 2009, MNRAS, 399, 1730 for (many!) more details…
Implementation details for SSA / WSA / VSA:
Source merging: divide the problem into fields or frame sets
• “outgest” from the RDBMS
• pairing and merging in C/C++
• ingest results back into the RDBMS
Bulk cross-neighbour computation
• outgest two tables with UID, RA, Dec
• use a “plane sweep” algorithm in C/C++ for efficiency – see
Devereux et al., 2005, ASP Conf Ser. 347, 346 and also
http://www.ict.csiro.au/staff/robert.power/projects/CM/ps/cm.htm
(we use “IndexedActiveList” and “cross-match Filter” options)
Source code available at http://www.roe.ac.uk/~rsc/wsa
Also SQL neighbours: use Dec zones, sort on RA within the zones.
• Jim Gray’s relational implementation of a plane sweep like algorithm
• employs filter / refine optimisations
• enhanced for minimal transaction logging
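A minimal Python sketch of the zone idea: bucket one list into Dec zones, sort on RA within each zone, then for each source search only the adjacent zones inside an RA window (filter) before computing the exact separation (refine). It is not the SQL implementation referred to above; the function names, the zone height equal to the search radius and the small-angle RA window are assumptions, and RA wrap-around at 0/360 degrees is ignored.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict

def angular_sep(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def build_zones(sources, zone_height):
    """Bucket (uid, ra, dec) rows by Dec zone, sorted on RA within each zone."""
    zones = defaultdict(list)
    for uid, ra, dec in sources:
        zones[math.floor((dec + 90.0) / zone_height)].append((ra, dec, uid))
    for rows in zones.values():
        rows.sort()
    return zones

def zone_neighbours(sources_a, sources_b, radius):
    """Return (uid_a, uid_b, sep) for all pairs within radius degrees."""
    zones = build_zones(sources_b, radius)
    pairs = []
    for uid, ra, dec in sources_a:
        zone = math.floor((dec + 90.0) / radius)
        # RA window widens towards the poles (small-angle approximation).
        cos_dec = max(math.cos(math.radians(abs(dec) + radius)), 1e-9)
        half_width = radius / cos_dec
        for z in (zone - 1, zone, zone + 1):
            rows = zones.get(z, [])
            lo = bisect_left(rows, (ra - half_width,))
            hi = bisect_right(rows, (ra + half_width, math.inf))
            for ra_b, dec_b, uid_b in rows[lo:hi]:        # filter
                sep = angular_sep(ra, dec, ra_b, dec_b)   # refine
                if sep <= radius:
                    pairs.append((uid, uid_b, sep))
    return pairs

list_a = [(1, 10.0, 0.0), (2, 20.0, 45.0)]
list_b = [(11, 10.0005, 0.0003), (12, 20.0, 45.0009), (13, 30.0, 0.0)]
matches = zone_neighbours(list_a, list_b, radius=0.001)
```

Setting the zone height equal to the radius means only three zones ever need to be searched per source.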
Benchmarks for neighbour / cross-neighbour production:
UKIDSS LAS DR8:
lasSource (7.0x10^7 rows): 1.7x10^8 neighbours within 10” in 260s CPU
lasSourceXDR7PhotoObj (4.5x10^8 rows): 1.2x10^8 cross-neighbours in 1500s
lasSourceXtwomass_psc (4.7x10^8 rows): 1.3x10^7 cross-neighbours in 430s
(Quad core Intel Xeon, 2.2GHz with 12GB RAM)
UKIDSS GPS DR7:
gpsSource (6.7x10^8 rows): 1.1x10^8 neighbours within 1” in 2400s
gpsSourceXtwomass_psc : 8.0x10^7 cross-neighbours in 2800s
(Dual core AMD Opteron CPU, 2GHz with 6GB RAM)
(WFAU’s largest neighbour table is currently on SSA Source, with 3.9x10^9 rows)