VDFS Science Archives Wide Field Astronomy Unit, IfA, University of Edinburgh

Status:
The WFCAM Science Archive has been successfully
deployed and has been in operation for nearly two years:
http://surveys.roe.ac.uk/wsa
Comprehensive design documentation:
http://www.roe.ac.uk/~nch/wfcam
Detailed SW documentation (incl. source code):
http://www.roe.ac.uk/~rsc/wsa
This talk: lessons learnt from WFCAM, including User
Interface issues
General archiving problems
Apart from the data volume problem, we have
• > 90% of time spent chasing < 10% of data
• Specification creep
• “Schema evolution”
• Reprocessing and recalibration issues (versioning etc.)
• Proprietary rights
• Integration into the (fledgling) VO
• Quality control issues
• The unforeseen …?
⇒ these present general operational challenges
Dealing with screwy data
For WFCAM, > 99.9% of data delivered by the DFS is
fine; but how do we deal with the data when things
go wrong?
• Robust coding
- clean exception handling
- persistent and verbose error logging
• Transactional support in the DBMS
- all-or-nothing “atomic” batches
- bespoke logging and UIDs for data modification
events
⇒ Data modifications are easy to roll back
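The all-or-nothing batch idea can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the WSA ingest code; the table names (event_log, detection), the column names and the use of a UUID as the event identifier are all assumptions made for the example.

```python
import sqlite3
import uuid

def ingest_batch(conn, rows):
    """Insert every row in the batch or none; tag the batch with a unique event ID."""
    event_id = str(uuid.uuid4())  # hypothetical UID scheme for modification events
    try:
        with conn:  # commits on success, rolls back if any statement fails
            conn.execute(
                "INSERT INTO event_log (event_id, n_rows) VALUES (?, ?)",
                (event_id, len(rows)),
            )
            conn.executemany(
                "INSERT INTO detection (source_id, mag, event_id) VALUES (?, ?, ?)",
                [(sid, mag, event_id) for sid, mag in rows],
            )
    except sqlite3.Error as exc:
        # Verbose logging: record which batch failed and why.
        print(f"batch {event_id} rolled back: {exc}")
        return None
    return event_id

def rollback_event(conn, event_id):
    """Undo a previously committed batch by its event ID."""
    with conn:
        conn.execute("DELETE FROM detection WHERE event_id = ?", (event_id,))
        conn.execute("DELETE FROM event_log WHERE event_id = ?", (event_id,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (event_id TEXT PRIMARY KEY, n_rows INTEGER)")
conn.execute(
    "CREATE TABLE detection ("
    "source_id INTEGER PRIMARY KEY, mag REAL NOT NULL, event_id TEXT)"
)

good = ingest_batch(conn, [(1, 17.2), (2, 18.9)])
bad = ingest_batch(conn, [(3, 16.0), (4, None)])  # NULL magnitude violates NOT NULL
```

In this sketch the second batch fails on its last row, so its first row and its log entry are discarded as well. Because every stored row carries the event ID of the batch that wrote it, a committed batch can later be removed with rollback_event(conn, good).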
Dealing with specification creep
• Relational DBMS is the key to archive flexibility
- SQL interface for applications and users
• Adhere to standard self-describing formats for IO
- FITS
- VOTable
Dealing with schema evolution
For example, in its simplest form:
“Please can table X have an additional attribute Y …”
• SQL is not a dynamic programming language
but is fundamental to the archive design
⇒ Data ingress/egress applications are
schema driven:
• Parsing of static SQL scripts
• Set of tags to drive specific operations
• Automated schema validation
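A schema-driven application of this kind might look roughly like the sketch below, which parses a static SQL script, picks up comment tags, and checks the result against a live database. The tag syntax (--/D for a description, --/U for a unit), the table and the column names are assumptions for illustration only; sqlite3 stands in for the production DBMS.

```python
import re
import sqlite3

# Hypothetical schema script; the --/D and --/U tags are an assumed convention.
SCHEMA_SQL = """
CREATE TABLE Source (
    sourceID bigint NOT NULL, --/D unique source identifier
    ra float NOT NULL,        --/D right ascension --/U degrees
    jMag real NOT NULL        --/D J-band magnitude --/U mag
);
"""

# One tag letter, then its value up to the next tag or the end of the line.
TAG_RE = re.compile(r"--/(\w)\s+(.*?)(?=\s*--/|$)")

def parse_schema(sql):
    """Return one dict per column: name, SQL type, and any tags found."""
    columns = []
    in_table = False
    for line in sql.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("CREATE TABLE"):
            in_table = True
        elif stripped.startswith(")"):
            in_table = False
        elif in_table and stripped:
            definition, sep, rest = stripped.partition("--")
            name, sql_type = definition.replace(",", " ").split()[:2]
            tags = dict(TAG_RE.findall(sep + rest))
            columns.append({"name": name, "type": sql_type, "tags": tags})
    return columns

def validate(conn, table, columns):
    """Check that the live table has exactly the columns the script declares."""
    live = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return live == [c["name"] for c in columns]

columns = parse_schema(SCHEMA_SQL)
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_SQL)  # the same script also creates the table
schema_ok = validate(conn, "Source", columns)
```

The point of the design is that adding attribute Y to table X means editing only the SQL script: ingest and export code reads the column list from parse_schema rather than hard-coding it.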
Dealing with versioning
RDBMS is used to track all metadata, including those
data pertaining to (re)processing and (re)calibration
• Processing versions and history
• Calibration coefficients
• Deprecation flag
⇒ The (relational) data model explicitly allows for
all of the above
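As a rough illustration of how a deprecation flag and a processing version can live in the data model, the sketch below supersedes a frame after reprocessing without deleting the old row. The table (frame), its columns and the values are hypothetical and are not taken from the WSA schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE frame (
    frame_id     INTEGER PRIMARY KEY,
    field        TEXT    NOT NULL,
    proc_version INTEGER NOT NULL,           -- processing version
    zero_point   REAL    NOT NULL,           -- calibration coefficient
    deprecated   INTEGER NOT NULL DEFAULT 0  -- 0 = current, non-zero = superseded
);
INSERT INTO frame VALUES (1, 'A', 1, 22.90, 0);
INSERT INTO frame VALUES (2, 'B', 1, 22.85, 0);
""")

def supersede(conn, old_id, new_id, proc_version, zero_point):
    """Flag the old frame as deprecated and add its reprocessed replacement."""
    with conn:  # both changes are committed together or not at all
        field = conn.execute(
            "SELECT field FROM frame WHERE frame_id = ?", (old_id,)
        ).fetchone()[0]
        conn.execute("UPDATE frame SET deprecated = 1 WHERE frame_id = ?", (old_id,))
        conn.execute(
            "INSERT INTO frame VALUES (?, ?, ?, ?, 0)",
            (new_id, field, proc_version, zero_point),
        )

supersede(conn, old_id=1, new_id=3, proc_version=2, zero_point=22.93)

# Science queries see only current data; the history stays queryable.
current = conn.execute(
    "SELECT frame_id, field, proc_version FROM frame "
    "WHERE deprecated = 0 ORDER BY frame_id"
).fetchall()
history = conn.execute(
    "SELECT frame_id, proc_version, deprecated FROM frame "
    "WHERE field = 'A' ORDER BY frame_id"
).fetchall()
```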
Dealing with proprietary rights
• ESO (+ a few others) have 18-month proprietary
rights on UKIDSS data
• To gain access to UKIDSS data in the science archive
you need to be registered (non-proprietary data can
be accessed by anyone of course)
• Users are organised into “community” groups &
vetting of new users is delegated to the community
contacts
• AstroGrid community software has been
deployed to aid integration into the VO
• 862 WSA survey users in 94 different communities
Dealing with the VO
The UK (and international) VO projects are developing
at the same time as the VDFS SAs are being deployed
• Close contact maintained between development
teams
• VDFS was an “early adopter” of early releases of
AstroGrid software
• Standards are being adhered to as they become
sufficiently mature
- VOTable
- Authorisation/authentication (e.g. “communities”)
Dealing with Quality Control
Inevitably, the UKIDSS survey consortium did not
seriously consider final science data quality control
until a couple of months prior to the first public
release (the EDR).
Quality Control:
• is open-ended
• is intensively interactive
• requires great flexibility in data exploration
⇒ The WSA relational model and user interface
coped well with these requirements.
Dealing with …?
During the WSA deployment phase there have been
several cases of unforeseen issues arising.
The archive system is designed for
• flexibility
• scalability
• ease of maintenance
⇒ VDFS has reacted positively and effectively in
those situations.
Database design
A normalised relational design:
• logical arrangement of data
• data model is easily communicated
to users
• end users exposed to full flexibility
via SQL
• Same relational model for VSA
User Interface
• Simple access modes for novice users
• Query builders to educate
users in the power of SQL
• Free-form SQL interface
exposes the flexibility of
the relational model to the
end user
• Comprehensive online documentation
- Schema browsers
- SQL Cookbook
- Release notes
http://surveys.roe.ac.uk/wsa
Browsing images:
More sophisticated usage mode: a
science query
Fill in the SQL web form …
… results page presented in seconds …
… visualise at the push of a button ⇒ BD cluster candidates
User Interface issues
• On the whole, SQL & RDBMS seem to do rather well
… (see examples), BUT
• Special use case #1: GPS cluster finding
- ad-hoc bulk (100s GB) data export
- compact binary files, limited attributes
• Special use case #2: very rare object search
- Slightest contamination = big problem
- Complex decisions made when filtering
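For special use case #1, a compact binary export with a limited attribute set could be sketched as below. The record layout (right ascension and declination as 64-bit floats, magnitude as a 32-bit float, little-endian, no padding) and the sample values are assumptions for illustration; the talk does not specify the format actually used.

```python
import io
import struct

# Assumed fixed-width record: ra, dec as float64 and magnitude as float32.
RECORD = struct.Struct("<ddf")  # 20 bytes per source, no padding

def export_rows(rows, stream):
    """Write (ra, dec, mag) tuples to a binary stream; return the row count."""
    n = 0
    for ra, dec, mag in rows:
        stream.write(RECORD.pack(ra, dec, mag))
        n += 1
    return n

def read_rows(stream):
    """Read the records back as a list of (ra, dec, mag) tuples."""
    return list(RECORD.iter_unpack(stream.read()))

# Made-up sample rows; a real export would stream them from a database cursor.
rows = [(266.4168, -29.0078, 17.25), (266.5, -28.75, 18.5)]
buffer = io.BytesIO()  # stands in for a file opened in binary mode
n_written = export_rows(rows, buffer)
buffer.seek(0)
restored = read_rows(buffer)
```

At 20 bytes per source, a fixed-width layout like this is far smaller than a self-describing text format, at the cost of the consumer having to know the layout in advance.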