VDFS Science Archives
Wide Field Astronomy Unit, IfA, University of Edinburgh

Status: the WFCAM Science Archive has been successfully deployed and has been in operation for nearly two years: http://surveys.roe.ac.uk/wsa
Comprehensive design documentation: http://www.roe.ac.uk/~nch/wfcam
Detailed SW documentation (incl. source code): http://www.roe.ac.uk/~rsc/wsa
This talk: lessons learnt from WFCAM, including User Interface issues

General archiving problems
Apart from the data volume problem, we have
• > 90% of time spent chasing < 10% of data
• Specification creep
• “Schema evolution”
• Reprocessing and recalibration issues (versioning etc.)
• Proprietary rights
• Integration into the (fledgling) VO
• Quality control issues
• The unforeseen …?
⇒ these present general operational challenges

Dealing with screwy data
For WFCAM, > 99.9% of data delivered by the DFS is fine; but how do we deal with the data when things go wrong?
• Robust coding
  - clean exception handling
  - persistent and verbose error logging
• Transactional support in the DBMS
  - all-or-nothing “atomic” batches
  - bespoke logging and UIDs for data modification events
⇒ Data modifications are easy to roll back

Dealing with specification creep
• Relational DBMS is the key to archive flexibility
  - SQL interface for applications and users
• Adhere to standard self-describing formats for IO
  - FITS
  - VOTable

Dealing with schema evolution
For example, in its simplest form: “Please can table X have an additional attribute Y …”
• SQL is not a dynamic programming language but is fundamental to the archive design
⇒ Data ingress/egress applications are schema driven:
• Parsing of static SQL scripts
• Set of tags to drive specific operations
• Automated schema validation

Dealing with versioning
The RDBMS is used to track all metadata, including metadata pertaining to (re)processing and (re)calibration
• Processing versions and history
• Calibration coefficients
• Deprecation flag
⇒ The (relational) data model explicitly allows for all of the above

Dealing with proprietary rights
• ESO (+ a few others) have 18-month proprietary rights on UKIDSS data
• To gain access to UKIDSS data in the science archive you need to be registered (non-proprietary data can be accessed by anyone, of course)
• Users are organised into “community” groups & vetting of new users is delegated to the community contacts
• AstroGrid community software has been deployed to aid integration into the VO
• 862 WSA survey users in 94 different communities

Dealing with the VO
The UK (and international) VO projects are developing at the same time as the VDFS SAs are being deployed
• Close contact maintained between development teams
• VDFS was an “early adopter” of early releases of AstroGrid software
• Standards are being adhered to as they become sufficiently mature
  - VOTable
  - Authorisation/authentication (e.g. “communities”)

Dealing with Quality Control
Inevitably, the UKIDSS survey consortium did not seriously consider final science data quality control until a couple of months prior to the first public release (the EDR).
Quality Control:
• is open-ended
• is intensively interactive
• requires great flexibility in data exploration
⇒ The WSA relational model and user interface coped well with these requirements.

Dealing with …?
During the WSA deployment phase there have been several cases of unforeseen issues arising.
The archive system is designed for
• flexibility
• scalability
• ease of maintenance
⇒ VDFS has reacted positively and effectively in those situations.
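
The "all-or-nothing atomic batches" and "UIDs for data modification events" points from the slide on screwy data can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in DBMS; the table names (ingest_log, detection), column names and helper functions are hypothetical and are not taken from the WSA schema or the VDFS code base.

```python
import logging
import sqlite3
import uuid

logger = logging.getLogger("ingest_sketch")

# Hypothetical schema: every detection row carries the UID of the
# modification event (batch) that created it.
SCHEMA = """
CREATE TABLE ingest_log (
    event_uid TEXT PRIMARY KEY,
    n_rows    INTEGER NOT NULL
);
CREATE TABLE detection (
    event_uid TEXT NOT NULL REFERENCES ingest_log(event_uid),
    ra_deg    REAL NOT NULL CHECK (ra_deg >= 0 AND ra_deg < 360),
    dec_deg   REAL NOT NULL CHECK (dec_deg BETWEEN -90 AND 90)
);
"""


def make_archive():
    """Create an in-memory database holding the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def ingest_batch(conn, rows):
    """Load a batch of (ra_deg, dec_deg) rows as one atomic transaction.

    Returns the event UID on success, or None if the batch was rejected.
    """
    event_uid = uuid.uuid4().hex
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back every statement in it if an exception is raised.
        with conn:
            conn.execute(
                "INSERT INTO ingest_log (event_uid, n_rows) VALUES (?, ?)",
                (event_uid, len(rows)),
            )
            conn.executemany(
                "INSERT INTO detection (event_uid, ra_deg, dec_deg) "
                "VALUES (?, ?, ?)",
                [(event_uid, ra, dec) for ra, dec in rows],
            )
    except sqlite3.Error:
        # Verbose error logging: record the traceback and the event UID.
        logger.exception("batch %s rolled back", event_uid)
        return None
    return event_uid


def roll_back_event(conn, event_uid):
    """Undo a previously committed batch, identified by its event UID."""
    with conn:
        conn.execute("DELETE FROM detection WHERE event_uid = ?", (event_uid,))
        conn.execute("DELETE FROM ingest_log WHERE event_uid = ?", (event_uid,))


def count_detections(conn):
    """Return the number of rows currently in the detection table."""
    return conn.execute("SELECT COUNT(*) FROM detection").fetchone()[0]
```

In this sketch a batch containing a single out-of-range declination is rejected as a whole, leaving no partial rows behind, and a batch that loaded cleanly but later proves faulty can be removed by calling roll_back_event with the UID returned at ingest time.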

Database design
A normalised relational design:
• logical arrangement of data
• data model is easily communicated to users
• end users exposed to full flexibility via SQL
• Same relational model for VSA

User Interface
• Simple access modes for novice users
• Query builders to educate users in the power of SQL
• Free-form SQL interface exposes the flexibility of the relational model to the end user
• Comprehensive online documentation
  - Schema browsers
  - SQL Cookbook
  - Release notes
http://surveys.roe.ac.uk/wsa

Browsing images:

More sophisticated usage mode: a science query
Fill in the SQL web form …
… results page presented in seconds …
… visualise at the push of a button
⇒ BD cluster candidates

User Interface issues
• On the whole, SQL & RDBMS seem to do rather well … (see examples), BUT
• Special use case #1: GPS cluster finding
  - ad-hoc bulk (100s GB) data export
  - compact binary files, limited attributes
• Special use case #2: very rare object search
  - Slightest contamination = big problem
  - Complex decisions made when filtering
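
As a rough illustration of what a normalised relational design with a free-form SQL interface gives the end user, the sketch below joins a source table to its frame metadata and filters on a colour cut while excluding deprecated frames. It uses Python's built-in sqlite3 module as a stand-in; the table names (frame, source), the column names and the demo values are all hypothetical and do not reproduce the actual WSA schema or data.

```python
import sqlite3

# Hypothetical, heavily simplified two-table model: frame-level metadata
# (including a deprecation flag) is stored once and referenced by sources.
SCHEMA = """
CREATE TABLE frame (
    frame_id   INTEGER PRIMARY KEY,
    deprecated INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE source (
    source_id INTEGER PRIMARY KEY,
    frame_id  INTEGER NOT NULL REFERENCES frame(frame_id),
    j_mag     REAL NOT NULL,
    k_mag     REAL NOT NULL
);
"""

# The kind of query a user might type into a free-form SQL form:
# red sources (large J-K) drawn only from non-deprecated frames.
QUERY = """
SELECT s.source_id, s.j_mag - s.k_mag AS j_minus_k
FROM source AS s
JOIN frame AS f ON f.frame_id = s.frame_id
WHERE f.deprecated = 0
  AND s.j_mag - s.k_mag > ?
ORDER BY j_minus_k DESC
"""


def make_demo_archive():
    """Build an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO frame (frame_id, deprecated) VALUES (?, ?)",
        [(1, 0), (2, 1)],  # frame 2 is flagged as deprecated
    )
    conn.executemany(
        "INSERT INTO source (source_id, frame_id, j_mag, k_mag) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, 1, 18.0, 16.0),  # red, good frame
            (2, 1, 17.0, 16.8),  # blue, good frame
            (3, 2, 19.0, 16.0),  # red, but from a deprecated frame
        ],
    )
    conn.commit()
    return conn


def red_sources(conn, min_colour):
    """Return (source_id, J-K) for sources redder than min_colour."""
    return conn.execute(QUERY, (min_colour,)).fetchall()
```

Because the deprecation flag lives in the frame table rather than being copied into every source row, the same join pattern serves any selection that needs to exclude bad data; the colour threshold is passed as a query parameter so the cut can be varied without editing the SQL.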