VDFS Science Archives
Wide Field Astronomy Unit, IfA, University of Edinburgh

Status:
• The WFCAM Science Archive has been successfully deployed and in operation for four years: http://surveys.roe.ac.uk/wsa
• VISTA Science Archive prototype(s!) up and running over the past year: http://surveys.roe.ac.uk/vsa
• Comprehensive design documentation: http://www.roe.ac.uk/~nch/wfcam
• Detailed SW documentation (incl. source code): http://www.roe.ac.uk/~rsc/wsa

This talk: overview and some interface issues.

General archiving problems
Apart from the data-volume problem, we have:
• > 90% of time spent chasing < 10% of the data
• specification creep
• "schema evolution"
• reprocessing and recalibration issues (versioning etc.)
• proprietary rights
• integration into the (fledgling) VO
• quality-control issues
• the unforeseen…?
⇒ these present general operational challenges

Dealing with screwy data
For WFCAM, > 99.9% of the data delivered by the DFS is fine; but how do we deal with the data when things go wrong?
• Robust coding
  - clean exception handling
  - persistent and verbose error logging
• Transactional support in the DBMS
  - all-or-nothing "atomic" batches
  - bespoke logging and UIDs for data-modification events
⇒ Data modifications are easy to roll back

Dealing with specification creep
• A relational DBMS is the key to archive flexibility
  - SQL interface for applications and users
• Adherence to standard self-describing formats for I/O
  - FITS
  - VOTable

Dealing with schema evolution
For example, in its simplest form: "Please can table X have an additional attribute Y…"
• SQL is not a dynamic programming language, but it is fundamental to the archive design
⇒ Data ingress/egress applications are schema driven:
• parsing of static SQL scripts
• a set of tags to drive specific operations
• automated schema validation
(a sketch of such a schema change is given below)

Dealing with versioning
The RDBMS is used to track all metadata, including those data pertaining to (re)processing and (re)calibration:
• processing versions and history
• calibration coefficients
• a deprecation flag
⇒ The (relational) data model explicitly allows for all of the above
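As a concrete illustration of the schema-evolution request quoted above, here is a minimal, hedged sketch of a change-script fragment. The table and attribute names are hypothetical, as is the "null" default value; the real scripts and their tag set are described in the SW documentation.

    -- Hypothetical fragment of a static, version-controlled schema script:
    -- add attribute Y to table X. The ingest/egress applications parse
    -- scripts like this (and their embedded tags) rather than hard-coding
    -- the schema, so this one statement propagates through the system.
    ALTER TABLE X
        ADD Y real NOT NULL DEFAULT -9.999995e8;  -- illustrative "null" default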
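In the same spirit, a sketch of how the versioning metadata are used at query time. The attribute name and its convention (deprecated = 0 for current rows) are assumptions patterned on the design described above; the live schema should be checked in the online schema browsers.

    -- Select only current rows, ignoring measurements superseded by
    -- later (re)processing or (re)calibration runs.
    SELECT sourceID, ra, dec
    FROM Source                  -- hypothetical merged-source table
    WHERE deprecated = 0;        -- deprecation flag: 0 = current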
Dealing with proprietary rights
• ESO (plus a few others) have 18-month proprietary rights on WFCAM/UKIDSS data; survey teams have 12-month proprietary rights on VISTA survey data
• To gain access to proprietary data in the science archives you need to be registered (non-proprietary data can be accessed by anyone, of course)
• Users are organised into "community" groups, and the vetting of new users is delegated to the community contacts
• AstroGrid community software has been deployed to aid integration into the VO

Dealing with the VO
The UK (and international) VO projects are developing at the same time as the VDFS SAs are being deployed:
• close contact is maintained between the development teams
• VDFS was an "early adopter" of early releases of AstroGrid software
• standards are adhered to as they become sufficiently mature
  - VOTable
  - authorisation/authentication (e.g. "communities")

Dealing with quality control
Inevitably, survey consortia do not seriously consider final science-data quality control until a couple of months before the first public releases. Quality control:
• is open-ended
• is intensively interactive
• requires great flexibility in data exploration
⇒ The WSA/VSA relational model and user interface cope well with these requirements

Dealing with …?
During the WSA/VSA deployment phases, several unforeseen issues have arisen. The archive system is designed for:
• flexibility
• scalability
• ease of maintenance
⇒ VDFS has reacted positively and effectively in those situations

Database design
A normalised relational design gives:
• a logical arrangement of the data
• a data model that is easily communicated to users
• end users exposed to the full flexibility of the design via SQL

User interface
• Simple access modes for novice users
• Query builders to educate users in the power of SQL
• A free-form SQL interface exposes the flexibility of the relational model to the end user
• Comprehensive online documentation
  - schema browsers
  - SQL Cookbook
  - release notes
http://surveys.roe.ac.uk/wsa

Browsing images

A more sophisticated usage mode: a science query. Fill in the SQL web form … results page presented in seconds … visualise at the push of a button ⇒ brown-dwarf (BD) cluster candidates.
• Facilities are also provided for users to upload small lists of objects for cross-matching (5×10⁵ rows in ~10³ s)
• The real power of the survey datasets comes from cross-querying in multi-wavelength space
• This requires implementations that scale to 10⁹ rows (or more!)

Large catalogue databases at WFAU:
• http://surveys.roe.ac.uk/ssa : 20 TB of images (1.5 TB of catalogues); 6.4×10⁹ detections merged into 1.9×10⁹ rows
• http://surveys.roe.ac.uk/wsa : 50 TB of images (1.3 TB of catalogues) at Data Release 7; 2.4×10⁹ detections merged into 7.7×10⁸ rows
• http://surveys.roe.ac.uk/vsa : > 100 TB of images (> 10 TB of catalogues); > 10¹⁰ rows
• … plus local copies of 2MASS (4.7×10⁸ rows), the SDSS DRs (4.5×10⁸ rows at DR7), …

Typical science usage involves merged, multi-colour source list(s). Consider two lists, each with ~N sources:
• a sequential scan of one list is a linear, O(N), process
• naïve pairing is quadratic, O(N²) – not recommended!
• (binary) searching of a sorted list is a logarithmic, O(log N), process
• pairing (a sequential scan of the first list, searching the second for each entry) is then log-linear, O(N log N) … but requires one list to be sorted:
• (quick)sorting one list is also an O(N log N) process
(To see why this matters at survey scale: for N ~ 10⁹, N² is 10¹⁸ pair tests, while N log₂ N is only ~3×10¹⁰.)
Even with an efficient implementation, the lists must be pre-joined: this generally cannot be done "on the fly", or on demand, at the largest scales.
Careful: the lists may have
- different passbands
- different epochs …
- different resolutions
- different positional accuracies
- different cataloguing SW

Between multiple passes within a survey: fields or frame sets. Single-passband detections are associated into merged sources:
- handshake pairing to minimise spurious associations
- fields delimit the sizes of the master and slave lists being searched

Generalised source association (Szalay & Gray): the neighbour table. This enables
- the environment of each source to be examined (see the query sketch below)
- unpaired detections to be investigated
- …

Between source lists from different surveys: the cross-neighbour table (a join sketch also follows below).
Example usage: an optical/infrared galaxy catalogue (bright red ellipticals; fainter, bluer star-forming and/or spiral galaxies; still fainter, bluer dwarfs).
Another example: proper motions between 2MASS and UKIDSS.
A further use of the neighbour table: merged source list / single-passband neighbours (e.g. a "synoptic" survey).
- See Cross et al., MNRAS, 399, 1730 (2009) for (many!) more details…
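To illustrate the first of these neighbour-table uses, a minimal sketch of an environment query. The table and attribute names (masterObjID, slaveObjID, distanceMins) follow the Szalay & Gray neighbour-table pattern but are assumptions here, not references to the live schema.

    -- Count companions within 5 arcsec of every source, using a
    -- pre-computed neighbour table (assumed to exclude the trivial
    -- self-match and to store distances in arcminutes).
    SELECT n.masterObjID, COUNT(*) AS nCompanions
    FROM SourceNeighbours AS n
    WHERE n.distanceMins < 5.0/60.0
    GROUP BY n.masterObjID;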
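Similarly, a hedged sketch of the cross-survey case: pairing each infrared source with its nearest optical counterpart through a cross-neighbour table, as in the optical/infrared galaxy-catalogue example above. The table names echo those quoted in the benchmarks later in this talk, but the attribute details are assumptions.

    -- For each UKIDSS LAS source, keep only the nearest SDSS DR7
    -- counterpart within 2 arcsec; photometry from the two surveys can
    -- then be combined for colour selection (e.g. bright red ellipticals).
    SELECT s.sourceID, x.slaveObjID, x.distanceMins
    FROM lasSource AS s
    JOIN lasSourceXDR7PhotoObj AS x
      ON x.masterObjID = s.sourceID
    WHERE x.distanceMins < 2.0/60.0
      AND x.distanceMins = (SELECT MIN(x2.distanceMins)
                            FROM lasSourceXDR7PhotoObj AS x2
                            WHERE x2.masterObjID = s.sourceID);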
Implementation details for the SSA/WSA/VSA:

Source merging: divide the problem into fields or frame sets
• "outgest" from the RDBMS
• pairing and merging in C/C++
• ingest the results back into the RDBMS

Bulk cross-neighbour computation
• outgest the two tables as UID, RA, Dec
• use a "plane sweep" algorithm in C/C++ for efficiency – see Devereux et al., 2005, ASP Conf. Ser., 347, 346, and also http://www.ict.csiro.au/staff/robert.power/projects/CM/ps/cm.htm (we use the "IndexedActiveList" and "cross-match Filter" options)
Source code available at http://www.roe.ac.uk/~rsc/wsa

Neighbours can also be computed in SQL: use Dec zones, sorting on RA within the zones (a sketch is given at the end of this talk)
• Jim Gray's relational implementation of a plane-sweep-like algorithm
• employs filter/refine optimisations
• enhanced for minimal transaction logging

Benchmarks for neighbour/cross-neighbour production:

UKIDSS LAS DR8 (quad-core Intel Xeon, 2.2 GHz, 12 GB RAM):
• lasSource (7.0×10⁷ rows): 1.7×10⁸ neighbours within 10 arcsec in 260 s of CPU time
• lasSourceXDR7PhotoObj (4.5×10⁸ rows): 1.2×10⁸ cross-neighbours in 1500 s
• lasSourceXtwomass_psc (4.7×10⁸ rows): 1.3×10⁷ cross-neighbours in 430 s

UKIDSS GPS DR7 (dual-core AMD Opteron, 2 GHz, 6 GB RAM):
• gpsSource (6.7×10⁸ rows): 1.1×10⁸ neighbours within 1 arcsec in 2400 s
• gpsSourceXtwomass_psc: 8.0×10⁷ cross-neighbours in 2800 s

(WFAU's largest neighbour table is currently on the SSA Source table, at 3.9×10⁹ rows.)
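Finally, to make the Dec-zone technique concrete, a minimal, T-SQL-flavoured sketch of the filter/refine join, after Gray's zone algorithm. Everything here is an assumption for illustration: the table Obj, its attributes, and the pre-computation of zoneID = floor(dec / zone height), indexed together with ra; RA wrap-around at 0°/360° and the polar zones need special handling that is omitted.

    -- Zone-based neighbour finding: a coarse filter on zone and RA,
    -- followed by an exact great-circle (haversine) refinement.
    DECLARE @r float = 10.0/3600.0;   -- 10 arcsec search radius, in degrees
                                      -- (zone height assumed equal to @r)
    SELECT o1.objID AS masterObjID, o2.objID AS slaveObjID
    FROM Obj AS o1
    JOIN Obj AS o2
      ON  o2.zoneID BETWEEN o1.zoneID - 1 AND o1.zoneID + 1   -- zone filter
      AND o2.ra BETWEEN o1.ra - @r/COS(RADIANS(o1.dec))       -- RA filter
                    AND o1.ra + @r/COS(RADIANS(o1.dec))
    WHERE o1.objID <> o2.objID        -- each pair appears in both directions
      AND 2*DEGREES(ASIN(SQRT(        -- refine: exact angular separation
              POWER(SIN(RADIANS(o2.dec - o1.dec)/2), 2)
            + COS(RADIANS(o1.dec))*COS(RADIANS(o2.dec))
              *POWER(SIN(RADIANS(o2.ra - o1.ra)/2), 2)))) < @r;

With a clustered index on (zoneID, ra), the filter step reduces to ordered scans of three adjacent zones per object, which is the relational analogue of the plane sweep described above.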