WFCAM Science Archive
Nigel Hambly
Wide Field Astronomy Unit, Institute for Astronomy, University of Edinburgh
VO as a Data Grid, NeSC '03

Background & context
• Wide Field Astronomy:
  - large-scale public surveys
  - multi-colour, multi-epoch imaging data sets
• Developments over recent decades:
  - whole-sky Schmidt telescope surveys (e.g. SuperCOSMOS)
  - the current generation of optical/IR surveys, e.g. SDSS, WFCAM
  - the next generation, e.g. VISTA
These are prime examples of the key datasets that will be the cornerstone of the VO datagrid.

SuperCOSMOS scans photographic media:
• 10 Gbyte/day
• 3 colours: B, R & I
• 1 colour (R) at 2 epochs
• 0.7 arcsec/pixel
• 2 byte/pixel
• whole sky
• total data volume (pixels): ~15 Tbyte
• southern hemisphere completed in 2002 (northern hemisphere by end 2005)

WFCAM will image the sky directly using IR-sensitive detectors, deployed on a 4 m telescope (UKIRT):
• 100 Gbyte/night
• 5 colours: ZYJHK; some multi-epoch imaging
• 0.4 arcsec/pixel
• 4 byte/pixel
• ~10% sky coverage in selected areas (various depths)
• total data volume (pixels): ~100 Tbyte
• observations start in 2004; a 7-year programme is planned

VISTA (also 4 m) will have 4x as many IR detectors as WFCAM:
• 500 Gbyte/night
• 4 colours: zJHK
• targeted surveys (various depths & areas)
• 0.34 arcsec/pixel
• total data volume (pixels): ~0.5 Pbyte
• observations start at the end of 2006

Characteristics of astronomy DBs (I)
• pixel images are processed into lists of parameterised detections known as "catalogues" (parameterised data are typically <10% of the pixel data volume)
• detection association within survey data yields multi-colour, multi-epoch source records

Characteristics of astronomy DBs (II)
• a detailed (but relatively small) amount of descriptive data accompanies images and catalogues
• descriptive data and images must be tracked along with the catalogue data
• for current/future generation surveys, processing and ingest are dictated by observing patterns
• but users require well-defined, stable catalogue products on which to do their science
=> hence the periodic release of stable, well-defined, read-only catalogues is required

Typical usages (I)
• increasingly involve jointly querying different survey datasets held in different databases
  - example: stellar population discrimination using SDSS colours and SSA proper motions (Digby et al., astro-ph/0304056, MNRAS, in press)

Typical usages (II)
• position & proximity searches are very common
  - spatial indexing (2D, spherical geometry) is required
• statistical studies: ensemble characteristics of different species of source
• one-in-a-million searches for peculiar sources with highly detailed, specific properties
  - whole-table scans
• …?
=> enable flexible interrogation to inspire new, innovative usage and promote new science (e.g. the joint-query sketch below)
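To illustrate the flavour of such joint queries, here is a minimal sketch in SQL of a selection combining SSA proper motions with SDSS colours. The table and column names (Source, SdssPhotoObj, sdssID, muAcosD, muD, g, r) are hypothetical stand-ins, not the actual archive schema:

/* Sketch: stellar population candidates with large SSA proper motions
   and a given SDSS colour range (hypothetical schema). */
select s.sourceID, s.ra, s.dec,
       sqrt(s.muAcosD * s.muAcosD + s.muD * s.muD) as pm,  -- total proper motion (arcsec/yr)
       p.g - p.r as colour                                 -- SDSS colour
from   Source as s
join   SdssPhotoObj as p on p.objID = s.sdssID             -- assumed pre-computed association
where  sqrt(s.muAcosD * s.muAcosD + s.muD * s.muD) > 0.1   -- high proper motion
  and  p.g - p.r between 0.3 and 0.8                       -- colour cut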
Science archive development at WFAU:
• SSA: a few Tbytes
• WSA = 10x SSA
• VSA = 5x WSA
The approach is to set up a prototype archive system now (the SSA), expand and implement the WSA to coincide with WFCAM operations, and then scale up to the VSA.

Database design: key requirements (I)
Flexibility:
• ingested data are rich in structure
• daily ingest; daily/weekly/monthly curation
• many varied usage modes
• protect proprietary rights
• allow for changes/enhancements in design

Database design: key requirements (II)
Scalability:
• ~2 Tbytes of new data per year
• operating lifetime > 5 years
• maintain performance for increasing data volumes
Portability:
• V1.0/V2.0 phased approach to hardware/OS/DBMS

Database design: fundamentals (I)
• RDBMS, not OODBMS
• WSA V1.0: Windows/SQL Server ("SkyServer")
  - V2.0 may be the same, DB2, or Oracle
• image data stored as external flat files, not BLOBs
  - but image metadata are stored in the DBMS
• all attributes "not null", i.e. mandatory values
• archive curation information stored in the DBMS

Database design: fundamentals (II)
• calibration coefficients stored for astrometry & photometry
  - instrumental quantities stored (XY in pixels; flux in ADU)
  - calibrated quantities stored according to the current calibration
  - all previous coefficients and versioning stored

Database design: fundamentals (III)
• reruns: reprocessed image data
  - the same observations yield new source attribute values
  - re-ingest, but retain the old parameterisation
• repeats: better measurements of the same source
  - e.g. stacked image detections
  - again, retain the old parameterisation
• duplicates: same source & filter but different observations
  - e.g. overlap regions
  - store all data, and flag the "best"

Hardware design (I)
• separate servers for
  - pixels
  - catalogue curation
  - catalogue public access
  - web services
• different hardware solutions
  - mass storage on IDE with HW RAID5
  - high-bandwidth catalogue servers using SCSI and SW RAID

Hardware design (II)
• mass storage of pixels using low-cost IDE

Hardware design (III)
High-bandwidth catalogue server:
• dual P4 Xeon server
• independent PCI-X buses for maximum bandwidth
• dual-channel Ultra320 SCSI adapters

Hardware design (IV)
• individual Seagate 146 Gbyte disks sustain > 50 Mbyte/s sequential read
• Ultra320 saturates at ~200 Mbyte/s in one channel
• hence 4 disks per channel
• SW RAID striping across disks (following the SkyServer design of Gray, Szalay & colleagues)

The SuperCOSMOS Science Archive (SSA)
• the WFCAM Science Archive prototype
• the existing ad hoc flat-file archive (inflexible, restricted access) re-implemented in an RDBMS
• catalogue data only (no image pixel data)
• 1.3 Tbytes of catalogue data
• implements a working service for users & developers to exercise prior to the arrival of Tbytes of WFCAM data

The SSA has several similarities to the WSA:
• spatial indexing is required over the celestial sphere
• many source attributes in common, e.g. position, brightness, colour, shape, …
• multi-colour, multi-epoch detection information results from multiple measurements of the same source

Development method: the "20 queries" approach
• a set of real-world astronomical queries, expressed in SQL
• includes joint queries between the SSA and SDSS
Example:

/* Q14: Provide a list of stars with multiple epoch measurements,
   which have light variations > 0.5 mag. */
select objid
into   results
from   Source
where  (classR1 = 1 and classR2 = 1 and qualR1 < 128 and qualR2 < 128)
  and  abs(bestmagR1 - bestmagR2) > 0.5

SSA relational model:
• relatively simple
• catalogue records are ~256 bytes with mainly 4-byte attributes, i.e. 50 to 60 attributes per record
• so 2 tables dominate the DB:
  - Detection: 0.83 Tbyte
  - Source: 0.44 Tbyte

The SSA has been implemented & data are being ingested.

The WSA has significant differences, however:
• catalogue and pixel data;
• science-driven, nested survey programmes (as opposed to the SSA's "atlas" maps of the whole sky) result in a complex data structure;
• curation & update take place within the DBMS (whereas the SSA is a finished data product ingested once into the DBMS).

WFCAM Science Archive: relational design

Schematic picture of the WSA:
• pixels:
  - one flat-file image store; an access layer restricts public access
  - filenames and all metadata are tracked in DBMS tables with unrestricted access
• catalogues:
  - WFAU incremental DB (no public access)
  - public, released DBs
  - external survey datasets also held

Image metadata relational model
• Programme & Field entities => vital
• library calibration frames stored & related
• primary/extension HDU keys logically stored & related
• this will also work for VISTA

Astrometric and photometric calibration data:
• calibration information must be stored
• recalibration is required, especially photometric
• old calibration coefficients must be stored
• time-dependence (versioning) complicates the relational model
Calibration data are related to images; source detections are related to images and hence to their relevant calibration data.

Image calibration data:
• "set-ups" define nightly detector & filter combinations:
  - extinctions have nightly values
  - zeropoints have per-detector, nightly values
• coefficients are split into current & previous entities (see the sketch below)
• versioning & timing are recorded
• highly non-linear systematics are allowed for via 2D maps
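As a minimal sketch of the current/previous split, the following hypothetical tables version the photometric zeropoints; the names are illustrative only, and the real model also carries the extinction values, set-up metadata and 2D systematics maps:

-- Current coefficients: one row per detector per nightly set-up.
create table ZeroPointCurrent (
    setupID    bigint   not null,  -- nightly detector & filter set-up
    detectorID int      not null,
    zeroPoint  real     not null,  -- photometric zeropoint (mag)
    version    int      not null,
    startDate  datetime not null,  -- when this value became current
    primary key (setupID, detectorID)
)

-- Superseded coefficients are moved here rather than overwritten,
-- so that any previous calibration remains reproducible.
create table ZeroPointPrevious (
    setupID    bigint   not null,
    detectorID int      not null,
    zeroPoint  real     not null,
    version    int      not null,
    startDate  datetime not null,
    endDate    datetime not null,  -- when this version was superseded
    primary key (setupID, detectorID, version)
)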
Catalogue data: general model
• related back through the progenitor image to the calibration data
• a detection list for each programme (or set of sub-surveys)
• a merged source entity is maintained
• merge events are recorded
• list re-measurements are derived

Non-WFCAM data: general model
• each non-WFCAM survey has a stored catalogue (currently held locally)
• cross-neighbour table:
  - records nearby sources between any two surveys
  - yields the associated ("nearest") source

Example: UKIDSS LAS & its relationship to SDSS
• the UKIDSS LAS overlaps with SDSS
• list measurements:
  - at positions defined by the IR sources, but in the optical image data;
  - we do not currently envisage implementing this the other way round (i.e. optical source positions placed in IR image data)

Curation: a set of entities to track in-DBMS processing
• archived programmes have:
  - a required filter set
  - required join(s)
  - required list-driven measurement product(s)
  - release date(s)
  - a final curation task
  - one or more curation timestamps
• a set of curation procedures is defined for the archive

WFCAM Science Archive: V1.0 schema implementation

Implementation: unique identifiers (UIDs)
• meaningful UIDs, not arbitrary DBMS-assigned sequence numbers
• following the relational model, compound UIDs are built from appropriate attributes, e.g.
  - a detection UID is a combination of the sequence number on the detector and the detector UID
  - a detector UID is a combination of the extension number of the detector and the multiframe UID
• but: top-level UIDs are compounded into a new attribute to avoid copying many columns down the relational hierarchy, e.g.
  - the meaningful multiframe UID is made up from the UKIRT run number and the observation and ingest dates

Implementation: SQL Server database picture (I)
• Multiframe & nearest-neighbour tables

Implementation: SQL Server database picture (II)
• UKIDSS LAS & nearest-neighbour tables

Implementation: spatial index attributes
• Hierarchical Triangular Mesh (HTM) algorithm (courtesy of P. Kunszt, A. Szalay & colleagues)
• an HTM attribute, HTMID, for each occurrence of RA & Dec
• SkyServer functions & stored procedures: spHTM_Lookup, spHTM_Cover, spHTM_To_String, fHTM_Cover, etc. (e.g. the cone-search sketch below)
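A minimal cone-search sketch built on these routines, assuming (as in the SkyServer design) that fHTM_Cover is a table-valued function returning trixel ID ranges (HTMIDstart, HTMIDend) covering a region string; the Source table and its htmID column are stand-ins:

/* Sketch: all sources within 1 arcmin of (ra, dec) = (180.0, 0.0). */
select s.sourceID, s.ra, s.dec
from   fHTM_Cover('CIRCLE J2000 180.0 0.0 1.0') as c  -- coarse trixel ranges covering the cone
join   Source as s
  on   s.htmID between c.HTMIDstart and c.HTMIDend    -- index range scan, no table scan
where  sin(radians(0.0)) * sin(radians(s.dec))
     + cos(radians(0.0)) * cos(radians(s.dec)) * cos(radians(s.ra - 180.0))
    >= cos(radians(1.0 / 60.0))                       -- exact great-circle cut: the HTM
                                                      -- cover is a superset of the circle

Note the two-step pattern: the HTM ranges prune the search to a few index ranges, and the exact spherical-geometry test then discards sources in the covering trixels that fall outside the circle.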
Implementation: table indexing
• standard RDBMS practice: index tables on commonly used fields
• one "clustered" index per table, based on the primary key (the default)
  - results in a re-ordering of the data on disk
• further non-clustered indices:
  - when indexing on more than one field, put the fields in order of decreasing selectivity
  - the HTM index attribute is included as the most selective field in at least one non-clustered index on appropriate tables
  - index files are stored on different disk volumes to the tables, to help minimise disk "thrashing"
=> experimentation with real astronomical data and queries is required: the SSA prototype

User interface & Grid context (I)
• "traditional" interfaces (ftp/http), e.g. existing implementations:
  - WWW form interface
  - access via the CDS Aladin tool

User interface & Grid context (II)
• SQL form interfaces

User interface & Grid context (III)
• web services under development (XML/SOAP/VOTable)
• other data (e.g. SDSS, 2MASS, …) mirrored locally initially
• but the aspiration is ultimately to enable usages employing distributed resources (both data and CPU)
=> recast web services as Grid services to integrate the WSA into the VO Data Grid (e.g. the cross-survey join sketched below)
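As a closing illustration of the cross-survey usage such services are intended to expose, here is a minimal sketch of the nearest-neighbour association described in the cross-neighbour model above; the names (CrossNeighbours, masterObjID, slaveObjID, distanceMins) are hypothetical stand-ins for the real tables:

/* Sketch: the nearest SDSS counterpart for each WFCAM source. */
select s.sourceID, x.slaveObjID as sdssObjID, x.distanceMins
from   Source as s
join   CrossNeighbours as x
  on   x.masterObjID = s.sourceID
where  x.distanceMins = (select min(x2.distanceMins)      -- keep only the nearest
                         from   CrossNeighbours as x2     -- of the recorded neighbours
                         where  x2.masterObjID = s.sourceID)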