Data provenance in astronomy
Bob Mann
Wide-Field Astronomy Unit, University of Edinburgh (rgm@roe.ac.uk)

Outline
  Data and databases in astronomy
  Case Study: UK Infrared Deep Sky Survey
  Conclusions

Astronomers observe across the whole electromagnetic spectrum
  Galaxy images look different across the spectrum, due to:
    Inherent angular resolution of the telescope
    Different emission processes

Astronomical data: original form
  Different detector technologies used across the spectrum, yielding different types of data, e.g.
    Ultraviolet/optical/infrared: image (array of pixel values)
    X-ray: event list (positions, arrival times, energies of all detected photons)
    Radio: interferometric visibilities (sparse Fourier transform of a region of the sky)

Astronomical data: final form
  Most research done using catalogue data, i.e. tables of attributes of detected sources, mainly discrete sources (stars, galaxies, etc.)
  Data compression: catalogue is a few % of image data volume
  Amenable to representation in relational DB
  Natural indexing by location in sky
  ...but original data products (images, spectra, event lists) sometimes needed

Astronomical databases
  Telescope archives
    Heterogeneous collections of raw data files from all observations taken
    Download data for reduction and analysis
  Sky survey archives
    Homogeneous data and pipeline reduction
    “Science Archive”: do science on DB
  Bibliographic archives: scans of journals

Astronomical data processing
  Data reduction
    Remove instrumental signatures from raw data and produce “science-ready” data
    Software packages written for specific instruments
  Data analysis
    Derive scientific results from science-ready data products, e.g. statistical analyses
    Some astro-specific packages/environments, e.g. IRAF
    Some use of programming languages: Fortran, C/C++, Python, Java
    Some use of commercial packages, e.g.
    Interactive Data Language (IDL)

Outline
  Data and databases in astronomy
  Case Study: UKIDSS
    Introduction to UKIDSS
    Data life-cycle in UKIDSS
    Provenance in UKIDSS
  Conclusions

UK Infrared Deep Sky Survey
  Set of five infrared sky surveys
    Covering ~1/6 of the sky
    From large/shallow to very small/very deep
    See www.ukidss.org
  Observations: 2005-2012 using Wide Field Camera (WFCAM) on UK Infrared Telescope (UKIRT) in Hawaii

UKIDSS data life-cycle (1)
  Summit of Mauna Kea
    Data acquired from 4 WFCAM detectors
    Summit pipeline: instrument health
    Data written to LTO tape in NDF format
    Tapes couriered to Cambridge weekly
  Cambridge
    Raw data converted from NDF to FITS
    Data reduction pipeline run on nightly basis: ~100Gb/night
    Remove instrumental signatures, combine images, detect and classify objects, calibrate positions & fluxes

UKIDSS data life-cycle (2)
  Edinburgh
    Ingest data from Cambridge: catalogues into RDBMS; image metadata into RDBMS; images on disk
    Combine data from multiple nights: generate new catalogues from stacked images
    Prepare release databases for WFCAM Science Archive (WSA): see http://surveys.roe.ac.uk/wsa
  Users worldwide
    Extract raw images from Cambridge
    Extract image and catalogues in FITS files from Edinburgh
    Run queries on catalogues & image metadata in WSA

Provenance in UKIDSS
  Why is provenance important in UKIDSS?
  What provenance information is recorded?
  How will this be used? ...and by whom? ...and is this adequate?

Importance of provenance
  Much UKIDSS science is rare object search
  Objects with these colours would be very unusual, and possibly very interesting.
  Are they real? Need ability to trace back to reduced image within which object was detected, maybe back to raw image.
  [Figure axis: ratio of fluxes in H & K bands]
  [Figure axis: ratio of fluxes in J & H bands]

Structure of a FITS file
  Primary Header
  Primary Data Array
  Extensions: repeated (Header, Data) pairs
  Header: composed of 80-character ASCII records
  Data units can be images or tables

FITS header records
  Almost all records of the form
    KEYWORD = 'value' / COMMENT
  Some standard keywords defined, but considerable freedom to define new ones
    Relevant metadata for particular instruments
  Amongst standard set is HISTORY
    Format: HISTORY free text
  Provenance information can be stored in a series of HISTORY records

UKIDSS FITS files (1)
  Raw image files
    Primary header: telescope/instrument set-up, observing conditions, target, observational parameters
    Primary data array: empty
    Extensions: (header, data) pairs for each of four detectors: header has detector-specific metadata; data is compressed image
    Header keywords defined in Interface Control Document between Hawaii & Cambridge

UKIDSS FITS files (2)
  Reduced image files
    Primary header & data array: metadata propagated from raw data file
    Headers of extensions include HISTORY records for data reduction steps run at Cambridge, e.g.

      HISTORY 20060615 17:30:02
      HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $
      HISTORY 20060615 17:31:04
      HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $
      HISTORY 20060615 17:32:36
      HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $
      HISTORY 20060615 20:01:58
      HISTORY $Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $

    Slide annotations on these records: What, When, Who

UKIDSS FITS files (3)
  Catalogue files
    Primary header: metadata propagated from raw image
    Primary data array: empty
    Headers of extensions include metadata for catalogue generation process: invocations of software modules in HISTORY records, with parameter values in separate records
  Header keywords for both reduced images and catalogues are defined in an Interface Control Document between Cambridge & Edinburgh

User access to provenance info
  All header records from all FITS files
  ingested into WSA, except HISTORY records
  So, users can track provenance through queries against WSA, and can get HISTORY records by downloading files
  Hopefully enough to determine whether an unusual object is real, but is this good enough?

Recap: Astronomical data processing
  Data reduction
    Remove instrumental signatures from raw data and produce “science-ready” data
    Software packages written for specific instruments
  Data analysis
    Derive scientific results from science-ready data products, e.g. statistical analyses
    Some astro-specific packages/environments, e.g. IRAF
    Some use of programming languages: Fortran, C/C++, Python, Java
    Some use of commercial packages, e.g. Interactive Data Language (IDL)

Provenance in data analysis: Two main problems
  Less controlled software environment
    Little bits of code written for a specific analysis, not tried and tested pipeline modules
  Use of data from many sources
    UKIDSS/WSA is state-of-the-art for provenance
    Many (esp. older) data resources not so good
    Provenance of combined dataset only as good as provenance of worst constituent dataset?

Does this matter?
  Provenance information for data analysis is recorded in the journal paper (sort of)
    Improving links between online literature and data sources
  Increasing importance of large sky surveys with well controlled environments
    Moving more of the data analysis from the user’s desktop to the data centre

Conclusions
  Modern sky survey systems record & publish extensive provenance for data reduction
  Very little provenance recorded from data analysis, except description in journal paper
    More could surely be done, but would researchers support overhead of doing so?
    Improvements as more analysis in data centre
  Could/should we be doing more?
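
Illustration: reading reduction provenance from HISTORY records

Since the WSA does not ingest HISTORY records, a user who downloads a reduced image has to pull the reduction steps out of the FITS header themselves. The sketch below shows one way this might be done in plain Python, using the four example records from the "UKIDSS FITS files (2)" slide. It is not the Cambridge or WSA code. The names `card`, `history_text`, `reduction_steps`, `ReductionStep` and `SAMPLE_HEADER` are hypothetical, and the sample header assumes that each step writes its timestamp and its `$Id: ... $` string as two consecutive HISTORY records. The parser joins the HISTORY text before matching, so it would also cope with both parts sitting in one record.

```python
import re
from typing import List, NamedTuple

CARD_LEN = 80  # FITS headers are made of 80-character ASCII records


class ReductionStep(NamedTuple):
    """One pipeline step recovered from HISTORY records (hypothetical type)."""
    run_at: str    # "when": timestamp written by the pipeline
    module: str    # "what": source file of the module that ran
    revision: str  # "what": revision of that module
    author: str    # "who": user name in the $Id$ string


def card(text: str) -> str:
    """Pad (or truncate) text to one 80-character header record."""
    return text.ljust(CARD_LEN)[:CARD_LEN]


def history_text(header: str) -> List[str]:
    """Return the free text of every HISTORY record in a header block."""
    cards = [header[i:i + CARD_LEN] for i in range(0, len(header), CARD_LEN)]
    return [c[8:].strip() for c in cards if c.startswith("HISTORY")]


# Timestamp, followed by an RCS-style "$Id: file,v rev date time author state $".
STEP_RE = re.compile(
    r"(?P<run_at>\d{8} \d{2}:\d{2}:\d{2})\s+"
    r"\$Id: (?P<module>\S+),v (?P<revision>[\d.]+) "
    r"\S+ \S+ (?P<author>\S+) \S+ \$"
)


def reduction_steps(header: str) -> List[ReductionStep]:
    """Extract (when, what, who) for each step from the HISTORY records."""
    text = " ".join(history_text(header))
    return [ReductionStep(**m.groupdict()) for m in STEP_RE.finditer(text)]


# Sample extension header built from the records shown on the slide.
# The two-records-per-step layout is an assumption of this sketch.
SAMPLE_HEADER = "".join(card(line) for line in [
    "XTENSION= 'IMAGE   '",
    "HISTORY 20060615 17:30:02",
    "HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $",
    "HISTORY 20060615 17:31:04",
    "HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $",
    "HISTORY 20060615 17:32:36",
    "HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $",
    "HISTORY 20060615 20:01:58",
    "HISTORY $Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $",
    "END",
])

for step in reduction_steps(SAMPLE_HEADER):
    print(step.run_at, step.module, step.revision, step.author)
```

Because HISTORY is free text, a parser like this depends entirely on the pipeline keeping its record format stable. That dependence is one reason such records are harder to query than the keyword/value records ingested into the WSA.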