Data provenance in astronomy
Bob Mann
Wide-Field Astronomy Unit, University of Edinburgh (rgm@roe.ac.uk)

Outline
• Data and databases in astronomy
• Case Study: UK Infrared Deep Sky Survey
• Conclusions

Astronomers observe across the whole electromagnetic spectrum
• Galaxy images look different across the spectrum, due to:
  - Inherent angular resolution of the telescope
  - Different emission processes

Astronomical data: original form
• Different detector technologies are used across the spectrum, yielding different types of data, e.g.
  - Ultraviolet/optical/infrared: image (array of pixel values)
  - X-ray: event list (positions, arrival times and energies of all detected photons)
  - Radio: interferometric visibilities (sparse Fourier transform of a region of the sky)

Astronomical data: final form
• Most research is done using catalogue data, i.e. tables of attributes of detected sources, mainly discrete sources (stars, galaxies, etc.)
• Data compression: a catalogue is a few % of the image data volume
• Amenable to representation in a relational DB
• Natural indexing by location in sky
• ...but original data products (images, spectra, event lists) are sometimes needed

Astronomical databases
• Telescope archives
  - Heterogeneous collections of raw data files from all observations taken
  - Download data for reduction and analysis
• Sky survey archives
  - Homogeneous data and pipeline reduction
  - "Science Archive": do science on the DB
• Bibliographic archives: scans of journals

Astronomical data processing
• Data reduction
  - Remove instrumental signatures from raw data and produce "science-ready" data
  - Software packages written for specific instruments
• Data analysis
  - Derive scientific results from science-ready data products, e.g. statistical analyses
  - Some astro-specific packages/environments, e.g. IRAF
  - Some use of programming languages: Fortran, C/C++, Python, Java
  - Some use of commercial packages, e.g. Interactive Data Language (IDL)

Outline
• Data and databases in astronomy
• Case Study: UKIDSS
  - Introduction to UKIDSS
  - Data life-cycle in UKIDSS
  - Provenance in UKIDSS
• Conclusions

UK Infrared Deep Sky Survey
• Set of five infrared sky surveys
  - Covering ~1/6 of the sky
  - From large/shallow to very small/very deep
  - See www.ukidss.org
• Observations: 2005-2012 using the Wide Field Camera (WFCAM) on the UK Infrared Telescope (UKIRT) in Hawaii

UKIDSS data life-cycle (1)
• Summit of Mauna Kea
  - Data acquired from 4 WFCAM detectors
  - Summit pipeline: instrument health
  - Data written to LTO tape in NDF format
  - Tapes couriered to Cambridge weekly
• Cambridge
  - Raw data converted from NDF to FITS
  - Data reduction pipeline run on a nightly basis: ~100Gb/night
  - Remove instrumental signatures, combine images, detect and classify objects, calibrate positions & fluxes

UKIDSS data life-cycle (2)
• Edinburgh
  - Ingest data from Cambridge: catalogues into RDBMS; image metadata into RDBMS; images on disk
  - Combine data from multiple nights: generate new catalogues from stacked images
  - Prepare release databases for the WFCAM Science Archive (WSA): see http://surveys.roe.ac.uk/wsa
• Users worldwide
  - Extract raw images from Cambridge
  - Extract images and catalogues in FITS files from Edinburgh
  - Run queries on catalogues & image metadata in the WSA

Provenance in UKIDSS
• Why is provenance important in UKIDSS?
• What provenance information is recorded?
• How will this be used, and by whom?
• ...and is this adequate?

Importance of provenance
• Much UKIDSS science is rare object search
• Objects with these colours would be very unusual, and possibly very interesting.
• Are they real? Need the ability to trace back to the reduced image within which the object was detected, and maybe back to the raw image.
[Figure: plot with axes "Ratio of fluxes in H & K bands" and "Ratio of fluxes in J & H bands"]

Structure of a FITS file
• Primary header
• Primary data array
• Extensions: a sequence of (header, data) pairs
• Header: composed of 80-character ASCII records
• Data units can be images or tables

FITS header records
• Almost all records are of the form
    KEYWORD = 'value' / COMMENT
• Some standard keywords are defined, but there is considerable freedom to define new ones
  - Relevant metadata for particular instruments
• Amongst the standard set is HISTORY
  - Format: HISTORY free text
  - Provenance information can be stored in a series of HISTORY records

UKIDSS FITS files (1)
• Raw image files
  - Primary header: telescope/instrument set-up, observing conditions, target, observational parameters
  - Primary data array: empty
  - Extensions: (header, data) pairs for each of the four detectors: header has detector-specific metadata; data is the compressed image
• Header keywords are defined in an Interface Control Document between Hawaii & Cambridge

UKIDSS FITS files (2)
• Reduced image files
  - Primary header & data array: metadata propagated from the raw data file
  - Headers of extensions include HISTORY records for the data reduction steps run at Cambridge, e.g.
      HISTORY 20060615 17:30:02
      HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $
      HISTORY 20060615 17:31:04
      HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $
      HISTORY 20060615 17:32:36
      HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $
      HISTORY 20060615 20:01:58
      HISTORY $Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $
  - Slide annotations on these records: What, When, Who

UKIDSS FITS files (3)
• Catalogue files
  - Primary header: metadata propagated from the raw image
  - Primary data array: empty
  - Headers of extensions include metadata for the catalogue generation process: invocations of software modules in HISTORY records, with parameter values in separate records
• Header keywords for both reduced images and catalogues are defined in an Interface Control Document between Cambridge & Edinburgh

User access to provenance info
• All header records from all FITS files are ingested into the WSA, except HISTORY records
• So, users can track provenance through queries against the WSA, and can get HISTORY records by downloading files
• Hopefully enough to determine whether an unusual object is real, but is this good enough?

Recap: Astronomical data processing
• Data reduction
  - Remove instrumental signatures from raw data and produce "science-ready" data
  - Software packages written for specific instruments
• Data analysis
  - Derive scientific results from science-ready data products, e.g. statistical analyses
  - Some astro-specific packages/environments, e.g. IRAF
  - Some use of programming languages: Fortran, C/C++, Python, Java
  - Some use of commercial packages, e.g. Interactive Data Language (IDL)

Provenance in data analysis: Two main problems
• Less controlled software environment
  - Little bits of code written for a specific analysis, not tried and tested pipeline modules
• Use of data from many sources
  - UKIDSS/WSA is state-of-the-art for provenance
  - Many (esp. older) data resources are not so good
  - Provenance of a combined dataset only as good as the provenance of its worst constituent dataset?

Does this matter?
• Provenance information for data analysis is recorded in the journal paper (sort of)
• Improving links between online literature and data sources
• Increasing importance of large sky surveys with well controlled environments
• Moving more of the data analysis from the user's desktop to the data centre

Conclusions
• Modern sky survey systems record & publish extensive provenance for data reduction
• Very little provenance is recorded from data analysis, except the description in the journal paper
• More could surely be done, but would researchers support the overhead of doing so?
• Improvements as more analysis moves to the data centre
• Could/should we be doing more?
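
Illustrative note: the HISTORY-based provenance described in the "UKIDSS FITS files (2)" slide can be read back from a downloaded file with very little code. The sketch below is a minimal, standard-library-only illustration and not the UKIDSS or WSA software. It assumes that each reduction step is recorded as a run timestamp (YYYYMMDD HH:MM:SS) followed by an RCS `$Id$` string, either in the same HISTORY record or in the next one. The function names (`history_texts`, `reduction_steps`, `card`) and the dictionary keys are hypothetical, and the sample records are copied from the slide. In practice the header would normally be read with a FITS library such as astropy rather than sliced by hand.

```python
import re
from datetime import datetime

CARD_LEN = 80  # FITS header records are 80 ASCII characters long

# Run timestamp as it appears in the slide's example: YYYYMMDD HH:MM:SS
RUN_STAMP = re.compile(r"(\d{8} \d{2}:\d{2}:\d{2})")

# RCS "$Id$" keyword: file,v revision date time author state
RCS_ID = re.compile(
    r"\$Id: (?P<file>\S+),v (?P<rev>\S+) "
    r"(?P<date>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<author>\S+) (?P<state>\S+) \$"
)


def history_texts(header_block):
    """Yield the free text of each HISTORY record in a raw header block."""
    for start in range(0, len(header_block), CARD_LEN):
        record = header_block[start:start + CARD_LEN]
        if record.rstrip() == "END":  # END marks the end of the header
            break
        if record[:8] == "HISTORY ":  # keyword fills columns 1-8
            yield record[8:].rstrip()


def reduction_steps(header_block):
    """Pair each run timestamp with the $Id$ string that follows it."""
    steps = []
    pending_run = None  # most recent timestamp not yet tied to a module
    for text in history_texts(header_block):
        stamp = RUN_STAMP.match(text)
        if stamp:
            pending_run = datetime.strptime(stamp.group(1), "%Y%m%d %H:%M:%S")
        ident = RCS_ID.search(text)
        if ident:
            steps.append({
                "run_at": pending_run,             # when the step ran
                "module": ident.group("file"),     # what was run
                "revision": ident.group("rev"),
                "checked_in": datetime.strptime(
                    ident.group("date"), "%Y/%m/%d %H:%M:%S"),
                "author": ident.group("author"),   # who committed the code
            })
            pending_run = None
    return steps


def card(text):
    """Pad a record to the fixed 80-character width (helper for the demo)."""
    return text.ljust(CARD_LEN)[:CARD_LEN]


# Two of the reduction steps shown on the slide, laid out as header records.
example = "".join(card(c) for c in [
    "HISTORY 20060615 17:30:02",
    "HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $",
    "HISTORY 20060615 17:31:04",
    "HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $",
    "END",
])

for step in reduction_steps(example):
    print(step["run_at"], step["module"], step["revision"], step["author"])
```

Keeping the run time separate from the check-in date of the source file matters for tracing a suspect detection: the first says when the image was processed, the second identifies exactly which version of the module processed it.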
