Data provenance in astronomy
Bob Mann
Wide-Field Astronomy Unit
University of Edinburgh
(rgm@roe.ac.uk)
Outline
 Data and databases in astronomy
 Case Study: UK Infrared Deep Sky Survey
 Conclusions
2/24
Astronomers observe across the
whole electromagnetic spectrum
 Galaxy images look different across spectrum, due to:
 Inherent angular resolution of the telescope
 Different emission processes
4/24
Astronomical data: original form
 Different detector technologies used across
the spectrum, yielding different types of
data: e.g.
 Ultraviolet/optical/infrared
 Image: array of pixel values
 X-ray
 Event list: positions, arrival times, energies of all
detected photons
 Radio
 Interferometric visibilities: sparse Fourier
transform of a region of the sky
5/24
Astronomical data: final form
 Most research done using catalogue data
 i.e. tables of attributes of detected sources –
mainly discrete sources (stars, galaxies, etc)
 Data compression
 Catalogue – few % of image data volume
 Amenable to representation in relational DB
 Natural indexing by location in sky
 …but original data products (images,
spectra, event lists) sometimes needed
6/24
Astronomical databases
 Telescope archives
 Heterogeneous collections of raw data files
from all observations taken
 Download data for reduction and analysis
 Sky survey archives
 Homogeneous data and pipeline reduction
 “Science Archive” – do science on DB
 Bibliographic archives – scans of journals
7/24
Astronomical data processing
 Data reduction
 Remove instrumental signatures from raw data
and produce “science-ready” data
 Software packages written for specific instruments
 Data analysis
 Derive scientific results from science-ready data
products – e.g. statistical analyses
 Some astro-specific packages/environments – e.g. IRAF
 Some use of programming languages
 Fortran, C/C++, Python, Java
 Some use of commercial packages
 e.g. Interactive Data Language (IDL)
8/24
Outline
 Data and databases in astronomy
 Case Study: UKIDSS
 Introduction to UKIDSS
 Data life-cycle in UKIDSS
 Provenance in UKIDSS
 Conclusions
9/24
UK Infrared Deep Sky Survey
 Set of five infrared sky surveys
 Covering ~1/6 of the sky
 From large/shallow to
very small/very deep
 See www.ukidss.org
 Observations: 2005-2012 using Wide Field
Camera (WFCAM) on UK Infrared Telescope
(UKIRT) in Hawaii
10/24
UKIDSS data life-cycle (1)
 Summit of Mauna Kea
 Data acquired from 4 WFCAM detectors
 Summit pipeline: instrument health
 Data written to LTO tape in NDF format
 Tapes couriered to Cambridge weekly
 Cambridge
 Raw data converted from NDF to FITS
 Data reduction pipeline run on nightly basis: ~100Gb/night
 Remove instrumental signatures, combine images,
detect and classify objects, calibrate positions & fluxes
11/24
UKIDSS data life-cycle (2)
 Edinburgh
 Ingest data from Cambridge:
catalogues into RDBMS; image
metadata into RDBMS; images on disk
 Combine data from multiple nights: generate new
catalogues from stacked images
 Prepare release databases for WFCAM Science Archive
(WSA): see http://surveys.roe.ac.uk/wsa
 Users worldwide
 Extract raw images from Cambridge
 Extract image and catalogues in FITS files from Edinburgh
 Run queries on catalogues & image metadata in WSA
12/24
Provenance in UKIDSS
 Why is provenance important in UKIDSS?
 What provenance information is recorded?
 How will this be used?...and by whom?
 …and is this adequate?
13/24
Importance of provenance
 Much UKIDSS science is rare object search
[Figure: colour-colour diagram of detected sources; axes are the ratio of fluxes in the J & H bands and the ratio of fluxes in the H & K bands.]
Figure annotation: objects with these colours would be very unusual, and possibly very interesting.
Are they real? Need ability to trace back to the reduced image within which the object was detected, and maybe back to the raw image.
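As a rough illustration of the rare object search described above, the sketch below converts flux ratios into colour indices and flags sources that fall outside a box in the colour-colour plane. The fluxes, the box limits and the field names are invented for the example, and the fluxes are assumed to share a common zero point; a real search would use calibrated catalogue magnitudes and a scientifically motivated selection.

```python
import math

def colour(flux_a, flux_b):
    """Colour index from a flux ratio: m_a - m_b = -2.5 log10(F_a / F_b)."""
    return -2.5 * math.log10(flux_a / flux_b)

def flag_unusual(sources, jh_range, hk_range):
    """Return IDs of sources whose (J-H, H-K) colours lie outside the given box."""
    flagged = []
    for src in sources:
        jh = colour(src["flux_j"], src["flux_h"])
        hk = colour(src["flux_h"], src["flux_k"])
        inside = (jh_range[0] <= jh <= jh_range[1]
                  and hk_range[0] <= hk <= hk_range[1])
        if not inside:
            flagged.append(src["id"])
    return flagged

# Hypothetical sources: the second is far brighter in K than in J.
sources = [
    {"id": 1, "flux_j": 100.0, "flux_h": 120.0, "flux_k": 110.0},
    {"id": 2, "flux_j": 5.0, "flux_h": 100.0, "flux_k": 400.0},
]

# Hypothetical box enclosing "ordinary" colours.
unusual = flag_unusual(sources, jh_range=(-0.5, 1.0), hk_range=(-0.5, 1.0))
print(unusual)
```

Each flagged ID is only a candidate: deciding whether it is real is where the trace back to the reduced and raw images comes in.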
14/24
Structure of a FITS file
 Primary header
 Primary data array
 Extensions: each a (header, data) pair
 Header: composed of 80-character ASCII records
 Data units can be images or tables
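The layout above can be walked with very little code. The following is a minimal sketch, not a full FITS reader: it builds a tiny in-memory file (empty primary data array plus one image extension, echoing the UKIDSS raw-file layout) and steps through its header/data units. It relies on two facts from the FITS standard, namely that headers and data are padded to 2880-byte blocks and that each header ends with an END record. The helper names (`card`, `make_header`, `walk`) are made up for this example, and value formatting is simplified.

```python
BLOCK = 2880   # FITS headers and data units are padded to 2880-byte blocks
CARD = 80      # each header record is 80 ASCII characters

def card(key, value):
    """Format one record (simplified: real FITS left-aligns string values)."""
    return f"{key:<8}= {value:>20}"

def make_header(cards):
    """Join records, append END, and pad with spaces to whole blocks."""
    text = "".join(c.ljust(CARD) for c in cards + ["END"])
    padded_len = -(-len(text) // BLOCK) * BLOCK
    return text.ljust(padded_len).encode("ascii")

def read_header(buf, offset):
    """Read records from offset until END; return (raw values, next offset)."""
    values = {}
    while True:
        block = buf[offset:offset + BLOCK].decode("ascii")
        if len(block) < BLOCK:
            raise ValueError("header not terminated by END")
        offset += BLOCK
        for i in range(0, BLOCK, CARD):
            rec = block[i:i + CARD]
            key = rec[:8].strip()
            if key == "END":
                return values, offset
            if rec[8:10] == "= ":
                values[key] = rec[10:].split("/")[0].strip()

def data_size(h):
    """Bytes occupied by the data unit, including padding to whole blocks."""
    naxis = int(h["NAXIS"])
    if naxis == 0:
        return 0
    n = 1
    for k in range(1, naxis + 1):
        n *= int(h[f"NAXIS{k}"])
    nbytes = (abs(int(h["BITPIX"])) // 8
              * int(h.get("GCOUNT", 1)) * (int(h.get("PCOUNT", 0)) + n))
    return -(-nbytes // BLOCK) * BLOCK

def walk(buf):
    """Return (unit type, data bytes) for each header/data unit in buf."""
    units, offset = [], 0
    while offset < len(buf):
        h, offset = read_header(buf, offset)
        size = data_size(h)
        units.append((h.get("XTENSION", "PRIMARY").strip("' "), size))
        offset += size
    return units

primary = make_header([card("SIMPLE", "T"), card("BITPIX", "16"),
                       card("NAXIS", "0"), card("EXTEND", "T")])
extension = make_header([card("XTENSION", "'IMAGE   '"), card("BITPIX", "16"),
                         card("NAXIS", "2"), card("NAXIS1", "4"),
                         card("NAXIS2", "4"), card("PCOUNT", "0"),
                         card("GCOUNT", "1")])
pixels = bytes(4 * 4 * 2).ljust(BLOCK, b"\0")   # 16 two-byte pixels, padded
buf = primary + extension + pixels
units = walk(buf)
print(units)
```

In practice an established FITS library would be used instead; the point here is only how the header records determine where each data unit starts and ends.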
15/24
FITS header records
 Almost all records of the form
KEYWORD = 'value' / COMMENT
 Some standard keywords defined, but
considerable freedom to define new ones
 Relevant metadata for particular instruments
 Amongst standard set is HISTORY
 Format: HISTORY free text
 Provenance information can be stored in a
series of HISTORY records
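To make the record format concrete, here is a small sketch of splitting one 80-character record into keyword, value and comment. It assumes the fixed layout of the FITS standard (keyword in columns 1-8, "= " in columns 9-10 for valued keywords, free text after HISTORY) and leaves values as text. The example records and their contents are invented for illustration; they are not taken from UKIDSS files.

```python
def parse_card(card):
    """Split one FITS header record into (keyword, value, comment).

    Simplified sketch: values are returned as text, without type conversion.
    """
    card = card.ljust(80)[:80]          # records are fixed at 80 characters
    keyword = card[:8].strip()          # keyword occupies columns 1-8
    if keyword in ("HISTORY", "COMMENT", ""):
        return keyword, card[8:].strip(), None    # free text, no value
    if card[8:10] != "= ":              # no value indicator in columns 9-10
        return keyword, None, card[8:].strip() or None
    body = card[10:]
    if body.lstrip().startswith("'"):
        # Quoted string: scan to the closing quote.
        i = body.index("'") + 1
        chars = []
        while i < len(body):
            if body[i] == "'":
                if body[i + 1:i + 2] == "'":      # '' is an escaped quote
                    chars.append("'")
                    i += 2
                    continue
                break
            chars.append(body[i])
            i += 1
        value = "".join(chars).rstrip()
        tail = body[i + 1:].partition("/")[2]
    else:
        head, _, tail = body.partition("/")
        value = head.strip()
    return keyword, value, tail.strip() or None

# Hypothetical example records.
records = [
    "FILTER  = 'K       '           / filter name",
    "EXPTIME =                 10.0 / [s] exposure time",
    "HISTORY 20060615 17:30:02",
]
parsed = [parse_card(r) for r in records]
for item in parsed:
    print(item)
```

HISTORY records come back with the whole free-text field as the value, which is why provenance stored this way needs a further, pipeline-specific parsing step.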
16/24
UKIDSS FITS files (1)
 Raw image files
 Primary header: telescope/instrument
set-up, observing conditions, target,
observational parameters
 Primary data array: empty
 Extensions: (header,data) pairs for each of four
detectors: header has detector-specific metadata;
data is compressed image
 Header keywords defined in Interface Control
Document between Hawaii & Cambridge
17/24
UKIDSS FITS files (2)
 Reduced image files
 Primary header & data array: metadata
propagated from raw data file
 Headers of extensions include HISTORY
records for data reduction steps run at Cambridge,
e.g.
HISTORY 20060615 17:30:02
HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $
HISTORY 20060615 17:31:04
HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $
HISTORY 20060615 17:32:36
HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $
HISTORY 20060615 20:01:58
HISTORY $Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $
Slide annotations on these records: What, When, Who
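One way to turn HISTORY records like those above into structured provenance is sketched below. It assumes the records alternate between a run timestamp and a version-control `$Id$` string, which is how the slide appears to lay them out; the pairing rule, the regular expression and the dictionary keys are assumptions of this sketch, not a documented pipeline format. Note that the "who" recovered here is the author named in the `$Id$` string, which is not necessarily the person who ran the step.

```python
import re

# Matches strings such as: $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $
ID_PATTERN = re.compile(
    r"\$Id: (?P<module>\S+),v (?P<version>\S+) "
    r"(?P<date>\S+) (?P<time>\S+) (?P<author>\S+) "
)

def pair_history(records):
    """Pair each timestamp record with the $Id$ record that follows it."""
    texts = [r[len("HISTORY"):].strip()
             for r in records if r.startswith("HISTORY")]
    steps = []
    for when, ident in zip(texts[0::2], texts[1::2]):
        match = ID_PATTERN.search(ident)
        if match:
            steps.append({
                "when": when,                  # time the step was run
                "what": match["module"],       # source file of the module
                "version": match["version"],   # revision of that module
                "who": match["author"],        # author in the $Id$ string
            })
    return steps

history = [
    "HISTORY 20060615 17:30:02",
    "HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $",
    "HISTORY 20060615 17:31:04",
    "HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $",
    "HISTORY 20060615 17:32:36",
    "HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $",
    "HISTORY 20060615 20:01:58",
    "HISTORY $Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $",
]
steps = pair_history(history)
for step in steps:
    print(step["when"], step["what"], step["version"], step["who"])
```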
18/24
UKIDSS FITS files (3)
 Catalogue files
 Primary header: metadata propagated
from raw image
 Primary data array: empty
 Headers of extensions include metadata for
catalogue generation process – invocations of
software modules in HISTORY records, with
parameter values in separate records
 Header keywords for both reduced images and
catalogues are defined in an Interface Control
Document between Cambridge & Edinburgh
19/24
User access to provenance info
 All header records from all FITS files
ingested into WSA except HISTORY records
 So, users can track provenance through
queries against WSA, and can get HISTORY
records by downloading files
 Hopefully enough to determine
whether an unusual object is real,
but is this good enough?
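The kind of trace-back query this enables can be sketched with an in-memory SQLite database. The schema below (tables `image` and `source`, and all column names and file names) is hypothetical and chosen only for illustration; it is not the WSA schema, which should be consulted directly for real queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables standing in for ingested image metadata
    -- and catalogue rows.
    CREATE TABLE image (
        image_id      INTEGER PRIMARY KEY,
        file_name     TEXT,     -- reduced image file
        raw_file_name TEXT,     -- raw file it was reduced from
        band          TEXT
    );
    CREATE TABLE source (
        source_id INTEGER PRIMARY KEY,
        image_id  INTEGER REFERENCES image(image_id),
        flux      REAL
    );
    INSERT INTO image  VALUES (10, 'reduced_0010.fit', 'raw_0010.fit', 'K');
    INSERT INTO source VALUES (501, 10, 42.0);
""")

def trace_source(conn, source_id):
    """Return (reduced file, raw file) for a catalogue source, or None."""
    return conn.execute(
        """
        SELECT i.file_name, i.raw_file_name
        FROM source AS s
        JOIN image AS i ON i.image_id = s.image_id
        WHERE s.source_id = ?
        """,
        (source_id,),
    ).fetchone()

trace = trace_source(conn, 501)
print(trace)
```

With the file names in hand, a user could download the files and inspect their HISTORY records, since those are the one part of the headers not held in the database.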
20/24
Provenance in data analysis:
Two main problems
 Less controlled software environment
 Little bits of code written for a specific analysis,
not tried and tested pipeline modules
 Use of data from many sources
 UKIDSS/WSA is state-of-the-art for provenance
 Many (esp. older) data resources not so good
 Provenance of combined dataset only as good as
provenance of worst constituent dataset?
22/24
Does this matter?
 Provenance information for data analysis is
recorded in the journal paper (sort of)
 Improving links between online literature and
data sources
 Increasing importance of large sky surveys
with well controlled environments
 Moving more of the data analysis from the user’s
desktop to the data centre
23/24
Conclusions
 Modern sky survey systems record & publish
extensive provenance for data reduction
 Very little provenance recorded from data
analysis – except description in journal paper
 More could surely be done – but would
researchers support overhead of doing so?
 Improvements expected as more analysis moves to the data centre
 Could/should we be doing more?
24/24