Incorporating Theory Data into the Virtual Observatory
L.D. Shaw¹, N.A. Walton¹
¹ Institute of Astronomy, University of Cambridge, Madingley Road, Cambridge, CB3 0HA
Abstract
We describe work investigating how astronomical simulation data can be effectively
incorporated into the Virtual Observatory. We focus specifically on determining whether
the data query, access and retrieval standards being developed by the International
Virtual Observatory Alliance for observational data can be similarly applied to, and
fully support, the requirements of data extracted from astrophysical simulations. We
present a data model for simulation data and identify the extensions required to the
Universal Content Descriptor astronomy metadata vocabulary to encompass simulation-specific
concepts and quantities. We also describe initial work on a new standard protocol
to enable uniform access to simulation datasets.
1. Introduction
Over the last decade, the way in which we do
observational astronomy has started to change.
The slow, independent and uncoordinated manner in which data was accumulated in the past,
through individual and unrelated observing programs, has been replaced by systematic and
methodical projects aiming to map the sky in unprecedented detail. Advances in telescope,
detector and computer technology have enabled us to explore the universe in a systematic
and detailed way, in multiple wavebands and at rapidly improving resolution. Individual
ground and space based survey telescopes are currently producing terabytes of data.
However, these instruments are merely precursors of those currently being designed and
built. In direct correspondence to Moore’s law, the rate at which astronomical data is
collected doubles roughly every year and a half.
Consequently, astronomy faces a data
avalanche, and the astronomical community is
confronted with a challenge: how can this huge
flood of data be exploited to its maximum potential? How can we federate the disjointed
archives of survey data and the vast numbers
of small datasets from individual observations
and enable access to them through a uniform
interface? A solution to these challenges would
provide a new and powerful tool with which to
probe the universe.
It was with these challenges in mind that the
concept of a Virtual Observatory (VO) was conceived. The VO is a system in which the vast
astronomical archives and databases around the
world, together with analysis tools and computational services, are linked together into an integrated service. In reality, a number of national
virtual observatories around the world are being
developed concurrently (in the UK, Astrogrid
[1]), each dealing with the data archives of their
host astronomical community. In order to ensure that the Virtual Observatories around the
world are able to interoperate, an international body, the International Virtual
Observatory Alliance (IVOA), was formed in 2001 to determine the standards to which
individual virtual observatories must adhere. Over the past few years, significant
progress has been made by the IVOA in developing standards and protocols for astronomical
data storage, discovery and retrieval.
Observational astronomy is not the only area
of the field undergoing a data revolution. With
the advent of parallel and grid computing
and rapidly improving hardware, computational
techniques in astrophysics and cosmology have
become important tools in evaluating highly
complex systems, from stellar evolution to the
formation of galaxies and clusters of galaxies.
However, there is currently no support for simulation
data and services within the framework
of VO standards. In this paper, we describe
initial work investigating the new protocols and
the changes to existing standards required to
enable the incorporation of simulated datasets
into the VO.
2. Standards for Simulation Data
Simulations are frequently used today in all areas of astronomy, from the birth, evolution and
death of stars and planetary systems, to the formation of the galaxies in which they reside and the
formation of the large-scale structures (dark matter
halos) in which galaxies themselves are thought
to form. There is much variety in the processes
being investigated and the underlying physics
that governs them. Many different approaches
have been chosen to tackle each problem, often employing very different algorithms to deal
with the complex physics involved. There is
clearly a huge amount of information that must
be recorded in order to fully describe a simulation and its results, not all of which can be
quantified in numerical terms. In order for simulated data to be included in the Virtual Observatory, we must first clearly identify all the
different components that describe a simulated
dataset. This is the purpose of defining an abstract data model for simulations (see Sec.
2.1).
At the IVOA interoperability meeting in Kyoto, the Theory Interest Group, which is charged
with ensuring that the VO standards also meet the requirements of theoretical (or
simulated) data, outlined a set of near-term targets with the aim of identifying where the
existing IVOA standards and implementations must be updated to allow the discovery,
exchange and analysis of simulated data. Computational cosmology is one example of an area
of astronomy that should be a major beneficiary of an international virtual observatory.
Many independent groups are working towards solving the problems of hierarchical structure
formation, performing large-scale simulations as an integral part of their investigations.
However, the results of many of these simulations are not publicly available. Furthermore,
of those that are, the data is stored in a wide
variety of (mostly undocumented) formats and
systems, often chosen because they were convenient at
the time. The interchange and direct
comparison of results by independent groups is
therefore uncommon, as much effort is often required to
obtain, understand and translate the data
before it is of any use. A primary goal
of the IVOA’s Theory Interest Group is thus to
decide upon a standard file format for raw simulation data and a metadata language (based
on the Universal Content Descriptors used for
observational data [4]) with which to describe
the contents. Based on this, a set of requirements for the standards being developed by the
Data Access Layer (DAL) and Virtual Observatory Query Language (VOQL) working groups
within the IVOA has been identified, so that
simulation data can seamlessly be discovered
and retrieved through VO portals.
In this section, we describe the key standards for data discovery, querying and retrieval
that have been developed and approved by the
IVOA, and discuss whether, in their current
form, they fully support the incorporation of
simulated data in the VO, as outlined above.
Although the standards discussed here do not
cover the full range of those that are being developed by the IVOA, they are those that are most
relevant to the differences between observed and
simulated data.
2.1. Data Modelling and Metadata
In order for simulated data to be included in
the Virtual Observatory, we must first clearly
identify all the different components that describe a simulated dataset. To this end,
initial attempts have been made to construct
an abstract data model for simulations. Figure 1 shows the current iteration of
the ‘Simulation’ data model. This model was
developed using the corresponding data model
defined for observational data, Observation [2],
as our starting point, modifying it where necessary. It is hoped that this will ensure that
Simulation has a similar overall structure to Observation, differing only in the detail. There
are two purposes to this approach. Firstly, it
is hoped that a similarity between data models
will aid the process of comparing simulated and
observed datasets. Secondly, it maintains the
possibility of defining an overall data model for
astronomical data, real or synthetic.
A simulation can essentially be broken down
into three main categories: Observation Data,
Characterisation and Provenance. Observation
Data describes the units and dimension of the
data. It inherits from the Quantity data model
(currently in development [3]) which assigns the
units and metadata to either single or arrays
of values. Characterisation describes not only
the ranges over which each measured quantity
is valid, but also how precise and how accurate
they are. It is composed of Coverage and Resolution. These represent the different parameters
constraining the data. Resolution describes the
scales at which it is believed the simulation results begin to become significantly influenced by
errors due to numerical effects, for example, due
to the finite size and mass of simulation grid-points or particles. Coverage describes the area
of the Characterisation parameter space that
the simulation occupies. It is itself composed
of Bounds, which describes the range of values
occupied by the simulation data, and PhysicalParameters, which consists of the set of physical
constants that define ‘the Universe’ occupied by
the simulation.
Figure 1: The principal components of the Simulation data model (see text). Labels in the
figure: Simulation, Curation, Characterisation, Sim. Data, Quantity, Coverage, Bounds,
Resolution, Phys. Params, Provenance, Theory, Algorithms, Computation, Resources, Objective,
Tech. Params, Coord/Area.
The third major component of Simulation is
Provenance, which describes how the data was
created. It consists of Theory, Computation and
Objective. Theory represents a description of
the underlying physical laws in the simulation.
It is expected to consist of a reference to a publication or resource describing the simulation.
Computation describes the technical aspect of
the simulation and has three components: Algorithms, TechnicalParameters and Resources.
Algorithms describes the sequence of numerical
techniques used to evolve the simulation from
one state to the next. It is expected that this
will also contain a reference to a published paper or resource. TechnicalParameters are
quantities representing the inputs to the algorithms,
such as ‘number of particles’ and ‘box size’. Resources describes the specifications of the
hardware on which the simulation was performed.
Objective describes the overall purpose of the
simulation: what phenomena was it
performed to investigate?
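As an informal illustration of the structure described above, the Characterisation and Provenance components could be mirrored as nested records. This is only a sketch: the class names follow the components named in the text, but the field types, the `within_bounds` helper and every value in `example` are assumptions made here for illustration, not part of the data model itself.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Coverage:
    # Bounds: range of values occupied by the simulation data, per quantity.
    bounds: Dict[str, Tuple[float, float]] = field(default_factory=dict)
    # PhysicalParameters: constants defining 'the Universe' of the simulation.
    physical_parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class Characterisation:
    coverage: Coverage
    # Resolution: scale below which numerical effects become significant.
    resolution: Dict[str, float] = field(default_factory=dict)


@dataclass
class Computation:
    algorithms: str  # reference to a published paper or resource
    technical_parameters: Dict[str, float] = field(default_factory=dict)
    resources: str = ""  # free-text description of the hardware used


@dataclass
class Provenance:
    theory: str  # reference describing the underlying physical laws
    computation: Computation
    objective: str


@dataclass
class Simulation:
    characterisation: Characterisation
    provenance: Provenance


def within_bounds(sim: Simulation, quantity: str, value: float) -> bool:
    """Return True if value lies inside the Bounds recorded for quantity."""
    low, high = sim.characterisation.coverage.bounds[quantity]
    return low <= value <= high


# Illustrative values only; none of these numbers come from the paper.
example = Simulation(
    characterisation=Characterisation(
        coverage=Coverage(
            bounds={"x": (0.0, 100.0)},
            physical_parameters={"omega_matter": 0.3},
        ),
        resolution={"length": 0.01},
    ),
    provenance=Provenance(
        theory="reference to a publication (placeholder)",
        computation=Computation(
            algorithms="reference to a code paper (placeholder)",
            technical_parameters={"number_of_particles": 128 ** 3, "box_size": 100.0},
            resources="placeholder hardware description",
        ),
        objective="placeholder statement of purpose",
    ),
)
```

A query layer could then use a helper such as `within_bounds(example, "x", 50.0)` to decide whether a requested region falls inside a given simulation before any data is transferred.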
2.2. Metadata: UCDs for Simulations
In order to describe the quantities and concepts
that are being outlined in the data models and
published in data archives, a restricted vocabulary – Universal Content Descriptors (UCD
[4]) – is being developed and controlled by the
IVOA. UCDs are not designed to allocate units
or names to quantities; they are meant to describe “what the unit is”. The overall purpose
of UCDs is to provide a standard means of describing astronomical quantities, whether the
luminosity of a galaxy or the exposure time
of the instrument with which it was observed,
using a restricted vocabulary (to prevent the
proliferation of words), whilst retaining the flexibility to enable precise, unambiguous
descriptions of the vast range of quantities that occur in astronomical datasets. Their main goal
is to ensure interoperability between heterogeneous datasets. If an astronomer is searching
for catalogues containing a specific quantity, she
can use the UCD for that quantity to locate all
those that contain it, whether it was the main
purpose of the observation or not.
UCDs consist of a string constructed by combining ‘words’ from
the controlled vocabulary, separated by semi-colons. Individual words
may be composed of several ‘atoms’ separated
by a period (.). The order of these atoms induces a hierarchy, where each atom
is a specific instance of its predecessor. Sequences of words, each representing a specific
concept, are then combined to provide a description
of what the actual quantity is. For example,
stat.error;phot.mag;em.opt.V would refer to the
error on the measurement of the photometric
magnitude (brightness) of an object in the V band of the optical region of the electromagnetic
spectrum.
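The word and atom structure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a validator: it does not check words against the controlled vocabulary, and the function names and the catalogue dictionary are hypothetical.

```python
def parse_ucd(ucd):
    """Split a UCD string into words (on ';') and each word into atoms (on '.')."""
    words = [w.strip() for w in ucd.split(";") if w.strip()]
    return [w.split(".") for w in words]


def has_word(ucd, word):
    """True if the UCD contains `word` or a more specific instance of it."""
    target = word.split(".")
    return any(atoms[:len(target)] == target for atoms in parse_ucd(ucd))


def find_catalogues(catalogues, word):
    """Return names of catalogues with at least one column matching `word`."""
    return sorted(
        name
        for name, column_ucds in catalogues.items()
        if any(has_word(ucd, word) for ucd in column_ucds)
    )


# Hypothetical catalogues, each listed by the UCDs of its columns.
catalogues = {
    "survey_a": ["phot.mag;em.opt.V", "stat.error;phot.mag;em.opt.V"],
    "survey_b": ["phot.mag;em.opt.B"],
}
```

Because atoms form a hierarchy, searching for `em.opt` matches both catalogues, while searching for `em.opt.V` matches only `survey_a`, regardless of whether that quantity was the main purpose of the observation.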
However, until now the UCD tree has been
defined to deal specifically with observation-related
quantities. Although there exists the
flexibility to describe the physical quantities
measured from simulations (as these are often
the properties of the objects that were being
simulated), it is not currently possible to describe the properties and parameters of the
simulations themselves. This includes some input
physical parameters (i.e. cosmological parameters) that define the theoretical context of the
simulation, and technical parameters that define its size, scope and resolution (e.g. number
of particles, length of simulation box side,
time/redshift of a simulation output). We have
therefore proposed a new branch of the UCD
tree to encompass computational techniques
in astronomy: ‘comp’. This branch can be used
to describe both astrophysical (and cosmological) simulations, and data reduction and
post-processing algorithms for both simulation and
observational data.
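To make the distinction concrete, the sketch below separates columns that describe the simulation itself from columns that describe simulated physical quantities, using only the leading atom of each UCD. Only the branch name `comp` comes from the text; the words `comp.sim.npart` and `comp.sim.boxsize` are placeholders invented here, since the proposed words are not listed in this paper, and the function and column names are likewise hypothetical.

```python
COMP_BRANCH = "comp"  # branch name proposed in the text


def primary_branch(ucd):
    """Return the first atom of the first word of a UCD string."""
    first_word = ucd.split(";")[0].strip()
    return first_word.split(".")[0]


def split_columns(column_ucds):
    """Separate simulation descriptors (comp branch) from physical quantities."""
    descriptors, quantities = {}, {}
    for name, ucd in column_ucds.items():
        target = descriptors if primary_branch(ucd) == COMP_BRANCH else quantities
        target[name] = ucd
    return descriptors, quantities


# Hypothetical snapshot metadata: column name -> UCD.
snapshot_metadata = {
    "n_part": "comp.sim.npart",      # placeholder word, not an agreed UCD
    "box_size": "comp.sim.boxsize",  # placeholder word, not an agreed UCD
    "mass": "phys.mass",
    "x": "pos.cartesian.x",
}

descriptors, quantities = split_columns(snapshot_metadata)
```

Here `descriptors` holds the technical parameters of the run, while `quantities` holds the properties of the simulated objects, which the existing branches of the tree can already describe.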
2.3. Simple Numerical Access Protocol
One of the key objectives of the Virtual Observatory is to provide uniform interfaces to the
different forms of astronomical data. This is
now being realised with the development of a
standard means of retrieving astronomical images and spectra. We are now developing a
prototype standard for retrieving raw simulation data from a variety of astronomical
simulation repositories: the Simple Numerical Access
Protocol (SNAP). SNAP is designed to enable uniform access to ‘raw’ simulation data.
This includes the retrieval of all the particles
or grid points within the simulation box at a
particular timestep (known as a ‘snapshot’), a
specified sub-volume of a simulation (all the
particles/grid-points within a certain region), or
data from a post-processed simulation. The latter typically includes catalogues of objects that
have been identified within a larger simulation
or within a suite of simulations.
Astrophysical and cosmological simulations
clearly cover an enormous range of scales and
processes. Consequently, there is no absolute
spatial scale for simulations, in the way that
there is for astronomical image data (position
on the sky, distance from Earth). The most basic function that a SNAP service must provide
is direct access to the raw particle data (or grid
state) from a simulation output via an access
URL.
However, due to rapidly improving hardware
and fast and efficient parallel codes, the resolution, and therefore the file size, of a single
snapshot can be extremely large. For example, many
cosmological simulations (e.g. [5]) now contain
1024^3 particles within the simulation box. The
file containing their positions and velocities will
then be at least 20 GB per timestep. It is therefore important to stress that, contrary to the
procedure for retrieving observational images
and data, simulation data cannot be retrieved
via HTTP with some kind of encoding for the binaries (e.g. base64), since this is an extremely
expensive operation when large datasets are being handled. FTP, or ideally GridFTP, must
be used to retrieve the higher resolution simulations.
The SNAP standard is currently in the process of being defined and agreed through the
IVOA. It is expected that v1.0 will be in place
after the September 2006 Interoperability meeting.
Acknowledgements
LDS is supported by a PPARC e-science studentship.
References
[1] Astrogrid: www.astrogrid.org
[2] http://www.ivoa.net/internal/IVOA/IvoaDataModel/obs.v0.2.pdf
[3] http://www.ivoa.net/twiki/bin/view/IVOA/IVOADMQuantityWP
[4] http://www.ivoa.net/twiki/bin/view/IVOA/IvoaUCD
[5] Bode, P., Ostriker, J.P. 2003, ApJS, 145, 1
[6] http://www.ivoa.net/twiki/bin/view/IVOA/IvoaTheory