Incorporating Theory Data into the Virtual Observatory

L.D. Shaw¹, N.A. Walton¹

¹ Institute of Astronomy, University of Cambridge, Madingley Road, Cambridge, CB3 0HA

Abstract

We describe work investigating how astronomical simulation data can be effectively incorporated into the Virtual Observatory. We focus specifically on determining whether the data query, access and retrieval standards being developed by the International Virtual Observatory Alliance for observational data can be similarly applied to, and fully support, the requirements of data extracted from astrophysical simulations. We present a data model for simulation data and identify the extensions required to the Universal Content Descriptor astronomy metadata vocabulary to encompass simulation-specific concepts and quantities. We also describe initial work on a new standard protocol to enable uniform access to simulation datasets.

1. Introduction

Over the last decade, the way in which we do observational astronomy has started to change. The slow, independent and uncoordinated manner in which data was accumulated in the past, through individual and unrelated observing programs, has been replaced by systematic and methodical projects aiming to map the sky in unprecedented detail. Advances in telescope, detector and computer technology have enabled us to explore the universe in a systematic and detailed way, in multiple wave-bands and at rapidly improving resolution. Individual ground- and space-based survey telescopes are currently producing terabytes of data. However, these instruments are merely precursors of those currently being designed and built. In direct correspondence to Moore’s law, the rate at which astronomical data is collected doubles roughly every year and a half.

Consequently, astronomy faces a data avalanche, and the astronomical community is confronted with a challenge: how can this huge flood of data be exploited to its maximum potential?
How can we federate the disjointed archives of survey data and the vast numbers of small datasets from individual observations, and enable access to them through a uniform interface? A solution to these challenges would provide a new and powerful tool with which to probe the universe.

It was with these challenges in mind that the concept of a Virtual Observatory (VO) was conceived. The VO is a system in which the vast astronomical archives and databases around the world, together with analysis tools and computational services, are linked together into an integrated service. In reality, a number of national virtual observatories around the world are being developed concurrently (in the UK, Astrogrid [1]), each dealing with the data archives of its host astronomical community. In order to ensure that the virtual observatories around the world are able to interoperate, an international body, the International Virtual Observatory Alliance (IVOA), was formed in 2001 to determine the standards to which individual virtual observatories must adhere. Over the past few years, significant progress has been made by the IVOA in developing standards and protocols for astronomical data storage, discovery and retrieval.

Observational astronomy is not the only area of the field undergoing a data revolution. With the advent of parallel and grid computing and rapidly improving hardware, computational techniques in astrophysics and cosmology have become important tools in evaluating highly complex systems, from stellar evolution to the formation of galaxies and clusters of galaxies. However, there is currently no support for simulation data and services within the framework of VO standards. In this paper, we describe initial work investigating the new protocols and the changes to existing standards required to enable the incorporation of simulated datasets into the VO.

2. Standards for Simulation Data

Simulations are frequently used today in all areas of astronomy: from the birth, evolution and death of stars and planetary systems, to the formation of the galaxies in which they reside and the formation of the large-scale structures, dark matter halos, in which galaxies themselves are thought to form. There is much variety in the processes being investigated and the underlying physics that governs them. Many different approaches have been chosen to tackle each problem, often employing very different algorithms to deal with the complex physics involved. There is clearly a huge amount of information that must be recorded in order to fully describe a simulation and its results, not all of which can be quantified in numerical terms. In order for simulated data to be included in the Virtual Observatory, we must first clearly identify all the different components that describe a simulated dataset. This is the purpose of defining an abstract data model for simulations (see Sec. 2.1).

At the IVOA interoperability meeting in Kyoto, the Theory Interest Group, charged with ensuring that the VO standards also meet the requirements of theoretical (or simulated) data, outlined a set of near-term targets with the aim of identifying where the existing IVOA standards and implementations must be updated to allow the discovery, exchange and analysis of simulated data.

Computational cosmology is one example of an area of astronomy that should be a major beneficiary of an international virtual observatory. Many independent groups are working towards solving the problems of hierarchical structure formation, performing large-scale simulations as an integral part of their investigations. However, the results of many of these simulations are not publicly available. Furthermore, of those that are, the data is stored in a wide variety of (mostly undocumented) formats and systems, often chosen because they were convenient at the time.
Therefore the interchange and direct comparison of results by independent groups is uncommon, as much effort is often required to obtain, understand and translate the data before it is of any use. A primary goal of the IVOA’s Theory Interest Group is thus to decide upon a standard file format for raw simulation data and a metadata language (based on the Universal Content Descriptors used for observational data [4]) with which to describe the contents. Based on this, a set of requirements on the standards being developed by the Data Access Layer (DAL) and Virtual Observatory Query Language (VOQL) working groups within the IVOA has been identified, so that simulation data can seamlessly be discovered and retrieved through VO portals.

In this section, we describe the key standards for data discovery, querying and retrieval that have been developed and approved by the IVOA, and discuss whether, in their current form, they fully support the incorporation of simulated data in the VO, as outlined above. Although the standards discussed here do not cover the full range of those being developed by the IVOA, they are the ones most relevant to the differences between observed and simulated data.

2.1. Data Modelling and Metadata

In order for simulated data to be included in the Virtual Observatory, we must first clearly identify all the different components that describe a simulated dataset. To this end, initial attempts have been made to construct an abstract data model for simulations. In Figure 1 we show the current iteration of the ‘Simulation’ data model. This model was developed using the corresponding data model defined for observational data, Observation [2], as our starting point, modifying it where necessary. It is hoped that this will ensure that Simulation has a similar overall structure to Observation, differing only in the detail. There are two purposes to this approach.
Firstly, it is hoped that a similarity between data models will aid the process of comparing simulated and observed datasets. Secondly, it maintains the possibility of defining an overall data model for astronomical data, real or synthetic.

A simulation can essentially be broken down into three main categories: Observation Data, Characterisation and Provenance. Observation Data describes the units and dimensions of the data. It inherits from the Quantity data model (currently in development [3]), which assigns the units and metadata to either single values or arrays of values.

[Figure 1: The principal components of the Simulation data model (see text). Box labels: SIMULATION, Curation, CHARACTERISATION, SIM. DATA, QUANTITY, COVERAGE, Bounds, Coord/Area, RESOLUTION, Phys. Params, PROVENANCE, THEORY, Algorithms, COMPUTATION, Resources, Tech. Params, OBJECTIVE.]

Characterisation describes not only the ranges over which each measured quantity is valid, but also how precise and how accurate they are. It is composed of Coverage and Resolution. These represent the different parameters constraining the data. Resolution describes the scales at which it is believed the simulation results begin to be significantly influenced by errors due to numerical effects, for example those arising from the finite size and mass of simulation gridpoints or particles. Coverage describes the area of the Characterisation parameter space that the simulation occupies. It is itself composed of Bounds, which describes the range of values occupied by the simulation data, and PhysicalParameters, which consists of the set of physical constants that define ‘the Universe’ occupied by the simulation.

The third major component of Simulation is Provenance, which describes how the data was created. It consists of Theory, Computation and Objective. Theory represents a description of the underlying physical laws in the simulation. It is expected to consist of a reference to a publication or resource describing the simulation.
Computation describes the technical aspect of the simulation and has three components: Algorithms, TechnicalParameters and Resources. Algorithms describes the sequence of numerical techniques used to evolve the simulation from one state to the next. It is expected that this will also contain a reference to a published paper or resource. TechnicalParameters are quantities representing the inputs to the algorithms, such as ‘number of particles’ and ‘box size’. Resources describes the specifications of the hardware on which the simulation was performed. Objective describes the overall purpose of the simulation: what phenomena was it performed to investigate?

2.2. Metadata: UCDs for Simulations

In order to describe the quantities and concepts that are being outlined in the data models and published in data archives, a restricted vocabulary, Universal Content Descriptors (UCD [4]), is being developed and controlled by the IVOA. UCDs are not designed to allocate units or names to quantities; they are meant to describe “what the unit is”. The overall purpose of UCDs is to provide a standard means of describing astronomical quantities, whether it be the luminosity of a galaxy or the exposure time of the instrument with which it was observed, using a restricted vocabulary (to prevent the proliferation of words), whilst retaining the flexibility to enable precise, unambiguous descriptions of the vast range of quantities that occur in astronomical datasets. Their main goal is to ensure interoperability between heterogeneous datasets. If an astronomer is searching for catalogues containing a specific quantity, she can use the UCD for that quantity to locate all those that contain it, whether it was the main purpose of the observation or not.

UCDs consist of a string constructed by combining ‘words’ from the controlled vocabulary, separated by semicolons.
Individual words may be composed of several ‘atoms’ separated by a period (.). The order of these atoms induces a hierarchy, where each atom is a specific instance of its predecessor. Sequences of words, each representing a specific concept, are then combined to provide a description of what the actual quantity is. For example, stat.error;phot.mag;em.opt.V would refer to the error on the measurement of the photometric magnitude (brightness) of an object in the V band of the optical region of the electromagnetic spectrum.

However, until now the UCD tree has been defined to deal specifically with observation-related quantities. Although there exists the flexibility to describe the physical quantities measured from simulations (as these are often the properties of the objects that were being simulated), it is not currently possible to describe the properties and parameters of the simulations themselves. This includes some input physical parameters (i.e. cosmological parameters) that define the theoretical context of the simulation, and technical parameters that define its size, scope and resolution (e.g. number of particles, length of simulation box side, time/redshift of a simulation output).

We have therefore proposed a new branch of the UCD tree to encompass computational techniques in astronomy: ‘comp’. This branch can be used to describe both astrophysical (and cosmological) simulations, and data reduction and post-processing algorithms for both simulation and observational data.

2.3. Simple Numerical Access Protocol

One of the key objectives of the Virtual Observatory is to provide uniform interfaces to the different forms of astronomical data. This is now being realised with the development of a standard means of retrieving astronomical images and spectra. We are now developing a prototype standard for retrieving raw simulation data from a variety of astronomical simulation repositories: the Simple Numerical Access Protocol (SNAP).

SNAP is designed to enable uniform access to ‘raw’ simulation data. This includes the retrieval of all the particles or grid points within the simulation box at a particular timestep (known as a ‘snapshot’), a specified sub-volume of a simulation (all the particles/grid-points within a certain region), or data from a post-processed simulation. The latter typically includes catalogues of objects that have been identified within a larger simulation or within a suite of simulations.

Astrophysical and cosmological simulations clearly cover an enormous range of scales and processes. Consequently, there is no absolute spatial scale for simulations, in the way that there is for astronomical image data (position on the sky, distance from Earth).

The most basic function that a SNAP service must provide is direct access to the raw particle data (or grid state) from a simulation output via an access URL. However, due to rapidly improving hardware and fast and efficient parallel codes, the resolution, and therefore the file size, of a single snapshot can be extremely large. For example, many cosmological simulations (e.g. [5]) now contain 1024³ particles within the simulation box. The file containing their positions and velocities will then be at least 20 GB per timestep. It is therefore important to stress that, contrary to the procedure for retrieving observational images and data, simulation data cannot be retrieved via HTTP with some kind of encoding for the binaries (e.g. base64), since this is an extremely expensive operation when large datasets are being handled. FTP, or ideally GridFTP, must be used to retrieve the higher resolution simulations.

The SNAP standard is currently in the process of being defined and agreed through the IVOA. It is expected that v1.0 will be in place after the September 2006 Interoperability meeting.

Acknowledgements

LDS is supported by a PPARC e-science studentship.
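The snapshot size quoted in Section 2.3 can be checked with simple arithmetic. The sketch below is illustrative only and makes assumptions that are not stated in the text: each particle stores three position and three velocity components as 4-byte single-precision floats, with no particle identifiers, masses or file headers. The function names are hypothetical and are not part of SNAP.

```python
# Rough size estimate for one raw particle snapshot, and for the same
# data after base64 encoding.
# Assumptions (not from the text): 3 position + 3 velocity components per
# particle, each a 4-byte float; no IDs, masses or headers are stored.

def snapshot_bytes(n_per_side: int,
                   values_per_particle: int = 6,
                   bytes_per_value: int = 4) -> int:
    """Bytes needed to store every particle of an n_per_side**3 simulation."""
    n_particles = n_per_side ** 3
    return n_particles * values_per_particle * bytes_per_value


def base64_bytes(raw_bytes: int) -> int:
    """Size after base64 encoding: each group of 3 bytes becomes 4 characters."""
    return 4 * ((raw_bytes + 2) // 3)


raw = snapshot_bytes(1024)
encoded = base64_bytes(raw)
print(f"raw:    {raw / 1e9:.1f} GB")      # raw:    25.8 GB
print(f"base64: {encoded / 1e9:.1f} GB")  # base64: 34.4 GB
```

Under these assumptions a 1024³ snapshot occupies about 25.8 GB, consistent with the "at least 20 GB" quoted in Section 2.3, and base64 encoding would add roughly a third more before any transfer overhead is counted.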
References

[1] Astrogrid: www.astrogrid.org
[2] http://www.ivoa.net/internal/IVOA/IvoaDataModel/obs.v0.2.pdf
[3] http://www.ivoa.net/twiki/bin/view/IVOA/IVOADMQuantityWP
[4] http://www.ivoa.net/twiki/bin/view/IVOA/IvoaUCD
[5] Bode, P., Ostriker, J.P. 2003, ApJS, 145, 1
[6] http://www.ivoa.net/twiki/bin/view/IVOA/IvoaTheory