We present the Indra suite of cosmological N-body simulations and the design of its companion database. Indra consists of 512 different instances of a 1 Gpc/h-sided box, each with 512^3 dark matter particles and the same input cosmology, enabling a characterization of very large-scale modes of the matter power spectrum with a 10^12 M_Sun particle mass and an excellent handle on cosmic variance. We discuss the database design for the particle data, consisting of the positions and velocities of each particle, and for the FOF halos, with links to the particle data so that halo properties can be calculated within the database.
512 different random instances, each 1 Gpc/h
Dark Matter only, WMAP7 cosmology
Particle mass of ~10^12 M_Sun
512^3 particles and 64 snapshots per simulation
Over 100 TB worth of data
In addition to the snapshots of particle data and the FOF halos calculated as the simulations run, we directly output the complex Fourier amplitudes of all large-wavelength modes at 256 time-steps in order to study the mildly non-linear regime.
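As a rough sketch of how these stored modes could be used, the query below bins mode amplitudes into shells of k to estimate the large-scale power spectrum; the table name fouriermodes, its columns, and the bin width are illustrative assumptions, not the actual Indra schema.

-- Illustrative sketch only: fouriermodes and its columns (timestep, kx, ky,
-- kz, re, im) are assumed names, not the final schema.
-- Bin |delta_k|^2 in shells of k to estimate the large-scale power spectrum.
select floor(sqrt(kx*kx + ky*ky + kz*kz) / 0.005) as kbin,
       avg(re*re + im*im) as pk,
       count(*) as nmodes
from fouriermodes
where timestep = 255
group by floor(sqrt(kx*kx + ky*ky + kz*kz) / 0.005)
order by kbin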
Classification of topologies (walls, filaments, clusters)
Studies of the baryon acoustic features in redshift-space
Large scale structure statistics, correlation functions, etc.
Mildly non-linear mode statistics and in-fall patterns
And much more…
The main feature of the database is that once the simulations have ended, it will contain all of the particle data for each of the 64 snapshots for each of the 512 simulation runs, totaling over 100 TB of data. We are currently testing particle tables with and without the use of SqlArrays (see below).
for each run, 1 row per particle per snapshot
for each run, 1 row per PH-index per snapshot
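The two layouts under consideration could look roughly like the following sketch; the plain-table name snapnoarr matches the example queries below, while snaparr and all column definitions are illustrative assumptions.

-- Rough sketch of the two candidate layouts (column definitions are assumptions):
-- (a) plain table, one row per particle per snapshot
create table snapnoarr (
    snapnum tinyint not null,
    phkey   int     not null,       -- Peano-Hilbert cell containing the particle
    id      bigint  not null,
    x real, y real, z real,
    vx real, vy real, vz real
)

-- (b) SqlArray table, one row per PH-index per snapshot; the particles of
--     each cell are packed into array-valued (varbinary) columns
create table snaparr (
    snapnum tinyint        not null,
    phkey   int            not null,
    ids     varbinary(max),          -- SqlArray of particle IDs
    pos     varbinary(max),          -- SqlArray of positions
    vel     varbinary(max)           -- SqlArray of velocities
)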
We will also include FOF halos that are being calculated as the simulation runs. Because we have all of the particle data, we can calculate any halo properties within the database itself or create new halo catalogs using different halo-finding algorithms.
for each run, 1 row per halo per snapshot
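For example, a halo's center of mass could be computed entirely in the database by joining the FOF catalog to the particle table; this sketch reuses the table and function names from the example queries below and assumes equal-mass particles.

-- Sketch: center of mass of one halo, computed in the database
-- (equal particle masses assumed, so a simple average suffices;
--  periodic wrapping of halos straddling the box edge is ignored here)
with hp as (
  select b.v as partid
  from (select partid from foftable
        where snapnum = 63 and haloid = 320) a
  cross apply BigIntArrayMax.ToTable(a.partid) b
)
select avg(p.x) as xcm, avg(p.y) as ycm, avg(p.z) as zcm
from snapnoarr p
  inner join hp on p.id = hp.partid
where p.snapnum = 63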
The particle tables will be indexed according to snapshot number and Peano-Hilbert key, which describes the x-, y-, and z-coordinates of particles according to a space-filling curve. This will enable very fast spatial searches as well as light cones for mock galaxy catalogs generated on the fly.
[Figure: Peano-Hilbert curves of different resolution]
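A minimal sketch of the kind of clustered index this implies is shown below; the index name is illustrative.

-- Sketch: cluster the particle table on (snapnum, phkey) so that range scans
-- over Peano-Hilbert keys touch contiguous pages (index name is assumed)
create clustered index ix_snapnum_phkey
    on snapnoarr (snapnum, phkey)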
Many tables will exploit SqlArray functions developed by László Dobos et al. for Microsoft SQL Server 2008. Some features of the SqlArrays include:
Flexible manipulation of arrays of the major data types (incl. complex)
Optimized storage for very small (point data) and very big (data grid) arrays
Data stored as binary with a short header
Array size up to 2 GB
Common math libraries (LAPACK, FFTW) are wrapped and callable from inside SQL
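As a small usage example, the ToTable function that appears in the halo query below unpacks an array-valued column into an ordinary rowset that can be joined against other tables:

-- Unpack the array of particle IDs of one halo into rows
-- (foftable and its columns are taken from the example queries below)
select b.v as partid
from foftable f
cross apply BigIntArrayMax.ToTable(f.partid) b
where f.snapnum = 63 and f.haloid = 320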
-- Select particles with velocity > 1500 km/s:
select id, sqrt(vx*vx+vy*vy+vz*vz) as v
from snapnoarr                                -- 1.2 million rows
where sqrt(vx*vx+vy*vy+vz*vz) > 1500.
  and snapnum = 11                            -- 2 minutes
-- Select particles within a specified sphere:
select id
from snapnoarr
where (x-500.)*(x-500.)+(y-500.)*(y-500.)+(z-500.)*(z-500.) <= 100.
  and snapnum = 63                            -- 1.5 minutes
-- As above, but accounting for PBCs and using spatial indexing:
declare @qshape Shape3D = Shape3D::newInstance('Sphere [@x,@y,@z,@r]');
with fs as
  (select * from dbo.fSimulationCoverShape(@sim,@qshape,6))
select *
from snapnoarr p
  inner join fs
    on p.phkey between fs.keymin and fs.keymax        -- search using PH-key
where @qshape.ContainsPoint(x+fs.Shiftx, y+fs.Shifty, z+fs.Shiftz) = 1
  and fs.FullOnly = 0
  and p.snapnum = 63                                  -- seconds
-- Calculate the number of halos and halo particles in each snapshot:
select snapnum, count(*) nhalo, sum(numpart) npart
from foftable
group by snapnum
order by snapnum                              -- compare to reading 32*64 data files
-- Get initial positions of particles in a particular halo (uses SqlArrays):
with q as (
  select b.v from
    (select partid from foftable
     where snapnum = 63 and haloid = 320) a
    cross apply BigIntArrayMax.ToTable(a.partid) b    -- decompose array of IDs
)
select p.id, p.x, p.y, p.z
from snapnoarr p
  inner join q on p.id = q.v
where p.snapnum = 0                           -- minutes
The database design and simulation runs are an ongoing process. For example, we are (or will be):
developing partitioning schemes that allow the most common queries to be run on the data in parallel (a possible sketch follows this list)
creating example queries that demonstrate how to use SQL and SqlArrays, as was done for SDSS and Millennium
designing an automated bulk-loading process
streamlining the use of SqlArray functions
incorporating on-the-fly data visualization
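One possibility for partitioning, sketched below under the assumption that snapshot number is the partitioning key, uses SQL Server's native partition functions; the boundary values and names are illustrative.

-- Illustrative sketch only: partition the particle table by snapshot number
-- so that queries against different snapshots can be processed in parallel
create partition function pfSnapnum (tinyint)
    as range right for values (8, 16, 24, 32, 40, 48, 56);

create partition scheme psSnapnum
    as partition pfSnapnum all to ([PRIMARY]);

-- a particle table created on this scheme would then be partitioned by snapnum:
-- create table snapnoarr ( ... ) on psSnapnum(snapnum)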
Finally, we plan to make the database available online.
Contact: Bridget Falck, bfalck@pha.jhu.edu