Scientific data management for big computers and big data

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
http://hdf.ncsa.uiuc.edu/HDF5/

A file format and software to describe, organize, store, share, and access big data

Answering big questions … involves big data … on big computers.

Matter & the universe
Life and nature
Weather and climate

How do we… Describe big data? Store it? Find it? Share it? Mine it? Move it into, out of, and between computers?

Lawrence Livermore National Laboratory
University of Illinois
Other HDF5 sponsors include: NASA, National Science Foundation, DOE SciDAC

Image: Simulation of a NIF laser beam passing through a plasma. Density gradient in the plasma causes the laser beam to self-focus and then split up into several "filaments". Simulation by Bert Still, Visualization by Steve Langer, LLNL

Image: Total Column Ozone (Dobson), August 24, 2001 and August 24, 2002.

Image: A 15-projector display wall (resolution 6400 x 3072) for viewing interactive applications and pre-computed animations at Lawrence Livermore National Laboratory. Courtesy of Arthur Mirin, LLNL

HDF5 runs on almost all computers, including many parallel computers

The ASCI White system contains 8,192 interconnected processors. Its 6.2 terabyte (trillion byte) memory is about 97,000 times that of a 64-MB PC. Its 7,000 disk drives, with 160 terabytes of storage space, have about 16,000 times the storage capacity of a desktop computer with a 10-GB hard disk.

Clusters and high performance computers include: ASCI Red, ASCI Blue Mountain, ASCI Blue Pacific, ASCI White, various experimental clusters

Tools

Various tools provide means of accessing HDF5 files, including the data, metadata, and hierarchical structure, without having to write new software.

HDFview, illustrated at the top of this image, displays the structure of a simple HDF5 file in one panel, raw data in another, and, if appropriate, an image or portion of it in a third. The larger image is the full, independently generated gravity wave image.
Visualization courtesy of John Shalf, NERSC/Lawrence Berkeley Laboratory, using data computed on the NERSC SP2 by Dennis Pollney and the Cactus Team, Albert Einstein Institute

HDF5 File Structure

Software Stacks

Applications and readers, often customized for particular technical fields, enable users to create, manipulate, and view scientific and engineering data. With the support of intervening libraries, common interfaces, and HDF5, scientists and engineers in many fields are able to share data and software.

Big Applications: Simulations, Models, Visualization, …
Examples: Thermonuclear simulations, Product modeling, Data mining tools, Visualization tools, Climate models

Specialized libraries and Common Interfaces use the HDF5 layer for data management and often provide specialized metadata, context, and tools for data transformations and exchange.

[Software stack diagram. Labels: Common Interfaces; Readers; Parallel; UDM; SAF; LibSheaf; IDL; HDF-EOS. Institutions named: LANL; LLNL, SNL; TriLab; NASA. Lower layers: HDF5 (serial and/or parallel); HDF5 virtual file layer (I/O drivers): Stdio, Split Files, MPI I/O, Custom.]

The HDF5 layer provides many data management functions, including machine-independent storage of all datatypes, metadata describing datatypes, user-defined attributes, and sophisticated subsetting and subsampling capabilities.

Parallel HDF5 uses MPI-IO to provide parallel file system functionality and global file access.

Virtual File Layer

The HDF5 VFL, or virtual file layer, provides access to many different data input and output mechanisms. The standard (stdio), split, and MPI drivers read from and write to files on storage media; the stream driver reads and writes virtual files or streams of data.
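The driver idea behind the virtual file layer can be pictured with a small sketch. This is a toy illustration only, not the HDF5 VFL or its C API: the names Driver, SplitDriver, StreamDriver, save_dataset, and demo are all hypothetical, and the "stream" here is just an in-memory buffer. The point it tries to show is that the application code stays the same while the storage mechanism is swapped underneath it.

```python
import io
import os
import tempfile
from abc import ABC, abstractmethod


class Driver(ABC):
    """Hypothetical driver interface; a toy stand-in, not the HDF5 VFL API."""

    @abstractmethod
    def write(self, kind: str, payload: bytes) -> None:
        """Append payload; kind is 'meta' or 'raw'."""

    @abstractmethod
    def read(self, kind: str) -> bytes:
        """Return everything stored so far for the given kind."""


class SplitDriver(Driver):
    """Appends metadata and raw data to two separate files on disk."""

    def __init__(self, base_path: str) -> None:
        self.paths = {"meta": base_path + ".meta", "raw": base_path + ".raw"}

    def write(self, kind: str, payload: bytes) -> None:
        with open(self.paths[kind], "ab") as f:
            f.write(payload)

    def read(self, kind: str) -> bytes:
        with open(self.paths[kind], "rb") as f:
            return f.read()


class StreamDriver(Driver):
    """Keeps everything in in-memory byte streams; nothing touches the disk."""

    def __init__(self) -> None:
        self.streams = {"meta": io.BytesIO(), "raw": io.BytesIO()}

    def write(self, kind: str, payload: bytes) -> None:
        self.streams[kind].write(payload)

    def read(self, kind: str) -> bytes:
        return self.streams[kind].getvalue()


def save_dataset(driver: Driver, name: str, values: list) -> None:
    """Application code: identical no matter which driver is plugged in."""
    driver.write("meta", f"{name}:{len(values)}\n".encode())
    driver.write("raw", bytes(values))  # values must be ints in 0..255


def demo() -> dict:
    """Save the same made-up dataset through both drivers and read it back."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        drivers = [("split", SplitDriver(os.path.join(tmp, "toy"))),
                   ("stream", StreamDriver())]
        for label, driver in drivers:
            save_dataset(driver, "temperature", [1, 2, 3])
            results[label] = (driver.read("meta"), driver.read("raw"))
    return results


if __name__ == "__main__":
    print(demo())
```

Both drivers return the same bytes for the same call to save_dataset, which is the property a layered I/O design of this kind is meant to provide.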
The VFL also enables the creation of custom drivers, such as the stream driver, for specialized or user-defined situations.

[Storage diagram. Labels: Stream; Storage; ?; File; Split metadata and raw data files; File on parallel file system; User-defined device; Across the network or to/from another application or library.]

Computers and operating systems include: MacOS X, MS Windows, UNIX, Linux, FreeBSD, OSF1, HP-UX, IBM SP, SGI IRIX64, Cray T3E, Cray SV1, Sun Solaris, IA-32 and IA-64

Copyright 2002 by the Board of Trustees of the University of Illinois

Representative Technical Fields* in which HDF5 Is Used

Aerospace; Agricultural research; Air traffic control; Aircraft emissions database; Applied mathematics; Astrophysics; Astrophysics / supernovae; Atmospheric chemistry; Atmospheric physics; Bioengineering; CEM Simulation; Climatology / hydrology; Computational fluid dynamics; Computational physics; Computational physics / education; Computational physics and computational astrophysics; Computer modeling; Computer science; Data processing; Earth observation / atmospheric science; Earth science; Environment; Fast searching, sorting and retrieval; Film making special effects; Fluid mechanics; GIS; Geodetic Science; Geology; Gravitational physics; Hydrology; Information technology; Magnetic mass spectrometer development; Marine biology / ecology; Materials science; Meteorological data products; Meteorology; Microscopy; Molecular biology; Nano device simulation; Neutron scattering; Ocean color; Ocean remote sensing; Optics / optoelectronics; Petroleum engineering; Photonic band gap studies; Photonic crystals; Photonics; Post-fire erosion analysis; Protein crystallography, molecular modeling; Protostellar accretion discs; Remote sensing; SAR processing; Satellite / weather radar remote sensing; Satellite oceanography; Semiconductor process simulation; Software engineering, distributed systems; Space geodesy; Space physics; Surface water flow and sediment transport; Theoretical chemistry; Visualization; Volcanology; Water resources management; X-ray physics

* from selected HDF5 download registrations, 15 October 2001 through 22 February 2002

• Store large, complex scientific and engineering data sets
• Retrieve complete data or partial data, easily and quickly
• Enable parallel I/O, remote access, specialized access
• A free, open standard developed by NCSA and the Lawrence Livermore, Sandia, and Los Alamos National Laboratories, with additional support from NASA

The name HDF5 derives from the term hierarchical data format. An HDF5 file is a hierarchically structured set of groups, datasets, and metadata.
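The description of an HDF5 file as a hierarchically structured set of groups, datasets, and metadata can be pictured with a toy in-memory model. The sketch below is an assumption-laden illustration, not the HDF5 library or its file format: the names Dataset, Group, walk, lookup, and build_example are hypothetical, and the numbers in the example are made up. It only shows how groups nest, how datasets sit at the leaves, and how attributes (metadata) attach to either.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """Toy dataset: a list of values plus user-defined attributes."""
    data: List[float]
    attrs: Dict[str, str] = field(default_factory=dict)


@dataclass
class Group:
    """Toy group: a container of named groups and datasets."""
    members: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, str] = field(default_factory=dict)


def walk(group: Group, prefix: str = "") -> List[str]:
    """Return the slash-separated path of every object below group, depth first."""
    paths = []
    for name, member in group.members.items():
        path = f"{prefix}/{name}"
        paths.append(path)
        if isinstance(member, Group):
            paths.extend(walk(member, path))
    return paths


def lookup(root: Group, path: str) -> Union[Group, Dataset]:
    """Follow a path such as '/climate/ozone' down from the root group."""
    node = root
    for part in path.strip("/").split("/"):
        node = node.members[part]
    return node


def build_example() -> Group:
    """Build a small hierarchy with made-up values, for illustration only."""
    density = Dataset([0.1, 0.4, 0.9], attrs={"units": "g/cm^3"})
    ozone = Dataset([285.0, 310.5], attrs={"units": "Dobson"})
    return Group(
        members={
            "simulation": Group(members={"density": density}),
            "climate": Group(members={"ozone": ozone}),
        },
        attrs={"title": "toy file"},
    )


if __name__ == "__main__":
    root = build_example()
    for p in walk(root):
        print(p)
    print(lookup(root, "/climate/ozone").attrs)
```

Because every object is reached by a path from the root, a reader can fetch one dataset and its attributes without touching the rest of the hierarchy, which is the intuition behind retrieving partial data.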