Planning ASCR/Office of Science Data-Management Strategy

Ray Bair1, Lori Diachin2, Stephen Kent3, George Michaels4, Tony Mezzacappa5, Richard Mount6, Ruth Pordes3, Larry Rahn7, Arie Shoshani8, Rick Stevens1, Dean Williams2

1 ANL, 2 LLNL, 3 Fermilab, 4 PNNL, 5 ORNL, 6 SLAC, 7 SNL, 8 LBNL

September 21, 2003

Introduction

The analysis and management of digital information is now integral to and essential for almost all science. For many sciences, data management and analysis is already a major challenge and is, correspondingly, a major opportunity for US scientific leadership. This white paper addresses the origins of the data-management challenge, its importance in key Office of Science programs, and the research and development required to grasp the opportunities and meet the needs.

Why so much Data?

Why do sciences, even those that strive to discover simplicity at the foundations of the universe, need to analyze so much data? The key drivers for the data explosion are:

The Complexity and Range of Scales of the systems being measured or modeled. For example, biological systems are complex, with scales that range from the ionic interactions within individual molecules, to the size of the organism, to the communal interactions of individuals within a community.

The Randomness of nature, caused either by the quantum mechanical fabric of the universe, which requires particle physics to measure billions of collisions to uncover physical truth, or by the chaotic behavior of complex systems, such as the Earth's climate, which prevents any single simulation from predicting future behavior.

The Technical Requirement that any massive computation cannot be seen to be valid unless its detailed progress can be monitored, understood and perhaps checkpointed, even if the goal is the calculation of a single parameter.

The Verification process of comparing simulation data with experimental data, which is fundamental to all sciences. This often requires the collection and management of vast amounts of instrument data, such as data from the sensors that monitor experiments or observe natural phenomena.

Subsidiary driving factors include the march of instrumentation, storage, database and computational technology, which brings many long-term scientific goals within reach, provided we can address both the data and the computational challenges.

Data Volumes and Specific Challenges

Some sciences have been challenged for decades by the volume and complexity of their data. Many others are only now beginning to experience the enormous potential of large-scale data analysis, often linked to massive computation. The brief outlines that follow describe the data-management challenges posed by observational astrophysics, biology, climate modeling, combustion, high-energy and nuclear physics, fusion energy science and supernova modeling. All of these fields face the immediate or imminent need to store, retrieve and analyze data volumes of hundreds of terabytes or even petabytes per year. Their needs have a high degree of commonality when looking forward several years, even though the most immediately pressing needs show some variation. Integration of heterogeneous data, efficient navigation and mining of huge data sets, tracking the provenance of data created by many collaborators or sources, and supporting virtual data are important across many fields.

Biology

A key challenge for biology is to develop an understanding of complex interactions at scales that range from molecular to ecosystems.
Consequently, research requires a breadth of approaches, including a diverse array of experimental techniques applied to systems biology, all of which produce massive data volumes. DOE's Genomes To Life program is slated to establish several new high-throughput facilities in the next 5 years that will have the potential for petabytes-per-day data-production rates. Exploiting the diversity and relationships of many data sources calls for new strategies for high-performance data analysis, integration, and management of very large, distributed and complex databases that will serve a very large scientific community.

Climate Modeling

To better understand the global climate system, numerical models of many components (e.g., atmosphere, oceans, land, and biosphere) must be coupled and run at progressively higher resolution, thereby rapidly increasing the data volume. Within 5 years, it is estimated that the data sets will total hundreds of petabytes. Moreover, large bodies of in-situ and satellite observations are increasingly used to validate models. Hence global climate research today faces the critical challenge of increasingly complex data sets that are fast becoming too massive for current storage, manipulation, archiving, navigation, and retrieval capabilities.

Combustion

Predictive computational models for realistic combustion devices involve three-dimensional, time-dependent, chemically reacting turbulent flows with multiphase effects of liquid droplets and solid particles, as well as hundreds of chemical reactions. The collaborative creation, discovery, and exchange of information across all of these scales and disciplines are a major challenge. This is driving increased collaborative efforts, growth in the size and number of data sets and storage resources, and a great increase in the complexity of the information that is computed for mining and analysis. On-the-fly feature detection, thresholding and tracking are required to extract salient information from the data.

Experimental High-Energy Particle and Nuclear Physics

The randomness of quantum processes is the fundamental driver for the challenging volumes of experimental data. The complexity of the detection devices needed to measure high-energy collisions brings a further multiplicative factor, and both quantum randomness and detector complexity require large volumes of simulated data complementing the measured data. Next-generation experiments at Fermilab, RHIC, and the LHC, for example, will push data volumes into the hundreds of petabytes, attracting hundreds to thousands of scientists applying novel analysis strategies.

Fusion Energy Science

The mode of operation of DOE's magnetic fusion experiments places a large premium on rapid data analysis, which must be assimilated in near-real-time by a geographically dispersed research team, worldwide in the case of the proposed ITER project. Data management issues also pose challenges for advanced fusion simulations, where a variety of computational models are expected to generate a hundred terabytes per day as more complex physics is added. High-performance data-streaming approaches and advanced computational and data-management techniques must be combined with sophisticated visualization to enable researchers to develop scientific understanding of fusion experiments and simulations.
Observational Astrophysics

The digital replacement of film as the recording and storage medium in astronomy, coupled with the computerization of telescopes, has driven a revolution in how astronomical data are collected and used. Virtual observatories now enable the worldwide collection and study of a massive, distributed repository of astronomical objects across the electromagnetic spectrum, drawing on petabytes of data. For example, scientists forecast the opportunity to mine future multi-petabyte data sets to measure the dark matter and dark energy in the universe and to find near-Earth objects.

Supernova Modeling

Supernova modeling has as its goal understanding the explosive deaths of stars and all the phenomena associated with them. Already, modeling of these catastrophic events can produce hundreds of terabytes of data and challenge our ability to manage, move, analyze, and render such data. Carrying out these tasks on the tensor field data representing the neutrino and photon radiation fields will drive individual simulation outputs to petabytes. Moreover, with visualization as an end goal, bulk data transfer rates for local analysis and visualization must increase by orders of magnitude relative to where they are today.

Addressing the Challenges: Computer Science in Partnership with Applications

All of the scientific applications discussed above depend on significant advances in data management capabilities. Because of these common requirements, it is desirable to develop tools and technologies that apply to multiple domains, thereby leveraging an integrated program in scientific data-management research. The Office of Science is well positioned to promote simultaneous advances in computer science and the other Office of Science programs that face a data-intensive future. Key computer-science and technological challenges are outlined below.

Low-latency/high transfer-rate bulk storage

Today disks look almost exactly like tapes did 30 years ago. In comparison with the demands of processors, disks have abysmal random-access performance and poor transfer rates. The search for a viable commercial technology bridging the millisecond-to-nanosecond latency gap between disk and memory will intensify. The Office of Science should seek to stimulate relevant research and to partner with US industry in early deployment of candidate technologies in support of its key data-intensive programs.

Very Large Data Base (VLDB) technology

Database technology can be competitive with direct file access in many applications and brings many valuable capabilities. Commercial VLDB software has been scaled to the near-petabyte level to meet the requirements of high-energy physics, with collateral benefits for national security and the database industry. The Office of Science should seek partnerships with the VLDB industry where such technology might benefit its data-intensive programs.

High-speed streaming I/O

High-speed parallel or coordinated I/O is required for many simulation applications. Throughput itself can be achieved by massive parallelism, but to make this useful, the challenges of coordination and robustness must be addressed for many-thousand-stream systems. High-speed streaming I/O relates closely to 'Data movement' (below) but with a systems and hardware focus.
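As a concrete, if toy-scale, picture of the coordination and robustness problem just described, the sketch below writes many independent output streams in parallel, verifies each with a checksum, and retries failed streams. The stream count, chunk size, and retry budget are illustrative assumptions rather than requirements drawn from any specific facility; a production many-thousand-stream system would add collective scheduling, failure reporting, and instrumentation on top of the same pattern.

```python
# Toy sketch of coordinated, robust multi-stream output: each stream writes to
# its own file in parallel, verifies a checksum, and retries on failure.
# Stream count, chunk size, and retry budget are illustrative assumptions.
import concurrent.futures
import hashlib
import os
import tempfile

CHUNK = b"x" * (1 << 20)   # 1 MiB of synthetic payload per write
RETRIES = 3                # assumed retry budget per stream

def write_stream(directory, stream_id, n_chunks=4):
    """Write one stream, returning its id and checksum; retry on I/O errors."""
    path = os.path.join(directory, f"stream_{stream_id:05d}.dat")
    for attempt in range(1, RETRIES + 1):
        try:
            digest = hashlib.sha256()
            with open(path, "wb") as f:
                for _ in range(n_chunks):
                    f.write(CHUNK)
                    digest.update(CHUNK)
            return stream_id, digest.hexdigest()
        except OSError:
            if attempt == RETRIES:
                raise  # give up once the retry budget is exhausted

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as outdir:
        # A production system would coordinate thousands of streams; 32 keeps the toy fast.
        with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
            results = list(pool.map(lambda i: write_stream(outdir, i), range(32)))
        print(f"completed {len(results)} streams")
```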
Random I/O: de-randomization

In the absence of a breakthrough in low-latency, low-cost bulk storage, the challenge for applications needing apparently random data access is to effectively de-randomize the access before the requests meet the physical storage device. Caching needs to be complemented by automated re-clustering of data and by automated re-ordering and load-balancing of access requests. All this also requires instrumentation and troubleshooting capabilities appropriate for the resulting large-scale and complex data-access systems.

Data transformation and conversion

At its simplest level, this is just the hard work that must be done when historically unconnected scientific activities need to merge and share data held in many ad hoc formats. Hard work itself is not necessarily computer science, but a forward-looking effort to develop very general approaches to data description and formatting, and to deploy them throughout the Office of Science, would revolutionize scientific flexibility.

Data movement

Moving large volumes of data reliably over wide-area networks has historically been a tedious, error-prone, but extremely important task for scientific applications. While data movement can sometimes be avoided by moving the computation to the data, it is often necessary to move large volumes of data to powerful computers for rapid analysis, or to replicate large subsets of simulation and experiment data from the source location to scientists all over the world. Tools need to be developed that automate the task of moving large volumes of data efficiently and that recover from system and network failures.

Data provenance

The scientific process depends integrally on the ability to fully understand the origins, transformations and relationships of the data used. Defining and maintaining this record is a major challenge with very large bodies of data and with the federated, distributed knowledge bases that are emerging today. Research is needed to provide the tools and infrastructure to manage and manipulate these metadata at scale.

Data discovery

As simulation and experiment data scale to terabyte and petabyte databases, many traditional, scientist-intensive ways of examining and comparing data sets no longer have enough throughput to allow scientists to make effective use of the data. Hence a new generation of tools is needed to assist in the discovery process, tools that can, for example, ingest and compare large bodies of data and extract a manageable flow of features.

Data preservation and curation

The data we are collecting today need to be read and used over several decades, yet over the lifetime of a scientific collaboration the underlying information technology changes on much shorter time scales. Advancing technologies for the preservation and curation of huge data stores over multi-decade periods will benefit a broad range of scientific data programs.

Conclusion

Science is becoming highly data-intensive. A large fraction of the research supported by the Office of Science faces the challenge of managing and rapidly analyzing data sets that approach or exceed a petabyte. Complexity, heterogeneity and an evolving variety of simultaneous access patterns compound the problems of pure size. To address this crosscutting challenge, progress on technology and computer science is required at all levels, from storage hardware to the software tools that scientists will use to manage, share and analyze data.
The Office of Science is uniquely placed to promote and support the partnership between computer science and applications that is considered essential to rapid progress in data-intensive science. Currently supported work, notably the SciDAC Scientific Data Management Center, forms an excellent model, but brings only a small fraction of the resources that will be required. We propose that ASCR, in partnership with the application-science program offices, plan an expanded program of research and development in data management that will optimize the scientific productivity of the Office of Science.

Background Material on Applications

Biology

The challenge for biology is to develop an understanding of the complexities of interactions across an enormous breadth of scales. From atomic interactions to environmental impacts, biological systems are driven by diversity. Consequently, research in biology requires a breadth of approaches to effectively address significant scientific problems. Technologies such as magnetic resonance, mass spectrometry, confocal and electron microscopy for tomography, DNA sequencing and gene-expression analysis all produce massive data volumes. It is in this diversity of analytical approaches that the data-management challenge exists. Numerous databases already exist at a variety of institutions, and high-throughput analytical biology activities are already generating data at enormous rates. DOE's "Genomes To Life" program will establish several user facilities that will generate data for systems biology at unprecedented rates and scales. These facilities will be the sites where large-scale experimental workflows must be managed, metadata captured, data analyzed, and systems-biology data and models provided to the community. Each of these facilities will need plans that address the issues of providing easy and rapid searching of high-quality data to the broadest community in its application areas. As biology becomes ever more data-intensive, new strategies for high-performance data analysis, integration, and management of very large, distributed and complex databases will be necessary to pursue the science at the grand-challenge scale. Research is needed on new real-time analysis and storage solutions (both hardware and software) that can accommodate petabyte-scale data volumes and provide rapid analysis, data query and retrieval.

Climate Modeling

Global climate research today faces a critical challenge: how to deal with increasingly complex data sets that are fast becoming too massive for current storage, manipulation, archiving, navigation, and retrieval capabilities. Modern climate research involves the application of diverse sciences (e.g., meteorology, oceanography, hydrology, chemistry, and ecology) to computer simulation of different Earth-system components (e.g., the atmosphere, oceans, land, and biosphere). In order to better understand the global climate system, numerical models of these components must be coupled together and run at progressively higher resolution, thereby rapidly increasing the associated data volume. Within 5 years, it is estimated that the data sets of climate modelers will total hundreds of petabytes.
(For example, at the National Center for Atmospheric Research, a 100-year climate simulation by the Community Climate System Model (CCSM) currently produces 7.5 terabytes of data when run on a ~250-km grid, and there are plans to increase this resolution by more than a factor of 3; further, the Japanese Earth Simulator is now able to run such climate simulations on a 10-km grid.) Moreover, in-situ and global satellite observations, which are vital for verification of climate-model predictions, also produce very large quantities of data. With more accurate satellite instruments scheduled for future deployment, monitoring a wider range of geophysical variables at higher resolution will also demand greatly enhanced storage facilities and retrieval mechanisms.

Current climate-data practices are highly inefficient, allowing ~90 percent of the total data to remain unexamined. (This is mainly due to data disorganization on disparate tertiary storage, which keeps climate researchers unaware of much of the available data.) However, since the international climate community recently adopted the Climate and Forecast (CF) data conventions, much greater commonality among data sets at different climate centers is now possible. Under the DOE SciDAC program, the Earth System Grid (ESG) is working to provide tools that solve basic problems in data management (e.g., data-format standardization, metadata tools, access control, and data-request automation), even though it does not address the direct need for increased data-storage facilities. In the future, many diverse simulations must be run in order to improve the predictions of climate models. The development of virtual data and virtual catalogues for data generation will reduce the physical storage required. Collaborations with other projects, including DOE-funded data initiatives, universities, and private industrial groups, will also facilitate the climate community's development of an integrated, multidisciplinary approach to data management that will be central to national interests in this area.

Combustion

The advancement of the DOE mission for efficient, low-impact energy sources and utilization relies upon continued significant advances in fundamental chemical sciences and the effective use of the knowledge accompanying these advances across a broad range of disciplines and scales. This challenge is exemplified in the development of predictive computational models for realistic combustion devices. Combustion modeling requires the integration of computational physical and chemical models that span space and time scales from atomistic processes to those of the physical combustion device itself. Combustion systems involve three-dimensional, time-dependent, chemically reacting turbulent flows that may include multiphase effects with liquid droplets and solid particles in complex physical configurations. Against this fluid-dynamical backdrop, chemical reactions occur that determine the energy production in the system, as well as the emissions that are produced. For complex fuels, the chemistry involves hundreds to thousands of chemical species participating in thousands of reactions. These chemical reactions occur in an environment that is defined by both thermal conduction and radiation. Reaction rates as a function of temperature and pressure are determined experimentally and by a number of methods using data from quantum mechanical computations.
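Such rate data are typically tabulated in modified Arrhenius form. The minimal sketch below, with placeholder coefficients rather than values for any reaction discussed here, shows the steep temperature dependence that each of the thousands of reactions contributes at every grid point and time step.

```python
# Illustrative only: elementary-reaction rate constants in combustion mechanisms
# are commonly tabulated in modified Arrhenius form. The coefficients below are
# placeholders, not data for any reaction mentioned in the text.
import math

R = 8.314462618  # universal gas constant, J/(mol K)

def rate_constant(T, A, b, Ea):
    """Modified Arrhenius form: k(T) = A * T**b * exp(-Ea / (R * T))."""
    return A * T**b * math.exp(-Ea / (R * T))

if __name__ == "__main__":
    # Hypothetical coefficients, chosen only to show the steep temperature dependence.
    A, b, Ea = 1.0e10, 0.5, 1.5e5  # pre-exponential, temperature exponent, activation energy (J/mol)
    for T in (800.0, 1200.0, 1600.0, 2000.0):
        print(f"T = {T:6.0f} K   k = {rate_constant(T, A, b, Ea):.3e}")
```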
The collaborative creation, discovery, and exchange of information across all of these scales and disciplines are required to meet DOE's mission requirements. This research includes the production and mining of extensive databases from direct simulations of detailed reacting-flow processes. For example, with the computing power envisioned by the SciDAC program for DOE science in the coming decade, direct numerical simulation (DNS) could provide insights into complex autoignition phenomena in three dimensions with hydrocarbon chemistry. With more efficient codes and these larger computers, the first practical 3D DNS study of autoignition of n-heptane would consume ~100 terabytes of data storage. The problem definition, code implementation, and mining of these data sets will necessarily be carried out by a collaborative (and distributed) consortium of researchers. Moreover, this research leads to reduced models that must be validated, for example Large Eddy Simulations (LES) that can describe geometrical and multi-phase effects along with reduced chemical models. While these reduced-model simulations typically will use fewer floating-point operations per simulation, more simulations are required to validate models in environments that will enable predictive design codes to be developed.

Collaborative data mining and analysis of large DNS data sets will require resources and a new paradigm well beyond the current practice of using advanced workstations and data transfer via FTP. The factors driving this change are the size and number of data sets that require unique storage resources, the increased collaborative nature of the research projects, and the great increase in the complexity of the information that can be computed for mining and analysis. On-the-fly feature detection, thresholding, and tracking, along with postprocessing, are required to extract salient information from the data. Feature-borne statistical and level-set analysis, as well as feature-directed I/O, would help reduce the volume of data stored and increase the scientific efficiency of gleaning new insight from massive four-dimensional data sets (time plus three spatial dimensions). Thus the analysis of data, both in real time and after the computation, will require a new combustion framework comprising domain-specific analysis libraries (combustion, chemistry, turbulence) layered on top of generic math libraries and visualization packages. This framework must be portable and operable remotely over the network.

Experimental High-Energy Particle and Nuclear Physics

The randomness of quantum processes is the fundamental driver for the challenging volumes of experimental data in high-energy and nuclear physics. The complexity of the detection devices needed to measure high-energy collisions brings a further multiplicative factor, and both quantum randomness and detector complexity require large volumes of simulated data complementing the measured data. Particle-collision data are compressed during acquisition by suppressing all data elements that are consistent with no signal. The data are then reconstructed, collision by collision, creating a hierarchy of objects at the top of which are candidates for the particles and energy flows that emerge from each collision. The volume of measured, reconstructed and simulated data from the BaBar experiment at SLAC is now over one petabyte. The Fermilab Run-II experiments and the nuclear physics experiments at BNL's RHIC collider are expected to overtake BaBar shortly.
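The acquisition-time compression just described, suppression of every channel whose reading is consistent with no signal, can be pictured with a toy sketch. The channel count, pedestal, and noise cut below are invented for illustration and do not describe any actual detector.

```python
# Minimal sketch of the zero suppression described above: keep only detector
# channels whose readings rise above the pedestal by more than a noise cut.
# The channel count, pedestal, and threshold are invented for illustration.
import random

N_CHANNELS = 10_000
PEDESTAL = 100.0   # assumed baseline reading of an empty channel
THRESHOLD = 5.0    # assumed noise cut above the pedestal

def read_event():
    """Simulate one collision readout: mostly noise, a few channels with signal."""
    readings = [PEDESTAL + random.gauss(0.0, 1.5) for _ in range(N_CHANNELS)]
    for hit in random.sample(range(N_CHANNELS), k=50):
        readings[hit] += random.uniform(20.0, 200.0)
    return readings

def zero_suppress(readings):
    """Store only (channel, value) pairs inconsistent with 'no signal'."""
    return [(ch, val) for ch, val in enumerate(readings)
            if val - PEDESTAL > THRESHOLD]

if __name__ == "__main__":
    event = read_event()
    sparse = zero_suppress(event)
    print(f"kept {len(sparse)} of {len(event)} channels "
          f"({100.0 * len(sparse) / len(event):.1f}%)")
```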
The LHC program at CERN will bring data volumes of hundreds of petabytes early in the next decade. All of these programs have a rich physics potential, attracting hundreds to thousands of scientists to attempt novel analysis strategies on the petabytes of data. The data-analysis challenge is daunting; the opportunities include facilitating unfettered, often apparently random, access to petabytes with a moderately complex structure and a granularity at the level of kilobyte objects, and keeping track of the millions of derived data products that a collaboration of thousands of physicists will create. De-randomization (e.g., re-clustering the data according to their properties and access patterns) will be vital: with a petabyte of disk storage today, it would take about one month to access each of the 10^12 kilobyte objects randomly, even assuming parallel access to every disk. As data volumes and disk storage densities increase, random-access analysis will take years or lifetimes without new approaches to de-randomization.

Fusion Energy Science

The fusion energy sciences research community faces very challenging data management issues. The three main magnetic-fusion experimental sites in the United States operate in a similar manner. Magnetic-fusion experiments operate in a pulsed mode, producing plasmas of up to 10 seconds' duration every 10 to 20 minutes, with 25-35 pulses per day. For each plasma pulse, up to 10,000 separate time-dependent measurements are acquired at sample rates from kilohertz to megahertz, representing a few hundred megabytes of data; a typical run day archives about 18 GB of data from these experiments. Throughout the experimental session, hardware and software plasma-control adjustments are made as required by the experimental science. Decisions about changes to the next plasma pulse are informed by data analysis conducted within the roughly 15-minute between-pulse interval. This mode of operation places a large premium on rapid data analysis that can be assimilated in near-real-time by a geographically dispersed research team. Future large experiments, like the proposed multi-billion-dollar international ITER project, will use superconducting magnet technology, resulting in much longer pulses and therefore more data (~1 terabyte per day). The international nature of ITER will require a data management technology, with associated infrastructure, that can support collaborating sites around the world in near-real-time.

Data management issues also pose challenges for advanced fusion simulations. Although the fundamental laws that determine the behavior of fusion plasmas are well known, obtaining their solution under realistic conditions is a computational science problem of enormous complexity, which has led to the development of a wide variety of sophisticated computational models. Fusion simulation codes run on workstations, mid-range clusters of fewer than 100 processors, and parallel supercomputers with thousands of processors. The most data-intensive code today can generate 1 terabyte per day, and this is expected to grow to 100 terabytes per day as more complex physics is added to the simulation. Because the complexity of the data sets is growing, network-parallel data-streaming technologies are being used to pipeline simulation output into separate processes for automated parallel analysis of the results; smaller subsets of the data can then be archived in advanced database systems.
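A toy version of this streaming pattern is sketched below: a producer process stands in for the simulation, a bounded queue stands in for the network stream, and the consumer reduces each chunk on the fly so that only a small summary is archived. The chunk sizes, the min/mean/max reduction, and the in-memory archive are assumptions made purely for illustration.

```python
# Toy version of the streaming pattern described above: a producer process
# stands in for the simulation, a bounded queue stands in for the network
# stream, and the consumer reduces each chunk on the fly so that only a small
# summary is "archived".
import multiprocessing as mp
import statistics

N_STEPS = 20
VALUES_PER_STEP = 10_000

def simulate(queue):
    """Producer: emit one chunk of synthetic field values per time step."""
    for step in range(N_STEPS):
        chunk = [float((step * 7919 + i) % 1000) for i in range(VALUES_PER_STEP)]
        queue.put((step, chunk))
    queue.put(None)  # sentinel: no more data

def analyze(queue, archive):
    """Consumer: reduce each chunk as it arrives and archive only the summary."""
    while True:
        item = queue.get()
        if item is None:
            break
        step, chunk = item
        archive.append((step, min(chunk), statistics.fmean(chunk), max(chunk)))

if __name__ == "__main__":
    manager = mp.Manager()
    archive = manager.list()
    queue = mp.Queue(maxsize=4)  # bounded queue applies back-pressure to the producer
    producer = mp.Process(target=simulate, args=(queue,))
    consumer = mp.Process(target=analyze, args=(queue, archive))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    print(f"archived {len(archive)} summaries instead of "
          f"{N_STEPS * VALUES_PER_STEP} raw values")
```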
Data management methodologies must be combined with sophisticated visualization to enable researchers to develop scientific understanding of simulation results.

Observational Astrophysics

With the advent of modern digital imaging detectors and new telescope designs, astronomical data are being collected at an exponentially increasing rate. Projects underway now will soon be generating petabytes of data per year. Digitized data (obtained from both targeted and survey observations), coupled to powerful computing capabilities, are allowing astronomers to begin the construction of "virtual observatories," in which astronomical objects can be viewed and studied online across the electromagnetic spectrum. The increased sophistication of the analysis tools has also changed the relationship between observations and theory/modeling, as the observational results have become more and more quantitative and precise, and therefore more demanding of theory. Data mining of these future multi-petabyte data sets will lead to new discoveries, such as the measurement of the dark matter and dark energy in the universe through gravitational lensing and the identification of large numbers of near-Earth objects. Astronomical archives have a strong legacy value because any given observation represents a snapshot of a piece of the universe that is a continuum in both space and time. Effective use of these data sets will require new strategies for data management and data analysis. Data sets must be made accessible to an entire community of researchers. Efficient access will require intelligent algorithms for organizing and caching data. Access patterns will be varied, with some users analyzing long streams of data while others extract small subsets randomly. Data sets will have complex structures, and standardized methods will be needed to describe them.

Supernova Modeling

Supernova modeling has as its goal understanding the explosive deaths of stars, either as thermonuclear (Type Ia) supernovae or as core-collapse (Type Ib, Ic, and II) supernovae, and all of the phenomena associated with them. Type Ia supernovae are serving as "standard candles," illuminating deep recesses of our universe and telling us about its ultimate fate. Core-collapse supernovae are known to be a dominant source of the elements in the periodic table, without which life as we know it would not exist. Thus, supernovae are central to our efforts to probe the universe and to understand our place in it. Modeling of these catastrophic events involves simulating the turbulent fluid flow in exploding, rotating stellar cores; the radiative transfer of photons and neutrinos, which in part powers these explosions and brings us information about the elements produced and about the explosive dynamics; the evolution of the magnetic fields threading these dying stars; and the coupling of these model components. Scalar, vector, and tensor fields are all present, and an understanding of supernova dynamics cannot be achieved without data management and, ultimately, visualization of these fields. While simulation of three-dimensional hydrodynamics at high resolution can produce hundreds of terabytes of simulation data, and thus challenge our ability to manage, move, analyze, and render such data, carrying out these tasks on the tensor field data representing the neutrino and photon radiation fields will drive the development of data management and visualization technologies to an entirely new level.
In a single three-dimensional radiation hydrodynamics simulation with multi-frequency radiation transport, at only moderate resolutions of 1024 zones in each of the three spatial dimensions and 24 groups in radiation frequency, the full radiation tensor stored for one thousand of the tens of thousands of simulation time steps will be several petabytes in size. Even if only certain critical components of the radiation tensor were stored (e.g., the multi-frequency radiation energy densities and fluxes), with the other components discarded, data sets of many hundreds of terabytes would still be produced per simulation run. The production of data in these quantities will severely stress our ability to cache and archive data. Moreover, with visualization as an end goal, and as the only real hope of comprehending such complex flows and quantities of data, networks too will be severely stressed. Bulk data transfer rates for local analysis and visualization must increase by orders of magnitude relative to where they are today in order for data transfers to be completed in a reasonable amount of time, and the alternative of remote, and especially collaborative, visualization will require significant dedicated bandwidth on demand and network protocols that can provide it.
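A back-of-the-envelope check, shown below, reproduces the scale quoted above and illustrates why bulk transfer rates must grow by orders of magnitude. The ten independent tensor components, the 8-byte precision, and the 1 Gb/s sustained wide-area rate are assumptions for illustration only.

```python
# Back-of-the-envelope check of the data volumes quoted above. The number of
# independent tensor components, the 8-byte precision, and the assumed network
# rate are illustrative assumptions, not figures taken from the text.
ZONES = 1024 ** 3       # spatial zones: 1024 in each of three dimensions
GROUPS = 24             # radiation frequency groups
COMPONENTS = 10         # assumed: energy density, 3 flux, 6 pressure components
BYTES_PER_VALUE = 8     # assumed double precision
STEPS_STORED = 1000     # time steps retained out of tens of thousands

per_step = ZONES * GROUPS * COMPONENTS * BYTES_PER_VALUE
total = per_step * STEPS_STORED
print(f"per time step : {per_step / 1e12:.1f} TB")
print(f"stored output : {total / 1e15:.1f} PB")

# Moving that output over an assumed sustained 1 gigabit/s wide-area link
# would take roughly half a year, hence the call for orders-of-magnitude
# faster bulk transfer.
rate_bytes_per_s = 1e9 / 8
print(f"transfer time : {total / rate_bytes_per_s / 86400:.0f} days at 1 Gb/s")
```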