An Integrative, Cross-Foundation Compute and Data Infrastructure for Science & Engineering Research

Multiple Authors in the Community [1]

[1] See the list of contributors in the "Contributors from the Community" section below.

Concept: In an era where science and engineering communities have become digital, and computational and data science is ever more critical to progress, the nation will benefit from a strong, predictable path for supporting computation- and data-intensive research across a broad spectrum of research efforts (e.g., individuals, groups, communities, and projects, including MREFCs), requiring diverse capabilities (e.g., HPC, HTC, data and cloud) and magnitudes (current HPC facilities are saturated, while the nation lacks national data services in an exponentially growing big data era). Current needs for computing and data resources are outstripping current capacities, and with no clear long-term plan in place, research communities are both underserved and uncertain of the future availability of national resources and services. Further, investments made in other regions have created a situation in which US researchers have access to fewer resources than their international competitors, a significant reversal of conditions that disadvantages US researchers.

To address the challenges facing our nation, research requires more deeply integrated approaches encompassing theory, experiment and/or observation, data and simulation science, and crossing disciplinary boundaries. Yet uncertainties in the availability of national computing and data resources and services lead to fragmentation, not integration, as individual major research projects prefer to develop their own solutions; they cannot plan on synergies with other national resources that may not be available in the future. It is also substantially more expensive as research groups reinvent individual cyberinfrastructure (CI) solutions over and over again. Further, the needs may be big enough that individual research groups could not possibly build such solutions on their own. A long-term commitment to an integrated national computing and data infrastructure is needed, one that recognizes the importance of these resources for scientific and engineering progress as well as the importance of the technical culture, expertise and staff that can take years to develop around major computing and experimental facilities.

It is difficult to adopt a concerted, synergistic strategy for conducting computational studies. There is no one-stop shopping place for resources and services for the US science community. NSF funds XSEDE (providing a fairly modest amount of resources) and Blue Waters (requiring a completely different route for requesting resources); DOE funds INCITE (providing considerable resources but constrained by the type of problems that can be considered); and the single NIH-supported resource, Anton at the PSC, provides small allocations on a very efficient computer, but through yet another mechanism.

We propose a new paradigm that develops a strategy for building, maintaining and operating a national computational and data-intensive resource infrastructure as well as the critical services that connect them. We call this the National Compute and Data Infrastructure, or NCDI. Paradigms around which to develop a path forward are available to research communities through such activities as the MREFC process and the National Center for Atmospheric Research.
Although modifying these approaches to be conducive to Foundation-wide activities represents a challenge, the paradigms are well understood and tested. [2] It is clear that such an approach is needed by all of NSF's Computational and Data-enabled Science and Engineering (CDS&E) communities and by numerous MREFC projects. Indeed, this approach needs to be strongly driven by and coupled to Grand Challenge communities addressing complex interdisciplinary problems. To be effective, this approach should encompass a small number of large-scale computing facilities to support the most compute- and data-intensive research, coupled with a commensurate number of mid-range facilities serving the diverse needs of multiple research communities, as well as national-scale data repositories for multiple communities, high-speed networks to connect to them, and additional services that integrate these capabilities, including integration with the growing environments on university campuses. Further, many research universities are struggling to define and support campus-level resources and are also going regional. Having NSF lead through a coherent approach would have direct impact by (a) providing a blueprint for regional efforts to consider and (b) pooling funds to increase leverage.

[2] Note that the MREFC vehicle is used for multi-agency initiatives, e.g., between NSF and DOE, where MREFC and CD processes have to be intertwined. So this could in principle provide a mechanism for, say, NSF-NIH-DOE cooperation in providing a national computing and data infrastructure.

It is very important to have both large-scale and mid-range facilities around computing, data and instrumentation. There are leadership-class problems (some of Klaus Schulten's recent projects on Blue Waters are good examples) that push the envelope of what can be done, but there are also numerous mid-range problems that are mainly limited by the availability of mid-range resources (e.g., in silico drug scoring by free energy methods, or conformational searches for discovering druggable sites in proteins). Industry is increasingly interested in this latter type of mid-range computation, but is waiting to see whether it is worth the investment. This research needs to be sustained from the academic side to become fully realized; otherwise it will probably fall through or be delayed by a decade.

The criticality of a balanced portfolio of resources and services (compute, data and human!) cannot be overstated. There is considerable diversity among disciplines regarding their requirements, and possibly even their understanding of their own requirements. Focusing too strongly on any segment of this portfolio often results in constraining the success of significant segments of the research community. For example, focusing too strongly on leadership-class facilities often results in consolidation not only of exposure to the facilities but also of the success of research groups. In some cases this can have very positive side effects, particularly as data repositories are made available, but in many other important respects it simply shuts out the vast majority of researchers (particularly those at an early stage). In many fields, the vast majority of research is conducted either at the "medium" scale or on data repositories from leadership-class simulations (which are few and far between, and for the most part located in Europe).
Though this is understood within many CDS&E communities, it needs to be made clear that high-end computing and big data are, at least for many disciplines, mutually reinforcing. For example, in climate modeling the standard workflow is to consume substantial HPC resources in a batch compute mode and store the enormous output, primarily on tape, for later analysis and synthesis. The bigger the HPC resource, the more data is generated, archived and retrieved multiple times by multiple researchers for a variety of different types of post-production interrogation. The larger and more complex the data, the more HPC resources are required for the analysis. This feedback loop represents an additional argument for a coherent cyberinfrastructure addressing computing, data and the critical support services associated with delivering these capabilities to the nation's researchers.

The proposed NCDI would combine some aspects of the current NSF advanced computing investments (Track 1, Track 2, and XSEDE), coupled with new data and networking services, into a major national investment that would integrate these capabilities distributed across a number of sites. The proposed NCDI embraces this diversity, driven by community needs, as a core value. The next frontier in science is to produce new CI-based scientific environments with emergent properties that are capable of making science a true collective endeavor (from data generation to analysis and publication). In some areas of biology, the speed of data acquisition outpaces computational analysis. In other areas, computation generates a wealth of electronic information waiting for experimental validation. This dichotomy requires the development of different models of CI management and information sharing.

Discussion: The NCDI is envisioned as a critical outcome of earlier CIF21 and other NSF programs that will unite major thrusts into an integrated framework benefiting researchers in fields all across NSF, ranging from those working within the long tail of science to those at the extreme high end, and from individual university campuses to the largest instruments being developed and deployed internationally. The thrusts are CI programs that are part of CIF21 relating to (a) computational, (b) data, and (c) networking infrastructure, with (d) digital services and (e) learning and workforce development programs, (f) all driven by grand challenge communities [3] that connect to activities on campuses, in research communities and in existing and future MREFC projects. The science-driven community efforts in each of these areas are maturing and are increasingly dependent on CI. The NCDI would provide unique opportunities for researchers to define and use a highly responsive, full-scale CI that reaches from instruments, to computational, data stewardship and analysis capabilities, to publications, in support of long-term, highly innovative research programs. The development of community standards and approaches during the MREFC design process (an ongoing activity in the envisioned NCDI) will accelerate research programs that will come to fruition during the construction and operation of the full-scale infrastructure.

[3] Indeed, under the auspices of the ACCI, six task forces produced reports full of recommendations on each topic!
With an NCDI in place, individual MREFC projects, each of which currently develops its own (often duplicative and typically less capable) CI, would be able to leverage this long-term, stable environment to deploy the needed computing, data stewardship and analysis facilities, and to more deeply integrate the services of NCDI with other MREFCs, enabling funding to be focused on scientific and engineering issues and better supporting complex problem solving for disciplinary, multidisciplinary and interdisciplinary research. Where appropriate, other MREFC projects would be able to plan longer-term investments that explicitly build on the NCDI. Previously, this type of in-depth CI knowledge was available only on an ad hoc basis, with limited cross-domain expertise. NCDI would offer access to comprehensive intellectual resources previously unavailable to the MREFC cadre of participants. This would more naturally enable large-scale data services to be co-located with large-scale computing facilities, as needed in an era of big data, while better serving communities by leveraging commonly needed infrastructure, staff expertise, and services. The NCDI will enable a new level of scientific inquiry and effectively support complex problems, such as multi-messenger astronomy, that require the participation of multiple MREFCs and compute and data facilities. NCDI would become a national resource, working cooperatively with the university community, for nurturing, developing, and training the workforce of the 21st century. Like the computational facilities, workforce development is fragmented; NCDI could change that in a significant way.

An examination of the processes employed in building and operating research infrastructure and services, such as the MREFC process, is instructive with respect to what will work for building and sustaining NCDI and what needs to be modified from these examples. For example, as the component technologies evolve with characteristic times much shorter than the standard MREFC cycle, a set of inter-related design-build/integrate-deploy processes must run concurrently with the operation of facilities and services to provide a robust, reliable and secure NCDI that meets the production needs of the nation's CDS&E community and upon which other MREFCs can also build. Although long-term infrastructure is a continuing effort that outlasts normal NSF Research and Related Activities (R&RA) awards, it is also important to encourage innovation and some level of competition via the proposal process. To ensure the effectiveness of resources, NSF could consider options that would allow for a series of continuing awards, within the MREFC process, based on merit, successful operation, and the ability to meet the various communities' needs. For example, previous NSF programs have used rolling 5-year awards, with "continue," "phase out" or "re-compete management" decisions made through periodic reviews. Another example is the fifty years of sustained computational support offered to the atmospheric and related communities at the National Center for Atmospheric Research, which was strengthened and made more responsive to community needs by periodic reviews on five-year cycles and re-competition of management. Processes for assessing the NCDI and the evolution of community computing needs should be a part of these review processes.
Additional activities, such as campus bridging, algorithm development, software institutes, related disciplinary research, and so on, could be supported separately, as they are at present, but could also be made more effective and collaborative by being integrated into NCDI. These activities should be tightly linked to NCDI. The NCDI will acknowledge/accommodate/embrace the issues facing university-based researchers that might be addressed by campus bridging and software activities. For example, full discoverability and interoperability of all relevant data for a given investigation, whether they reside in the cloud, in a formal data repository or on a researcher's laptop, has been a holy grail for many disciplines, but has been achieved at only a very limited and rudimentary level by only a few. Further, interdisciplinary interoperability remains a dream, despite progress in some areas (e.g., EarthCube). The NCDI strategy will provide hooks and incentives for the development of data and algorithmic interoperability. It is understood that the granularity of these activities varies greatly, so NCDI cannot solve these problems centrally.

It is worth noting here that the spirit and intent of an MREFC was initially to support infrastructure and services, not the research that uses them. For the MREFC approach to work here, significant agreements will have to be reached on what constitutes a shared infrastructure for research: a common denominator for a broad range of applications. A simple collection of domain-specific requirements will not constitute the definition of an MREFC; it must integrate distributed facilities to serve broad needs. As noted above, there is considerable redundancy in NSF's CI investments, which constitutes not only a loss of available financial resources but also lost opportunity costs and, in the worst case, balkanization. Striking the balance that most cost-effectively serves diverse needs will be critical.

In this model, NCDI construction funds could come from the MREFC allocation or another major budget initiative, with a profile that is estimated to be in the range of $250-300M per year (of order $3B over a decade). Assuming the NCDI would be managed by ACI, the operating funds would come from CISE, with contributions from other directorates based on their specific needs. This would further allow ACI (and other NSF directorate programs) to better support its portfolio of services, software development, CDS&E programs, and so on.

Advantages:

- The NCDI will provide the research community with a clear roadmap for planning a broad range of CDS&E research activities, from individual investigations to major facility projects;

- The NCDI will provide a means to coalesce community engagement around computing and data infrastructure;

- The NCDI will provide an integrated approach to data-intensive (Big Data) science. Big Data services (e.g., astronomy, environmental science, data pipelines) will be more naturally and effectively supported and can be co-located with simulation services at major computing sites. This will drive a convergence of big data and big compute activities at common sites, which will be required as data volumes grow;
- The longer-term stability of the NCDI will enable the retention and building of world-class technical staff, provide an intellectual commons for communities to gain and share experience, and serve as a linchpin for the university community in building the workforce of the 21st century, which is so critical to success in the rapidly evolving computational and data sciences. This is especially important for emerging data services, at both national and campus levels, which will need a framework that lives long past any individual grant;

- The NCDI will provide a national structure that other MREFC projects can depend on being in place, enabling them to plan for and take advantage of the synergies provided by leveraging such an infrastructure, leading to significant cost efficiencies;

- Exploiting such synergies, NCDI could reduce the cost of providing the major compute and data resources and services needed for scientific progress, resulting in savings to both CISE/ACI and NSF;

- The NCDI will enable NSF's MREFC projects to focus more on science and engineering issues, since they will not have to completely reinvent their own CI and can leverage the investments in the NCDI for their own purposes;

- The NCDI will enable NSF projects to more fully exploit synergies that will be needed to better support interdisciplinary research trends (e.g., multi-messenger astronomy, which will require unified or coordinated computing and data services from multiple MREFCs and communities, such as LIGO, IceCube, LSST, DES, and computational astrophysics);

- The NCDI will support interagency cooperation around MREFC projects (e.g., with DOE and/or NIH) and can be leveraged for deeper cooperation.

Disadvantages:

A new process needs to be developed, or current processes need to be modified, in a way that can draw on the success of previous efforts. For example, the MREFC process has been designed around a model of building a single instrument to be used by a single community for a decade or more, and has a very heavy, very long process of community engagement, conceptual design, preliminary design, final design, and NSB approval. This process would require modification to accommodate the fast-paced evolution of computing technologies; such a process is already being developed by the XSEDE project.

Significant, urgent work will be needed among all of the stakeholders in the NCDI (research communities, NSF, OSTP and others). What often takes more than a decade would ideally be compressed into just a few years, and would have to be done at many levels. This would require the dedication of a number of people in the community over a sustained period of time to define and implement the inter-related, concurrent processes necessary at all of these levels.

On a related note, if, in fact, the NCDI MREFC were to be managed by ACI, there is significant concern in the community about ACI's current organizational location within CISE. The transition from OCI, in the Office of the Director, to ACI, within CISE, has induced a perception that the current situation with NSF computing has a lot to do with institutional resistance (on the part of NSF) to funding infrastructure.
Moving away from a cross-cutting OCI toward relocation as ACI within CISE is seen as a signal that such cross-cutting, coherent cyberinfrastructure planning and investment is of diminishing importance to the Foundation, to the detriment of NSF-supported communities as a whole.

A conceptual plan, whether it is a new approach or a modification of existing paradigms, needs to be developed as soon as possible. This would need to involve numerous stakeholders and communities, including existing MREFC projects, to explore how such an environment would operate. It is likely that such an approach would lead to a significant restructuring of existing CI centers, university stakeholders, etc. Many assumptions above would need to be quantitatively validated, and additional issues would need to be considered to determine the viability of this approach. In particular, the need for and benefits of a national CI infrastructure would need to be considered in depth, taking into account not only its scope and cost but also its impact on US science and national competitiveness, training and workforce development, the role of NSF with respect to other agencies, and more. This concept paper merely aims to raise awareness of the possibility of such an approach and to highlight issues that would need to be considered in exploring its viability.

Contributors from the Community

Steering Committee

The effort of developing this concept is guided by a steering committee consisting of:

Dan Atkins: Professor of Information, School of Information; Professor of Electrical Engineering and Computer Science, College of Engineering, University of Michigan
Alan Blatecky: Visiting Fellow, RTI International
Thom Dunning: Co-director, Northwest Institute for Advanced Computing; Affiliate Professor of Chemistry, University of Washington
Irene Fonseca: Mellon College of Science University Professor of Mathematics; Director, Center for Nonlinear Analysis; Department of Mathematical Sciences, Carnegie Mellon University
Sue Fratkin: Washington liaison for the Coalition for Academic Scientific Computation
Sharon Glotzer: Stuart W. Churchill Collegiate Professor of Chemical Engineering; Professor of Material Science & Engineering, Macromolecular Science and Engineering, and Physics, University of Michigan
Cliff Jacobs: Clifford A. Jacobs Consulting, LLC
Gwen Jacobs: Director of Cyberinfrastructure, University of Hawaii
Thomas H. Jordan: Director, Southern California Earthquake Center; University Professor & W. M. Keck Professor of Earth Sciences, Department of Earth Sciences, University of Southern California
Steve Kahn: Director of the Large Synoptic Survey Telescope; Cassius Lamb Kirk Professor of Physics, Stanford University
Bill Kramer: Director, @scale Program Office, National Center for Supercomputing Applications; PI, Blue Waters
Anthony Mezzacappa: Newton W. and Wilma C. Thomas Endowed Chair, Department of Physics and Astronomy, University of Tennessee, Knoxville; Director, Joint Institute for Computational Sciences, Oak Ridge National Laboratory
Mike Norman: Director, San Diego Supercomputer Center; Professor of Physics, UC San Diego
David Reitze: Executive Director of the LIGO Laboratory, Caltech; Professor of Physics, University of Florida
Ralph Roskies: Co-Director, Pittsburgh Supercomputing Center; Professor of Physics, University of Pittsburgh
Ed Seidel: Director, NCSA; Founder Professor of Physics, University of Illinois at Urbana-Champaign
Dan Stanzione: Executive Director, Texas Advanced Computing Center
Rick Stevens: Associate Lab Director, Computing, Environment and Life Sciences, Argonne National Laboratory; Professor of Computer Science, University of Chicago
John Towns: Executive Director, National Center for Supercomputing Applications; PI and Project Director, XSEDE
Paul Woodward: Professor, Minnesota Institute for Astrophysics; Director, Laboratory for Computational Science and Engineering

Contributors: Other contributors include:

Aleksei Aksimentiev: Associate Professor of Physics, University of Illinois at Urbana-Champaign
Scott L. Althaus: Charles J. and Ethel S. Merriam Professor of Political Science; Professor of Communication; Director, Cline Center for Democracy, University of Illinois at Urbana-Champaign
Stuart Anderson: Research Manager, LIGO
Dinshaw Balsara: Associate Professor, Astrophysics; Concurrent Associate Professor, Department of Applied and Computational Mathematics and Statistics; Department of Physics, University of Notre Dame
J. Bernholc: Drexel Professor of Physics; Director, Center for High Performance Simulation, North Carolina State University
Daniel J. Bodony: Blue Waters Associate Professor of Aerospace Engineering, University of Illinois at Urbana-Champaign
Dr. Gustavo Caetano-Anollés: Professor of Bioinformatics, University Scholar, Department of Crop Sciences, University of Illinois at Urbana-Champaign
Santanu Chaudhuri, PhD: Director, Accelerated Materials Research, Illinois Applied Research Institute, College of Engineering, University of Illinois at Urbana-Champaign
Dr. Larry Di Girolamo: Blue Waters Professor, Department of Atmospheric Sciences, University of Illinois at Urbana-Champaign
Dr. David A. Dixon: Robert Ramsay Chair, Department of Chemistry, The University of Alabama
Said Elghobashi: Professor, Mechanical and Aerospace Engineering, University of California, Irvine
Charles F. Gammie: Professor, Center for Theoretical Astrophysics, Astronomy and Physics Departments, University of Illinois at Urbana-Champaign
Claudio Grosman: Professor, School of Molecular and Cellular Biology, University of Illinois at Urbana-Champaign
Sharon Hammes-Schiffer: Swanlund Professor of Chemistry, Department of Chemistry, University of Illinois at Urbana-Champaign
Victor Hazlewood: Chief Operating Officer, Joint Institute for Computational Sciences, University of Tennessee
Curtis W. Hillegas, Ph.D.: Associate CIO for Research Computing, Princeton University
Eric Jakobsson: Director, National Center for Biomimetic Nanoconductors; Professor Emeritus of Biochemistry, School of Molecular and Cellular Biology, University of Illinois
Peter Kasson: Associate Professor, Departments of Molecular Physiology and Biomedical Engineering, University of Virginia
Jim Kinter: Director, Center for Ocean-Land-Atmosphere Studies; Professor, Climate Dynamics, Dept. of Atmospheric, Oceanic & Earth Sciences, George Mason University
Martin Ostoja-Starzewski: Professor, Department of Mechanical Science and Engineering, University of Illinois at Urbana-Champaign
Christian D. Ott: Professor of Theoretical Astrophysics, California Institute of Technology
Vijay Pande: Director, Folding@home Distributed Computing Project; Director, Program in Biophysics; Camille and Henry Dreyfus Professor of Chemistry and, by courtesy, of Structural Biology and of Computer Science, Stanford University
Nikolai V. Pogorelov: Professor, Department of Space Science, University of Alabama
Tom Quinn: Professor, Department of Astronomy, University of Washington
Patrick M. Reed: Professor, School of Civil and Environmental Engineering; Faculty Fellow, Atkinson Center for a Sustainable Future, Cornell University
Benoit Roux: Department of Biochemistry and Molecular Biology and Department of Chemistry, University of Chicago
Andre Schleife: Blue Waters Assistant Professor, Department of Materials Science and Engineering, University of Illinois at Urbana-Champaign
Barry Schneider: Staff Member, Applied and Computational Mathematics, National Institute of Standards and Technology (NIST)
John Shalf: Chief Technology Officer, NERSC
Ashish Sharma: Environmental Change Initiative, University of Notre Dame
Bob Stein: Professor, Physics and Astronomy Department, Michigan State University
Brian G. Thomas: C.J. Gauthier Professor of Mechanical Engineering, Department of Mechanical Science & Engineering, University of Illinois at Urbana-Champaign
Jorge Vinals: Director, Minnesota Supercomputing Institute, University of Minnesota
Gregory A. Voth, Ph.D.: Haig P. Papazian Distinguished Service Professor, Department of Chemistry, The University of Chicago
Shaowen Wang: Director, CyberInfrastructure and Geospatial Information (CIGI) Laboratory; Professor and Centennial Scholar of Geography and Geographic Information Science, School of Earth, Society, and Environment, University of Illinois
Don Wuebbles: Harry E. Preble Endowed Professor, Department of Atmospheric Sciences, University of Illinois at Urbana-Champaign
P. K. Yeung: Professor, School of Aerospace Engineering, Georgia Institute of Technology