A TeraGrid Extension: Bridging to XD
Submitted to the National Science Foundation as an invited proposal.
Principal Investigator
Ian Foster
Director, Computation Institute
University of Chicago
5640 S. Ellis Ave, Room 405
Chicago, IL 60637
Tel: (630) 252-4619
Email: foster@mcs.anl.gov
Co-Principal Investigators
John Towns
TeraGrid Forum Chair
Director, Persistent Infrastructure,
NCSA/Illinois
Matthew Heinzel
Deputy Director, TeraGrid GIG
U Chicago
Senior Personnel
Phil Andrews
Project Director, NICS/U Tennessee
Rich Loft
Director of Technology Development, CISL/NCAR
Jay Boisseau
Director, TACC/U Texas Austin
Richard Moore
Deputy Director, SDSC/UCSD
John Cobb
???, ORNL
Ralph Roskies
Co-Scientific Director, PSC
Nick Karonis
Professor and Acting Chair, Department of
Computer Science, NIU
Carol X. Song
Senior Research Scientist, RCAC/Purdue
Daniel S. Katz
Director for Cyberinfrastructure
Development, CCT/LSU
Craig Stewart
Associate Dean, Research Technologies, Indiana
B Project Summary
The TeraGrid is an advanced, nationally distributed, open cyberinfrastructure (CI) comprised of
supercomputing, storage, analysis, and visualization systems, data services, and science gateways,
connected by high-bandwidth networks, integrated by coordinated policies and operations, and
supported by computing and technology experts, that enables and supports leading-edge scientific
discovery and promotes science and technology education. TeraGrid's three-part mission is
summarized as “deep, wide, and open”: supporting the most advanced computational science in
multiple domains; expanding usage and impact, and empowering new communities; and providing
resources and services that can be extended to a broader CI, enabling researchers and educators to
use TeraGrid resources in concert with personal, local/campus, and other large-scale CI resources.
TeraGrid has enabled numerous scientific achievements in almost all fields of science and
engineering. The project is driven by user projects, which are peer reviewed for their
potential transformative impact enabled by our advanced CI resources and support. Through the
coordinated capabilities of its staff, resources, and services, TeraGrid enables deep
impact through cutting-edge, even transformative science and engineering, by experts and expert
teams of users making highly-skilled use of TeraGrid resources. TeraGrid also supports a wider
community of much larger, domain-focused groups of users that may not possess specific high-performance computing skills but who are addressing important scientific research and education
problems.
Advanced support joins researchers with TeraGrid experts to increase efficiency and productivity,
define best practices, and create a vanguard of early adopters of new capabilities. Advanced
scheduling and metascheduling services support use cases that require or benefit from cross-site
capabilities. Advanced data services provide a consistent, high-level approach to multi-site data
management and analysis. Support for science gateways provides community-designed interfaces
to TeraGrid resources and extends access to data collections, community collaboration tools, and
visualization capabilities to a much wider audience of users.
Transformative science and engineering on the TeraGrid also depends on its resources working in
concert, which requires a coordinated user support system, centralized mechanisms for user access
and information, a common allocations process and allocations management, and a coordinated
user environment. Underlying this user support environment, TeraGrid maintains a robust, centrally
managed infrastructure for networking, security and authentication, and operational services.
Intellectual Merit: TeraGrid will continue to support several resources into 2011 under this proposal.
This includes the Track 2 systems; Pople, a shared memory system particularly useful to a number
of newer users; and four IA32-64 clusters. The Track 2 systems provide petascale computing
capabilities for very large simulations; the IA32-64 clusters provide nearly 270 TF of computing
power, to support large-scale interactive, on-demand, and science gateway use. Allocated as a
single resource, these latter systems will permit metascheduling and advanced reservations; allow
high-throughput, Open Science Grid-style jobs; enable exploration of interoperability and technology
sharing; and provide a transition platform for users coming from university- or departmental-level
resources. TeraGrid will also support unique compute platforms and massive storage systems, and
will integrate new systems from additional OCI awards.
Broader Impacts: TeraGrid will also pursue vigorous efforts in training, education, and outreach.
Through these efforts, TeraGrid will engage and retain larger and more diverse communities in
advancing scientific discovery. TeraGrid will engage under-represented communities, in which
under-representation includes race, gender, disability, discipline, and institution, and continue to
build strong partnerships in order to offer the best possible HPC learning and workforce
development programs and increase the number of well-prepared STEM researchers and educators.
C Table of Contents
A TeraGrid Extension: Bridging to XD
B Project Summary
C Table of Contents
D Project Description
D.1 Introduction
D.1.1 TeraGrid Organization and Management
D.1.2 Advisory Groups
D.2 TeraGrid Science
D.2.1 Geosciences: Understanding Earthquakes
D.2.2 Nanoscale Electronic Structures/nanoHUB – PI Gerhard Klimeck, Purdue University
D.2.3 Atmospheric Sciences (LEAD), PI Kelvin Droegemeier, University of Oklahoma
D.2.4 Astronomy – PI Mike Norman, UCSD, Tom Quinn, U. Washington
D.2.5 Biochemistry/Molecular Dynamics – Multiple PIs (Adrian Roitberg, U. Florida, Tom Cheatham, U. Utah, Greg Voth, U. Utah, Klaus Schulten, UIUC, Carlos Simmerling, SUNY Stony Brook, etc.)
D.2.6 Biosciences (Robetta) – PI David Baker, University of Washington
D.2.7 Structural Engineering – Multiple NEES PIs
D.2.8 Social Sciences (SIDGrid) – PI Rick Stevens, University of Chicago; TeraGrid Allocation PI, Sarah Kenny, University of Chicago
D.2.9 Biosciences – PI George Karniadakis, Brown University
D.2.10 Neutron Science – PI John Cobb, ORNL
D.2.11 GIScience – PI Shaowen Wang, University of Illinois
D.2.12 Computer Science: Solving Large Sequential Two-person Zero-sum Games of Imperfect Information – PI Tuomas Sandholm, Carnegie Mellon University
D.2.13 Chemistry (GridChem) – Project PI John Connolly, University of Kentucky; TeraGrid Allocation PI Sudhakar Pamidighantam, NCSA
D.3 Advanced Capabilities Enabling Science
D.3.1 Advanced User Support
D.3.2 Advanced Scheduling and Metascheduling
D.3.3 Advanced Data Services
D.3.4 Science Gateways
D.3.5 Gateway Targeted Support Activities
D.4 Supporting the User Community
D.4.1 User Information and Access Environment
D.4.2 User Authentication and Allocations
D.4.3 Frontline User Support
D.4.4 Training
D.5 Integrated Operations of TeraGrid
D.5.1 Packaging and maintaining CTSS Kits
D.5.2 Information Services
D.5.3 Supporting Software Integration and Information Services
D.5.4 Networking
D.5.5 Security
D.5.6 Quality Assurance
D.5.7 Common User Environment
D.5.8 Operational Services
D.5.9 RP Operations
D.6 Education, Outreach, Collaboration, and Partnerships
D.6.1 Education
D.6.2 Outreach
D.6.3 Enhancing Diversity
D.6.4 External Relations (ER)
D.6.5 Collaborations and Partnerships
D.7 Project Management and Leadership
D.7.1 Project and Financial Management
D.7.2 Leadership
D Project Description
D.1 Introduction
TeraGrid's three-part mission is to support the most advanced computational science in multiple
domains, to empower new communities of users, and to provide resources and services that can be
extended to a broader cyberinfrastructure.
The TeraGrid is an advanced, nationally distributed, open cyberinfrastructure comprised of
supercomputing, storage, and visualization systems, data collections, and science gateways,
integrated by software services and high bandwidth networks, coordinated through common policies
and operations, and supported by computing and technology experts, that enables and supports
leading-edge scientific discovery and promotes science and technology education.
Accomplishing this vision is crucial for the advancement of many areas of scientific discovery,
ensuring US scientific leadership, and increasingly, for addressing critical societal issues. TeraGrid
achieves its purpose and fulfills its mission through a three-pronged focus:
deep: ensure profound impact for the most experienced users, through provision of the most
powerful computational resources and advanced computational expertise;
wide: enable scientific discovery by broader and more diverse communities of researchers
and educators who can leverage TeraGrid’s high-end resources, portals and science
gateways; and
open: facilitate simple integration with the broader cyberinfrastructure through the use of
open interfaces, partnerships with other grids, and collaborations with other science research
groups delivering and supporting open cyberinfrastructure facilities.
The TeraGrid’s deep goal is to enable transformational scientific discovery through leadership
in HPC for high-end computational research. The TeraGrid enables high‐end science utilizing
powerful supercomputing systems and high‐end resources for the data analysis, visualization,
management, storage, and transfer capabilities required by large‐scale simulation and analysis. All
of this requires an increasingly diverse set of leadership‐class resources and services, and deep
intellectual expertise.
The TeraGrid’s wide goal is to increase the overall impact of TeraGrid’s advanced
computational resources to larger and more diverse research and education communities
through user interfaces and portals, domain specific gateways, and enhanced support that
facilitate scientific discovery by people without requiring them to become high performance
computing experts. The complexity of using TeraGrid’s high‐end resources continues to grow as
systems increase in scale and evolve with new technologies. TeraGrid broadens the scientific user
base of its resources via the development and support of simple and powerful interfaces, ranging
from common user environments to Science Gateways and portals, through more focused outreach
and collaboration with science domain research groups, and by educational and outreach efforts that
will help inspire and educate the next generation of America’s leading‐edge scientists.
TeraGrid’s open goal is twofold: to ensure the expansibility and future viability of the TeraGrid
by using open standards and interfaces; and to ensure that the TeraGrid is interoperable with
other, open-standards-based cyberinfrastructure facilities. TeraGrid must enable its high-end
cyberinfrastructure to be more accessible from, and even integrated with, cyberinfrastructure of all
scales. That includes not just other grids, but also campus cyberinfrastructures and even individual
researcher labs/systems. The TeraGrid leads the community forward by providing an open
infrastructure that enables, simplifies, and even encourages scaling out to its leadership-class
resources by establishing models in which computational resources can be integrated both for
current and new modalities of science. This openness includes interfaces and APIs, but goes further
to include appropriate policies, support, training, and community building.
This proposal extends the TeraGrid program of enabling transformative scientific research for
an additional 12-month period, from April 2010 to March 2011, bridging to the revised start date of April
2011 for “TeraGrid Phase III: eXtreme Digital Resources for Science and Engineering” (XD). This
includes many of the integrative activities of the Grid Infrastructure Group (GIG) as well as an
extension of the highest-value resources not separately funded under the Track 2 or Track 1
solicitations. TeraGrid will operate resources that would otherwise not be available to users, such as
Pople, a shared memory system that is particularly useful to many newer users, in areas such as
game theory, web analytics, machine learning, etc., as well as being a key part of the workflow of
several more established applications; and IA32-64 clusters, providing nearly 270 TF of computing
power, that will support large-scale interactive, on-demand, and science gateway use. Allocated as a
single resource, these IA32-64 clusters will permit metascheduling and advanced reservations; allow
high-throughput, Open Science Grid-style jobs; enable exploration of interoperability and technology
sharing; and provide a transition platform for users coming from university- or departmental-level
resources. TeraGrid will provide networking both within the TeraGrid and to other resources, such as
on campuses or in other cyberinfrastructures. It will provide common grid software to enable easy
use of multiple TeraGrid and non-TeraGrid resources, including expanding wide-area filesystems. It
will provide and enhance the TeraGrid User Portal (enabling single-sign-on, common access to
TeraGrid resources and information), services such as metascheduling (automated selection of
specific resources), co-scheduling (use of multiple resources for a single job), reservations (use of a
resource at a specific time), workflows (use of single or multiple resources for a set of jobs), and
gateways (interfaces to resources that hide complex features or usage patterns, or tie TeraGrid
resources to additional datasets and capabilities). It will provide an extensive set of user support
services to enable the scientific community to make best use of resources in creating transformative
research. It will provide a vigorous education, training and outreach program to raise awareness of
TeraGrid’s potential for scientific discovery and to increase proficiency in exploiting that potential.
D.1.1 TeraGrid Organization and Management
The coordination and management of the TeraGrid partners and resources requires organizational
and collaboration mechanisms that are different from a classic organizational structure. The existing
structure and practice has evolved from many years of collaboration, many predating the TeraGrid.
The TeraGrid team is comprised of eleven resource providers (RPs) and the Grid Infrastructure Group (GIG) (Figure 1). The GIG provides user support coordination, software integration, operations, management and planning. GIG area directors (ADs) direct project activities involving staff from multiple partner sites, coordinating and maintaining TeraGrid central services.

[Figure 1: TeraGrid Facility Partner Institutions]

TeraGrid policy and governance rests with the TeraGrid Forum (TG Forum), comprised of RP PIs and the GIG PI. The TG Forum is led by a Chairperson—a GIG-funded position—filled by Towns this past year as a result of an election process within the TG Forum. This position facilitates the functioning of the TG Forum on behalf of the overall collaboration.
TeraGrid management and planning is coordinated via a series of regular meetings, including
weekly project-wide Round Table meetings (held via Access Grid), weekly TeraGrid AD and
biweekly TG Forum teleconferences, and quarterly face-to-face internal project meetings. This past
year saw the first execution of a fully integrated annual planning process developed over the past
several years. Coordination of project staff in terms of detailed technical analysis and planning is
done through two types of technical groups: working groups and Requirement Analysis Teams
(RATs). Working groups are persistent coordination teams and in general have participants from all
RP sites; RATs are short-term (6-10 weeks) focused planning teams that are typically small, with
experts from a subset of both RP and GIG. Both groups make recommendations to the TG Forum
or, as appropriate, to the GIG management team.
D.1.2 Advisory Groups
The NSF/TeraGrid Science Advisory Board (SAB) consists of 14 people from a wide spectrum of
disciplines. The SAB provides advice to the TG Forum and the NSF TeraGrid Program Officer on a
wide spectrum of scientific and technical activities within or involving the TeraGrid. The SAB
considers the progress and quality of these activities, their balance, and the TeraGrid’s interactions
with the national and international research community, with the ultimate aim of building a more
unified TeraGrid and enhancing the progress of those aspects of academic research and education
that require high-end computing. The SAB advises on future TeraGrid plans, identifies synergies
between TeraGrid activities and related efforts in other agencies, promotes the TeraGrid mission
and its activities in the national and international community, and provides help in building and
expanding the TeraGrid community.
The SAB members are: Chair: James Kinter, Director of Center for Ocean-Land-Atmosphere
Studies; Bill Feiereisen, New Mexico; Thomas Cheatham, Utah; Gwen Jacobs, Montana State; Dave
Kaeli, Northeastern; Michael Macy, Cornell; Phil Maechling, USC; Alex Ramirez, HACU; Nora
Sabelli, SRI; Pat Teller, UTexas, El Paso; P. K. Yeung, Georgia Tech; Cathy Wu, Georgetown; Eric
Chassignet, Florida State; and Luis Lehner, Louisiana State.
D.2 TeraGrid Science
The TeraGrid supports, enables, and accelerates scientific research and education that requires the
high-end capabilities offered by a national cyberinfrastructure of high-end resources, services and
expert support. This comprehensive cyberinfrastructure enables many usage modalities and levels
of users with emphasis on breakthrough, even transformative, results for all projects. The usage of
TeraGrid and the resulting impact can be categorized according to the TeraGrid mission focusing
principles: deep, wide, and open. The TeraGrid is oriented towards user-driven projects, with each
project being led by a PI who applies for an allocation of TeraGrid resources. A project can consist of
a set of users identified by the PI, or a community represented by the PI. The TeraGrid’s deep focus
represents projects that are usually small, established groups of expert users making highly-skilled
use of TeraGrid resources, and the TeraGrid’s wide focus represents projects that are either new or
established science communities using TeraGrid resources for both research and education, without
requiring specific high-performance computing skills, even for users who are domain science
experts.
In both cases, various capabilities of the TeraGrid’s open focus are often needed, such as
networking (both within the TeraGrid and to other resources, such as on campuses or in other
cyberinfrastructures), common grid software (enabling easy use of multiple TeraGrid and non-TeraGrid resources), the TeraGrid user portal, services such as metascheduling, co-scheduling,
reservations, workflows, and gateways. On the other hand, a number of the most experienced
TeraGrid users simply want low-overhead access to a single machine that best matches their needs.
Even in this category, the variety of architectures of the TeraGrid enables applications that would not
run well on simple clusters, including those that require the lowest latency and microkernel operating
systems to scale well, and those that require large amounts of shared memory. While the Track 2
systems (Ranger and Kraken) will continue to be supported even if this proposal is not funded (albeit
more individually), neither the four terascale IA32-64 systems currently in heavy use nor the shared
memory system (Pople) will continue to be provided to the national user community without this
proposed work. The former systems have great potential for enabling much greater interactive usage
and science gateway support, and these latter systems are particularly useful to a number of newer
users, in areas such as game theory, web analytics, machine learning, etc., as well as being a key
part of the workflow in a number of more established applications.
This section describes a set of specific examples from the many disciplinary successes in which
TeraGrid’s deep, wide, and open mission continues to enable cutting-edge and potentially
transformative impact in science and engineering. Each lists one or more specific PIs/projects, but
they are intended to represent more general domains.
D.2.1 Geosciences: Understanding Earthquakes – SCEC, PI Tom Jordan, USC
The Southern California Earthquake Center (SCEC) is an inter-disciplinary research group with over
600 geoscientists, computational scientists and computer scientists from ~60 institutions including
the US Geological Survey. Its goal is to develop an understanding of earthquakes, and to mitigate
risks of loss of life and property damage. SCEC is an exemplar for integrated use of the distributed
resources of TeraGrid to achieve transformative geophysical science results. SCEC simulations
consist of highly scalable runs, mid-range core count runs and pleasingly parallel small-core count
runs. They require high bandwidth data transfer and large storage for post processing and data
sharing. These science results directly impact everyday life by contributing to new building codes,
emergency planning, etc., and could potentially save billions of dollars through proactive planning
and construction.
For high core-count runs, SCEC researchers use highly scalable codes (AWM-Olsen, Hercules,
AWP-Graves) on many tens of thousands of processors of the largest TeraGrid systems (TACC’s
Ranger and NICS’s Kraken) to improve the resolution of dynamic rupture simulations by an order of
magnitude and to study the impact of geophysical parameters. These codes are also used to run
high frequency (1.0 Hz currently and higher in the future) wave propagation simulations of
earthquakes on systems at TACC and PSC. Using different codes on different machines and
observing the match between the ground motions projected by the simulations is used to validate the
results. Systems are also chosen based on memory requirements for mesh and source partitioning,
which require large memory machines such as PSC’s Pople.
For mid-range core count runs, SCEC researchers are carrying out full 3D tomography (called
Tera3D) data intensive runs on NCSA’s Abe and other clusters using hundreds to thousands of
cores. SCEC researchers are also studying “inverse” problems that require running hundreds of
simulations, while perturbing the ground structure model and comparing against recorded surface
data, requiring use of multiple platforms to obtain results in a timely manner.
Another important aspect of SCEC research is in the CyberShake project, which uses 3D waveform
modeling (Tera3D) to calculate probabilistic seismic hazard (PSHA) curves for sites in Southern
California. A PSHA map provides estimates of the probability that the ground motion at a site will
exceed some intensity measure over a given time period. For each geographical point of interest,
two large-scale MPI calculations and approximately 840,000 data-intensive pleasingly parallel post-processing jobs are required. The complexity and scale of these calculations have impeded
production of detailed PSHA maps; however, through the integration of hardware, software and
people in a gateway-like framework, these techniques can now be used to produce large numbers of
research results. Grid-based workflow tools are used to manage these hundreds of thousands of
jobs on multiple TeraGrid clusters. Over 1 million CPU hours were consumed in 2008 through this
usage model.
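The shape of this workload can be pictured as a simple dependency graph: for one site, two large MPI jobs whose outputs feed a very large bundle of independent post-processing tasks. The Python sketch below is illustrative only; the task names and counts are placeholders, and it does not reproduce the actual SCEC workflow definitions, which are managed with grid-based workflow tools.

```python
# Illustrative sketch (not the actual SCEC/CyberShake workflow definition):
# one site of interest expands into two large MPI jobs followed by a large
# bundle of independent post-processing tasks, expressed as a simple DAG.

def cybershake_site_dag(site_name, n_postproc=840_000):
    """Build a dependency graph {task: [prerequisites]} for one site."""
    dag = {
        f"{site_name}/mpi_run_1": [],   # first large-scale MPI calculation
        f"{site_name}/mpi_run_2": [],   # second large-scale MPI calculation
    }
    # Pleasingly parallel post-processing: each task depends on both MPI runs.
    for i in range(n_postproc):
        dag[f"{site_name}/postproc_{i:06d}"] = [
            f"{site_name}/mpi_run_1",
            f"{site_name}/mpi_run_2",
        ]
    return dag

if __name__ == "__main__":
    dag = cybershake_site_dag("example_site", n_postproc=10)  # small count for display
    for task, deps in dag.items():
        print(task, "<-", deps or "no prerequisites")
```

Managing graphs of this shape, with hundreds of thousands of leaf tasks hanging off a handful of large parallel jobs, is what the grid-based workflow tools described above automate across multiple TeraGrid clusters.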
The high core-count simulations can produce 100-200 TB of output data. Much of this is registered
on digital libraries in NCSA and SDSC’s GPFS-WAN. In total, SCEC requires close to half a
petabyte of archival storage per year. Efficient data transfer and access to large files for the Tera3D
project is a high priority. To safely archive the datasets, redundant copies at multiple locations are
used. The collection of Tera3D simulations includes more than a hundred million files, with each
simulation organized as a separate sub-collection in the iRODS data grid.
The distributed TeraGrid CI, with a wide variety of HPC machines (with different numbers of cores,
memory/core, varying interconnect performance, etc.), high bandwidth network, large parallel and
wide-area file systems, and large archival storage, is needed to allow SCEC researchers to carry out
scientific research in an integrated manner.
D.2.2 Nanoscale Electronic Structures/nanoHUB – PI Gerhard Klimeck, Purdue University
Gerhard Klimeck’s lab is tackling the challenge of designing microprocessors and other devices at a
time when their components are dipping into the nanoscale – a billionth of a meter. The new
generation of nano-electronic devices requires a quantum-mechanical description to capture
properties of devices built on an atomic scale. This is required to study quantum dots (spaces where
electrons are corralled into acting like atoms, creating in effect a tunable atom for optical
applications), resonant tunneling diodes (useful in very high-speed circuitry), and tiny nanowires.
The simulations in this project look two or three generations forward as components continue to
shrink, projecting their physical properties and performance characteristics under a variety of
conditions before fabrication.
Klimeck’s team received an NSF Petascale Applications award for his NEMO3-D and OMEN
software development projects, aimed at efficiently using TeraGrid’s petascale systems. They have
already employed the software in multimillion-atom simulations matching experimental results for
nanoscale semiconductors, and have run a prototype of the new OMEN code on 32,768 cores of
TACC’s Ranger system. They also use TeraGrid resources at NCSA, PSC, IU, ORNL and Purdue.
Their simulations involve millions to billions of interacting electrons, and thus require highly
sophisticated and optimized software to run on the TeraGrid’s most powerful systems. Because
different code and machine characteristics may be best suited to different specific research
problems, it is extremely important for the team to plan and execute their virtual experiments on all of
these resources in a coordinated manner, and to easily transfer data between systems.
This project is also creating modeling and simulation tools for facilitating research that other
researchers, educators, and students can use through NanoHUB, a TeraGrid Science Gateway. The
PI likens the situation to making computation as easy as making phone calls or driving cars, without
being a telephone technician or an auto mechanic. Overall, nanoHUB.org is hosting more than 90
simulation tools; more than 6,200 users ran over 300,000 simulations in 2008. More than 44 classes
used the resource for teaching. The hosted codes range in computational intensity from very
lightweight to extremely intensive, such as NEMO 3-D and OMEN. According to Klimeck, it has
become an infrastructure people rely on for day-to-day operations.
nanoHUB plans to be an early tester of the TeraGrid metascheduling capabilities currently being
developed, since interactivity and reliability are high priorities for nanoHUB users. The Purdue team
is also looking at additional communities that might benefit from the use of HUB technology and
TeraGrid. The Cancer Care Engineering HUB is one such community.
D.2.3 Atmospheric Sciences (LEAD), PI Kelvin Droegemeier, University of Oklahoma
In preparation for the spring 2008 Weather Challenge, involving 67 universities, the LEAD team and
TeraGrid began a very intensive and extended “gateway-debug” activity involving Globus
developers, TeraGrid resource provider (RP) system administrators, and the TeraGrid software
integration and gateway teams. Extensive testing and evaluation of GRAM, GridFTP, and RFT were
conducted on an early CTSS v4 testbed especially tuned for stability. The massive debugging efforts
laid the foundation for improvements in reliability and scalability of TeraGrid’s grid middleware for all
gateways. A comprehensive analysis of job submission scenarios simulating multiple gateways is
being used to conduct a scalability and reliability analysis of WS GRAM. The LEAD team also
participated in the NOAA Hazardous Weather Testbed Spring 2008 severe weather forecasts. High
resolution on demand and urgent computing weather forecasts enable scientists to study complex
weather phenomena in near real-time.
As part of a pilot program with Campus Weather Service (CWS) groups from atmospheric science
departments from universities across the country, Millersville University and University of Oklahoma
CWS users are predicting local weather three times per day with 5km, 4km and 2km forecast
resolutions computing on TeraGrid resources and staging data through the IU Data Capacitor. As
part of their ongoing development of reusable LEAD tools, the team is supporting the OGCE
released components – Application Factory, Registry Services and Workflow Tools. TeraGrid
supporters have generalized, packaged and tested the notification system and personal metadata
catalog to prepare for an OGCE release to be used by the gateway community, and will provide workflow
support to integrate with the Apache ODE workflow enactment engine.
D.2.4 Astronomy – PIs Mike Norman, UCSD, Tom Quinn, U. Washington
ENZO is a multi-purpose code (developed by Norman’s group) for computational astrophysics. It
uses adaptive mesh refinement to achieve high temporal and spatial resolution, and includes a
particle-based method for modeling dark matter in cosmology simulations, and state-of-the-art PPML
algorithms for MHD. A version that couples radiation diffusion and chemical kinetics is in
development.
ENZO consists of several components that create initial conditions, evolve a simulation in time, and
then analyze the results. Each component has quite different computational requirements; the full
set of requirements cannot be met at any single TeraGrid site. For example, the current initial
conditions generator is an OpenMP-parallel code that requires a large shared memory system such
as PSC’s Pople. The initial conditions data can be very large; the initial conditions for a 2048³
cosmology are approximately 1 TB in size. Production simulation runs are done mainly on NICS’s
Kraken and TACC’s Ranger, so the initial conditions must be transferred to these sites over the
TeraGrid network using GridFTP. Similarly, the output from an ENZO simulation must generally be
transferred to a site with suitable resources for analysis and visualization, both of which typically
require large shared memory systems similar to PSC’s Pople. Furthermore, some sites are better
equipped to provide long-term archival storage of a complete ENZO simulation (of the order of 100
TB) for a period of months to years. Thus, almost every ENZO run at large scale is dependent on
multiple TeraGrid resources and the high-speed network links between the TeraGrid sites.
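As a rough, illustrative calculation of why the inter-site network matters for this workflow, the sketch below estimates the time to stage a ~1 TB set of 2048³ initial conditions between sites; the assumed link rate and transfer efficiency are editorial assumptions, not measured TeraGrid figures.

```python
# Back-of-the-envelope estimate for staging ENZO initial conditions between sites.
# The ~1 TB data size comes from the text above; the link rate and efficiency
# are illustrative assumptions, not measured TeraGrid numbers.

data_bytes = 1e12                 # ~1 TB of 2048^3 initial conditions
link_bits_per_s = 10e9            # assumed 10 Gb/s wide-area link
efficiency = 0.5                  # assumed fraction of line rate actually achieved

seconds = data_bytes * 8 / (link_bits_per_s * efficiency)
print(f"Estimated transfer time: {seconds / 3600:.1f} hours")
# Roughly 13 minutes at full line rate and under half an hour at 50% efficiency;
# a 1 Gb/s campus path would take roughly ten times longer.
```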
Quinn (University of Washington) is using the N-body cosmology code GASOLINE to analyze N-body simulations of structure formation in the universe. This code utilizes the TeraGrid infrastructure
in a similar fashion to the ENZO code. Generation of the initial conditions, done using a serial code,
requires several hundred GB of memory and is optimally run on a shared memory system such as PSC’s
Pople. The highly scalable N-body simulations are performed on TACC’s Ranger and NICS’s Kraken
requiring the initial condition data to be transferred over the high bandwidth TeraGrid network. The
total output from the simulations can reach a few petabytes and approximately one thousand files.
The researchers use visualization software that allows interactive steering, and they are exploring
the TeraGrid global filesystems to ease data staging for post processing and visualization.
D.2.5 Biochemistry/Molecular Dynamics – Multiple PIs (Adrian Roitberg, U. Florida, Tom
Cheatham, U. Utah, Greg Voth, U. Utah, Klaus Schulten, UIUC, Carlos Simmerling,
SUNY Stony Brook, etc.)
Many of the Molecular Dynamics (MD) users use the same codes (such as AMBER, NAMD,
CHARMM, LAMMPS, etc.) for their research, although they are looking at different research
problems, such as drug discovery, advanced materials research, and advanced enzymatic catalyst
design impacting areas such as bio-fuel research. The broad variation in the types of calculations
needed to complete various MD workflows (including both quantum and classical calculations),
along with large scale storage and data transfer requirements, define a requirement for a diverse set
of resources coupled with high bandwidth networking. The TeraGrid, therefore, offers an ideal
resource for all researchers who conduct MD simulations.
Classical MD runs using the AMBER and NAMD packages (as well as other commonly available MD
packages) use the distributed memory architectures present in Abe, Kraken and Ranger very
efficiently for running long time scale MD simulations. These machines allow simulations that were
not possible as recently as two years ago, and they are having enormous impact on the field of MD.
Some MD researchers use QM/MM techniques, and these researchers benefit from the existence of
machines (Abe, Kraken) with nodes that have different amounts of memory per node, as the large
memory nodes are used for the quantum calculations, and the other nodes for the classical part of
the job.
Specific types of quantum calculations, which are an integral part of force-field parameterization
and are sometimes used for the MD runs themselves in the form of hybrid QM/MM
calculations, require large shared memory machines like NCSA’s Cobalt or PSC’s Pople. These shared
memory machines lend themselves to the extremely fine grained parallelization needed for the rapid
solving of the self consistent field equations necessary for QM/MM MD simulations.
The reliability and predictability of biomolecular simulation is increasing at a fast pace and is fueled
by access to the NSF large-scale computational resources across the TeraGrid. However,
researchers are now entering a realm where they are becoming deluged by the data and its
subsequent analysis. More and more, large ensembles of simulations, often loosely coupled, are run
together to provide better statistics, sampling and efficient use of large-scale parallel resources.
Managing these simulations, performing post-processing/visualization, and ultimately steering the
simulations in real-time currently has to be done on local machines. The TeraGrid Advanced User
Support program is working with the MD researchers to address some of these limitations. Although
most researchers currently are bringing data back to their local sites to do analysis, this is quickly
becoming impractical and is limiting scientific discovery. Access to large persistent analysis space
linked to the various computational resources on the TeraGrid by the high-bandwidth TeraGrid
network is therefore essential to enabling groundbreaking new discoveries in this field.
D.2.6 Biosciences (Robetta) – PI David Baker, University of Washington
Protein structure prediction is one of the more important components of bioinformatics. The Rosetta
code, from the David Baker laboratory, has performed very well at CASP (Critical Assessment of
Techniques for Protein Structure Prediction) competitions and is available for use by any academic
scientist via the Robetta server – a TeraGrid science gateway. Robetta developers were able to use
TeraGrid’s gateway infrastructure, including community accounts and Globus, to allow researchers
to run Rosetta on TeraGrid resources through the gateway. This very successful group did not need
any additional TeraGrid assistance to build the Robetta gateway; it was done completely by using
the tools TeraGrid provides to all potential gateway developers. Google Scholar reports 601
references to the Robetta gateway, including many PubMed publications. Robetta has made
extensive use of a TeraGrid roaming allocation and will be investigating additional platforms such as
Purdue’s Condor pool and the NCSA/LONI Abe-QueenBee systems.
D.2.7 Structural Engineering – Multiple NEES PIs
The Network for Earthquake Engineering Simulation (NEES) project is an NSF MRE project that
seeks to lessen the impact of earthquake and tsunami-related disasters by providing revolutionary
capabilities for earthquake engineering research. A state-of-the-art network links world-class
experimental facilities around the country, enabling researchers to collaborate on experiments,
computational modeling, data analysis and education. NEES currently has about 75 users across
about 15 universities. These users use TeraGrid HPC and data resources for various structural
engineering simulations using both commercial codes and research codes based on algorithms
developed by academic researchers. Some of these simulations, especially those using commercial
FEM codes such as Abaqus, Ansys, Fluent, and LS-Dyna, require moderately large shared memory
nodes, such as the large memory nodes of Abe and Mercury, but scale to only a few tens of
processors using MPI. Large memory is needed so that the whole mesh structure can be read into a
single node, due to the basic FEM algorithm applied in some simulation problems. Many of these
codes have OpenMP parallelization, in addition to using MPI, and users mainly utilize shared
memory nodes using OpenMP for pre/post processing. On the other hand, some academic codes,
such as the OpenSees simulation package tuned for specific material behavior, can scale well on
many thousands of processors, including Kraken and Ranger. Due to the geographically
distributed location of NEES researchers and experimental facilities, high bandwidth data transfer
and data access are vital.
NEES researchers also perform “hybrid tests” where multiple geographically distributed structural
engineering experimental facilities (e.g., shake tables) perform structural engineering experiments
simultaneously in conjunction with simulations running on TeraGrid resources. Some complex
pseudo real-life engineering test cases can only be captured by having multiple simultaneous
experiments coupled with complementary simulations, as they are too complex to perform by either
experimental facilities or simulations alone. These “hybrid tests” require close coupling and data
transfer in real time between experimental facilities and TeraGrid compute and data resources using
the fast network. NEES as a whole is dependent on the variety of HPC resources of TeraGrid, the
high bandwidth network and data access and sharing tools.
D.2.8 Social Sciences (SIDGrid) – PI Rick Stevens, University of Chicago; TeraGrid Allocation
PI, Sarah Kenny, University of Chicago
SIDGrid is a social science team using TeraGrid to develop the only science gateway in this field,
providing unique capabilities for social science researchers. Social scientists make heavy use of
“multimodal” data, streaming data that change over time, such as a researcher collecting heart rate
and eye movement data while a human subject views a video. Collecting data many times per
second, synchronized for analysis, results in large datasets. Sophisticated analysis tools are
needed to study these datasets, which can involve multiple datasets collected at different time
scales. Providing these analysis capabilities through a gateway means that individual laboratories do
not need to recreate the same sophisticated analysis tools; geographically distant researchers can
collaborate on the analysis of the same data sets; and the opportunity for science impact is
increased by providing access to the highest quality data and resources to all social scientists.
SIDGrid uses TeraGrid resources for computationally intensive tasks such as media transcoding
(decoding and encoding between compression formats), pitch analysis of audio tracks and functional
Magnetic Resonance Imaging (fMRI) image analysis. These often result in large numbers of single
node jobs. TeraGrid roaming and TACC’s Spur and Ranger are currently used. Workflow tools (e.g.,
SWIFT) are very useful in job management. Active users of SIDGrid include a human neuroscience
group and linguistic research groups from U. Chicago and U. Nottingham. TeraGrid provides support
to make more effective use of the resources. A new application framework has been developed to
enable users to easily deploy new social science applications in the SIDGrid portal.
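The pattern described above, many independent single-node tasks fanned out over compute resources, can be illustrated with a minimal sketch; the transcode() function below is a hypothetical stand-in for a real media tool, and this is not SIDGrid's actual application framework or its Swift workflows.

```python
# Illustrative sketch only: fanning a batch of independent, single-node media
# transcoding tasks out over local workers. SIDGrid itself uses workflow tools
# (e.g., Swift) on TeraGrid resources; transcode() here is a hypothetical stand-in.
from concurrent.futures import ProcessPoolExecutor

def transcode(path):
    """Placeholder for a real decode/encode step applied to one media file."""
    return f"{path}.transcoded"

def run_batch(paths, workers=8):
    # Each file is an independent task, so the batch parallelizes trivially.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcode, paths))

if __name__ == "__main__":
    outputs = run_batch([f"session_{i:03d}.avi" for i in range(32)])
    print(f"{len(outputs)} files transcoded")
```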
D.2.9 Biosciences – PI George Karniadakis, Brown University
High-resolution, large-scale simulations of blood flow in the human arterial tree (HAT) require
solution of flow equations with billions of degrees of freedom. Tens or hundreds of thousands of computer
processors are needed to perform such computationally demanding simulations. Use of a network of
distributed computers (TeraGrid) presents an opportunity to carry out these simulations efficiently;
however, new computational methods must be developed. The HAT project has developed a new
scalable approach for simulating large multiscale computational mechanics problems on a network
of distributed computers. The method has been successfully employed in cross-site simulations
connecting SDSC, TACC, PSC, UC/ANL, and NCSA.
The project considers 3D simulation of blood flow in the intracranial arterial tree using NEKTAR, the
spectral/hp element solver developed at Brown. It employs a multi-layer hierarchical approach
whereby the problem is solved on two layers. On the inner layers, solutions of large tightly coupled
problems are performed simultaneously on different supercomputers, while on the outer layer, the
solution of the loosely coupled problem is performed across distributed supercomputers, involving
considerable inter-machine communication. The heterogeneous communication topology (i.e., both
intra- and inter-machine communication) is performed with MPIg. MPIg's multithreaded architecture
allows applications to overlap computation and inter-site communication on multicore systems.
Cross-site computations performed on the TeraGrid's clusters demonstrate the benefits of MPIg over
MPICH-G2. The multi-layer communication interface implemented in NEKTAR permits efficient
communication between multiple groups of processors. This methodology is suitable for solution of
multi-scale and multi-physics problems on distributed and on modern petaflop supercomputers.
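The two-layer decomposition described above can be illustrated with standard MPI communicator splitting: ranks on the same machine form a tightly coupled inner communicator, and one leader per machine joins an outer communicator for the loosely coupled inter-site exchange. The mpi4py sketch below is a minimal illustration of that pattern under an assumed per-rank site label; it is not the NEKTAR or MPIg implementation.

```python
# Minimal sketch of a two-layer communication hierarchy (not the NEKTAR/MPIg code).
# Inner layer: all ranks on the same machine/site. Outer layer: one leader per site.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Hypothetical site label for each rank; a real run would derive this from the
# host or the resource manager rather than a fixed modulus.
site_id = rank % 2

inner = comm.Split(color=site_id, key=rank)            # tightly coupled, intra-site
is_leader = inner.Get_rank() == 0
outer = comm.Split(color=0 if is_leader else MPI.UNDEFINED, key=rank)

# Inner layer: e.g., a tightly coupled reduction within one supercomputer.
local_sum = inner.allreduce(rank, op=MPI.SUM)

# Outer layer: leaders exchange the (much smaller) coupled-interface data.
if outer != MPI.COMM_NULL:
    global_view = outer.allgather(local_sum)
    print(f"site {site_id}: per-site sums seen by leaders = {global_view}")
```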
D.2.10 Neutron Science – PI John Cobb, ORNL
The Neutron Science TeraGrid Gateway (NSTG) project is an exemplar for the use of CI for
simulation and data analysis coupled to an experiment. The unique contributions of NSTG are the
connection of national user facility instrument data sources to the integrated TeraGrid CI and the
development of a neutron science gateway that allows scientists to use TG resources to analyze
their data, and compare experiment with simulation. The NSTG is working closely with the Spallation
Neutron Source (SNS) at ORNL as the principal facility partner. The SNS is a next-generation
neutron source, which has completed construction and is ramping up operations. The SNS will
provide an order of magnitude greater flux than any other neutron scattering facility and will be
available to all of the nation's scientists, on a reviewed basis. With this new capability, the neutron
science community is facing orders of magnitude larger data sets and is at a critical point for data
analysis and simulation. They recognize the need for new ways to manage and analyze data to
optimize both beam time and scientific output. TeraGrid provides new simulation capabilities by
using McStas and new data analysis capabilities by the development of a fitting service. Both run on
distributed TeraGrid resources (ORNL, TACC and NCSA) to improve turnaround. NSTG is also
exploring archiving experimental data on TeraGrid. As part of the SNS partnership, the NSTG
provides gateway support, CI outreach, community development, and user support for the neutron
science community, including SNS staff and users, all five neutron scattering centers in North
America, and several dozen worldwide.
D.2.11 GIScience – PI Shaowen Wang, University of Illinois
The GIScience gateway, a geographic information systems (GIS) gateway, has over 60 regular
users and is used by undergraduates in coursework at UIUC. GIS is becoming an increasingly
important component of a wide variety of fields. The GIScience team has worked with researchers in
fields as distinct as ecological and environmental research, biomass-based energy, linguistics
(linguist.org), coupled natural and human systems and digital watershed systems, hydrology and
epidemiology. The team has allocations on resources in TeraGrid ranging from TACC’s Ranger
system to NCSA’s shared memory Cobalt system to Purdue’s Condor pool and Indiana’s BigRed
system. Most usage to date has been on the NCSA/LONI Abe-QueenBee systems. The GIScience
gateway may also lead to collaborations with the Chinese Academy of Sciences through the work of
the PI.
D.2.12 Computer Science: Solving Large Sequential Two-person Zero-sum Games of Imperfect
Information – PI Tuomas Sandholm, Carnegie Mellon University
While many games can be formulated mathematically, the formulations for those that best represent
the challenges of real-life human decision making (in national defense, economics, etc.) are huge.
For example, two-player poker has a game-tree of about 10¹⁸ nodes. In the words of Sandholm's
Ph.D. student Andrew Gilpin, “[It] … requires massive computational resources. Our research is on
scaling up game-theory solution techniques to those large games, and new algorithmic design.” The
most computationally intensive portion of Sandholm and Gilpin's algorithm is a matrix-vector product,
where the matrix is the payoff matrix and the vector is a strategy for one of the players. This
operation accounts for more than 99% of the computation, and is a bottleneck to applying game
theory to many problems of practical importance. To drastically increase the size of problems the
algorithm can handle, Gilpin and Sandholm devised an approach that exploits massively parallel
systems of non-uniform memory-access architecture, such as PSC’s Pople. By making all data
addressable from a single process, shared memory simplifies a central, non-parallelizable operation
performed in conjunction with the matrix-vector product.
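To make the bottleneck concrete, the sketch below applies a sparse stand-in for a payoff matrix to a mixed-strategy vector; the dimensions and sparsity are toy values chosen for illustration, and this is not the authors' solver, only the shape of the operation that dominates its run time.

```python
# Toy illustration of the dominant operation: payoff matrix times strategy vector.
# Sizes and sparsity are arbitrary stand-ins; the real game matrices are vastly
# larger, which is why shared-memory systems such as Pople are attractive.
import numpy as np
import scipy.sparse as sp

n_rows, n_cols = 200_000, 200_000
A = sp.random(n_rows, n_cols, density=1e-5, format="csr", random_state=0)

# A mixed strategy: a probability distribution over the column player's choices.
rng = np.random.default_rng(0)
x = rng.random(n_cols)
x /= x.sum()

# The matrix-vector product performed repeatedly by the iterative solver.
y = A.dot(x)
print("expected payoff vector computed, first entries:", y[:3])
```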
D.2.13 Chemistry (GridChem) – Project PI John Connolly, University of Kentucky; TeraGrid
Allocation PI Sudhakar Pamidighantam, NCSA
Computational chemistry forms the foundation not only of chemistry, but is required in materials
science and biology as well. Understanding molecular structure and function are beneficial in the
design of materials for electronics, biotechnology and medical devices and also in the design of
pharmaceuticals. GridChem, an NSF Middleware Initiative (NMI) project, provides a reliable
infrastructure and capabilities beyond the command line for computational chemists. GridChem, one
of the most heavily used TeraGrid science gateways in 2008, requested and is receiving advanced
support resources from the TeraGrid. This advanced support work will address a number of issues,
many of which will benefit all gateways. These issues include common user environments for
domain software, standardized licensing, application performance characteristics, gateway
incorporation of additional data handling tools and data resources, fault tolerant workflows,
scheduling policies for community users, and remote visualization. This collaboration with TeraGrid
staff is ongoing in 2009.
D.3 Advanced Capabilities Enabling Science
The transformative science examples in §D.2 are enabled by the coordinated efforts of the TeraGrid
project. The TeraGrid advanced capabilities (those delivered to the user community above simple
access to computer cycles) are developed based on existing and expected user needs. These
needs are determined from direct contact with users, surveys, discussions with potential users, and
collaborations with other CI projects. TeraGrid advanced CI capabilities are tested by projects that
expressed interest in the CI features. In other cases, such as for advanced user support, where
there is more need for a capability than we can deliver, we use the allocations process to receive
recommendations on where we should apply our efforts. All the projects described in §D.2 are driving
or using TeraGrid advanced capabilities, and will continue to do so during the extension period, in
order to obtain the best possible science results. We will support new data capabilities as a central
component of whole new forms of data-intensive research, especially in combination with advanced
visualization and community interfaces such as those supported by the gateways efforts. We will
continue to enhance these advanced capabilities in a collaborative context, with TeraGrid staff
bringing their expertise to bear on user needs to improve the experiences of current users of
TeraGrid, and to help develop the new generation of XD users.
D.3.1 Advanced User Support

“I consider the user support people to be the most valuable aspect of the TeraGrid because the infrastructure is only as good as the people who run and support it.” – Martin Berzins, University of Utah

Advanced User Support (AUS) plays a critical role in enabling science in the TeraGrid, both with regard to ultra high end users, such as petascale users, and less traditional users of CI, such as users whose research focuses primarily on data analysis or visualization, and users in areas such as the social sciences. AUS staff (computational scientists from all RP sites, with Ph.D.-level expertise in various domain sciences, HPC, CS, visualization, and workflow tools) will be responsible for the highest level of support for TeraGrid users. The overall advanced support efforts under the AUS area will consist of three sub-efforts: Advanced Support for TeraGrid Applications (ASTA), Advanced Support for Projects (ASP), and Advanced Support for EOT (ASEOT).
AUS operations will be coordinated by the AUS Area Director jointly with the AUS Points of Contact
(POCs) from the RP sites; together they will handle the management and coordination issues
associated with ASTA, ASP and ASEOT. They have created an environment of cooperation and
collaboration among AUS technical staff from across the RP sites where AUS staff benefit from each
other’s expertise and work jointly on ASTA, ASP and ASEOT projects.
D.3.1.1 Advanced Support for TeraGrid Applications (ASTA)
ASTA projects allow AUS staff to work with a PI for a period of a few months to a year, so that the
project is able to optimally use TeraGrid resources for science research. As has been shown in the
past, ASTA efforts will be vital for many of the ground-breaking simulations performed by TeraGrid
users. ASTA work will include porting applications, transitioning them from outgoing to incoming
TeraGrid resources, implementing algorithmic enhancements, implementing parallel programming
methods, incorporating mathematical libraries, improving the scalability of codes to higher core
counts, optimizing codes to utilize specific resources, enhancing scientific workflows, and tackling
visualization and data analysis tasks. To receive ASTA support, TeraGrid users submit a request as
part of an allocation request. Next, allocations reviewers provide a recommendation score, AUS staff
work with the user to define an ASTA work plan, and finally, AUS staff provide ASTA support to the
user. The AUS effort optimally matches TeraGrid-wide AUS staff to ASTA projects, taking into
account the reviewers’ recommendation, AUS staff expertise in relevant domain science/HPC/CI, the
ASTA project work plan, and the site where the user has an allocation. ASTA projects provide long-term benefits to the user team, other TeraGrid users, and the TeraGrid project. ASTA project results
provide insights and exemplars for the general TeraGrid user community; they are included in
documentation, training and outreach activities. ASTA efforts also allow us to bring in new user
communities, from the social sciences, humanities, and other fields, and enable them to use TeraGrid resources.
In addition, ASTA insights help us understand the need for new TeraGrid capabilities.
D.3.1.2 Advanced Support Projects (ASP)
The complex, ever-changing, and leading-edge nature of the TeraGrid infrastructure necessitates
identifying and undertaking advanced projects that will have significant impact on large groups of
TeraGrid users. ASPs are identified based on the broad impact they will have on the user
community, by processing input from users, experienced AUS and frontline support staff and other
TeraGrid experts. AUS staff expertise in various domain sciences and experience in HPC/CI, along
with deep understanding of users’ needs, play an important role in identifying such projects. ASP
work will include (1) porting, optimizing, benchmarking and documenting widely-used domain
science applications from outgoing to incoming TeraGrid machines; (2) addressing the issues in
scaling these applications to tens of thousands of cores; (3) investigating and documenting optimal
use of the data-centric, high-throughput, Grid research, and experimental Track-2D systems; (4)
demonstrating feasibility and performance of new programming models (PGAS, hybrid
MPI/OpenMP, MPI one-sided communication, etc.); (5) providing technical documentation on
effective use of profiling and tracing tools on TeraGrid machines for single-processor and parallel
performance optimization; and (6) providing usage-based visualization, workflow, and data
analysis/transfer use cases.
D.3.1.3 Advanced Support for EOT
In this area, AUS staff provide their expertise in support of education, outreach and training. AUS
staff will contribute to advanced HPC/CI training (both synchronous and asynchronous) and teach
such topics. AUS staff will provide outreach to the TeraGrid user community about the transition to
XD, and on the process for requesting support through ASTA and ASP. In this regard, AUS staff will
reach out to the NSF program directors that fund computational science and CI research projects.
AUS staff will be involved in planning and organizing TG10, SC2010 and other workshops and
attending and presenting at these workshops, BOFs, and panels. We will provide outreach to other
NSF CI programs (e.g., DataNet, iPlant, etc.) and enable them to use TeraGrid resources. We will
pay special attention to broadening participation of underrepresented user groups and provide
advanced support as appropriate and under the guidance of the allocation process.
D.3.2 Advanced Scheduling and Meta Scheduling
TeraGrid systems have traditionally been scheduled independently, with each system’s local
scheduler optimized to meet the needs of local users. Feedback from TeraGrid users, user surveys,
review panels, and the science advisory board, has indicated emerging user needs for coordinated
scheduling capabilities. In PY2, our scheduling and metascheduling requirements analysis teams
(RATs) identified advance reservation, co-scheduling, automatic resource selection (aka
metascheduling), and on-demand (aka urgent) computing as the most needed capabilities. We
formed a scheduling working group (WG) in late PY2/early PY3. In PY3 and PY4, the WG developed
several TeraGrid-wide capability definitions and implementation plans that are now being used to
finalize production support in the remainder of PY4. Maintenance of these capabilities is described in
§D.5.3.
We are currently moving three of these capabilities into production: on-demand/urgent computing,
advance reservation, and co-scheduling. Automatic resource selection is available, but only for the
two IA64 systems at SDSC and NCSA, with another four systems to be added in PY4 and PY5.
Although it is not yet clear what the level of demand for these services will be, we have ample
evidence that they will be used by some TeraGrid users (as described in the examples in §D2)
for innovative, high-impact scientific explorations. The first two years of use will reveal unanticipated
requirements and limitations of the technology.
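To illustrate the kind of logic these capabilities involve, the sketch below shows, in Python, a simplified automatic resource selection heuristic that ranks candidate systems by predicted wait time using information of the sort published by TeraGrid information and queue-prediction services. The resource names, attributes, and selection rule are hypothetical; this is a conceptual sketch, not the production metascheduler.

    # Illustrative sketch only: a simplified ranking heuristic of the kind an
    # automatic resource selection service might apply. Resource names,
    # attributes, and the selection rule are hypothetical.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Resource:
        name: str
        free_cores: int          # idle cores, as published by information services
        est_wait_minutes: float  # queue-prediction estimate for a job of this size
        supports_reservation: bool

    def select_resource(resources: List[Resource], cores_needed: int,
                        need_reservation: bool = False) -> Resource:
        """Pick the eligible resource with the shortest predicted wait."""
        candidates = [r for r in resources
                      if r.free_cores >= cores_needed
                      and (r.supports_reservation or not need_reservation)]
        if not candidates:
            raise RuntimeError("no resource currently satisfies the request")
        return min(candidates, key=lambda r: r.est_wait_minutes)

    if __name__ == "__main__":
        pool = [Resource("abe", 2048, 45.0, True),
                Resource("steele", 512, 10.0, False),
                Resource("queenbee", 1024, 30.0, True)]
        print("submit to:", select_resource(pool, cores_needed=256).name)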
We propose to allow user needs over the next two years of operation to drive the work in this area
and to allocate a modest budget to meeting these needs. It seems likely at this time that at least two
priorities will be evident in the TeraGrid Extension period: the need to extend our advanced
scheduling capabilities to new resources as they are added, and the need to establish standard
mechanisms with peer systems (e.g., OSG, UK National Grid Service, LHC Computing Grid) that
allow users to integrate their scientific activities on these systems. The existing IA32-64 systems
(Abe, Lonestar, QueenBee, Steele) that will continue to be available to TeraGrid users during this
TeraGrid Extension will be used to production-test these services under load (§D.5.9.1).
D.3.3 Advanced Data Services
Data requirements of the scientific community have been increasing at a rapid rate, both in size and
complexity. With the HPC systems increasing in both capacity and capability, and the generation
and use of experimental and sensor data also increasing, this trend is unlikely to change. This
means that we must continue TeraGrid’s efforts to provide reliable data transfer, management and
archival capabilities. The data team has studied the data movement and management patterns of
TeraGrid’s current user community, and developed a data architecture plan that is being
implemented by the RPs and the Data working group. Further effort to implement the data
architecture and its component pieces will help users with their current concerns and provide an
approach that will persist into XD. High-performance data transfers, more sophisticated metadata
and data management capabilities, global file systems for data access, and archival policies are all
essential parts of the plan, and we will work to integrate them into production systems and
operations. A consistent, high-level approach to data movement and management in the TeraGrid is
necessary to respond to ongoing feedback from TeraGrid users and to support their needs.
D.3.3.1 Global Wide Area File Systems
Global Wide Area File Systems consistently rank at or near the top of user requirements within
the TeraGrid; significant strides have been made in their implementation, but several more are
needed before they can become ubiquitous.
We have committed to a project-wide implementation of Lustre-WAN as a wide area file system,
available in PY5 at a minimum of three RP sites. The IU Data Capacitor WAN file system (984 TB
capacity) is mounted on two resources now and we are continuing efforts to expand the number of
production resources with direct access to this file system. Future development plans focus on
increasing security and performance through the provision of distributed storage physically located
near HPC resources. New effort in the TeraGrid Extension provides $100k for additional hardware at
PSC, NCSA, IU, NICS, and TACC to expand this distributed storage resource, and also provides 0.50
FTE at PSC, NCSA, IU, NICS, TACC, and SDSC to support deployment. This effort will deploy additional
Lustre-WAN disk resources as part of a wide-area file system available on all possible
resources continued in the TeraGrid Extension. We will also deploy wide-area file systems on Track
2d and XD/Remote Visualization awardee resources as appropriate.
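As a small illustration of how applications would consume this capability, the following Python sketch checks whether a wide-area file system is visible on the current resource before reading shared input directly; the mount point shown is hypothetical and differs by RP site.

    # Illustrative sketch: verify that a wide-area file system is mounted on this
    # resource before reading shared input from it. The mount point is hypothetical.
    import os
    import sys

    LUSTRE_WAN_ROOT = "/lustre-wan"                       # hypothetical; varies by site
    INPUT_DIR = os.path.join(LUSTRE_WAN_ROOT, "projects", "myproject")

    if os.path.ismount(LUSTRE_WAN_ROOT) and os.path.isdir(INPUT_DIR):
        print("Lustre-WAN visible; reading input directly from", INPUT_DIR)
    else:
        sys.exit("wide-area file system not mounted here; fall back to staging via GridFTP")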
This effort will also support SDSC’s GPFS-WAN (700 TB capacity), which will continue to be
available and will support data collections. It may be used within the archival replication project as a
wide-area file system or high-speed data cache for transfers. If appropriate, hardware resources
could be redirected to participate in a TeraGrid-wide Lustre-WAN solution.
pNFS is an extension to the NFS standard that allows for wide area parallel file system support
using an interoperable standard. If pNFS clients are provided by system vendors, pNFS could
obviate issues with licensing and compatibility that currently present an obstacle to global
deployment of wide-area file systems. More development and integration with vendors is necessary
before pNFS can be seen as a viable technology for production resources within the TeraGrid, but
these developments are highly likely to occur within the timeframe of this extension. We will also
continue investigating other alternatives (e.g., ReddNet, PetaShare).
D.3.3.2 Archive Replication
Archival replication services are an area of recognized need, and a separate effort will be
undertaken to provide software to support replication of data across multiple TeraGrid sites. Ongoing
effort will be required, however, to support users and applications accessing the archives and
replication services. In addition, management of data and metadata in large data collections, across
both online and archive resources, is an area of growing need. The data architecture team will work
with the archival replication team to ensure smooth interaction between existing data architecture
components and the archival replication service, and to study and document archival practices,
patterns and statistics regarding usage by the TeraGrid user community.
D.3.3.3 Data Movement Performance
The data movement performance team has been instrumental in mapping and instrumenting the use
of data movement tools across TeraGrid resources and from there to external locations. This team is
implementing scheduled data movement tools, including interfaces to the TeraGrid User Portal. Once
these tools are in place by the end of PY5, we will take advantage of performance and reliability
enhancements in data movement technologies, working jointly with the QA team in this regard.
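As an illustration of the data movement being instrumented and tuned here, the following Python sketch drives a parallel GridFTP third-party transfer with the standard globus-url-copy client; the endpoints and paths are hypothetical, and a valid grid proxy is assumed.

    # Illustrative sketch: a parallel GridFTP third-party transfer driven from a
    # script. Endpoints and paths are hypothetical; a grid proxy is assumed.
    import subprocess

    SRC = "gsiftp://gridftp.rp-a.example.org/scratch/user/run42/output.dat"
    DST = "gsiftp://gridftp.rp-b.example.org/archive/user/run42/output.dat"

    subprocess.run([
        "globus-url-copy",
        "-p", "4",             # four parallel TCP streams
        "-tcp-bs", "8388608",  # 8 MB TCP buffer
        "-vb",                 # report transfer performance
        SRC, DST,
    ], check=True)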
D.3.3.4 Visualization and Data Analysis
Visualization and data analysis services funded in the TeraGrid Extension period will be focused
exclusively on visualization consulting and user support through the deployment and development of
tools required by the user community. Both visualization and data analysis at the petascale continue
to present significant challenges to the user community and require collaboration with visualization
and data analysis experts. Additionally, the need for the deployment of more sophisticated data
analysis capabilities is becoming more apparent as shown by user requests. Data analysis often
benefits from large shared memory machines such as will be available on Pople under this
extension. Visualization and data analysis have traditionally relied heavily on the value-added
resources and services at RP sites, and these services continue to be a critical need identified by
the user community. Building upon the work at the RP sites and the anticipated introduction of two
new data and visualization analysis resources, the TeraGrid Extension period efforts will continue to
focus on an integrated, documented visualization services portfolio created with two goals: 1) to
provide the user community with clarity in terms of where to turn for visualization needs; and 2) to
effectively define a set of best practices with respect to providing such services, enabling individual
campuses to harvest the experience of the TeraGrid RP sites. Additionally, visualization consulting is
a growing need for the TeraGrid user community, particularly at the high end. We will leverage
existing capabilities at the RP sites in addition to the new XD remote visualization resource sites to
provide consistent, knowledgeable visualization consulting to the TeraGrid user community.
D.3.3.5 Visualization Gateway
Development of the TeraGrid Visualization Gateway will expand its capabilities and enable additional
services and resources to be included. With community access to the TeraGrid
Visualization Gateway via community allocations and dynamic accounts now available, we will
emphasize educating the user community on the use of this capability, and providing a uniform
interface for visualization and data analysis capabilities. Providing centralized information about and
access to such capabilities will benefit users. We will also build upon the work at the RP sites to
expose these resources and services through the TeraGrid Visualization Gateway.
D.3.4 Science Gateways
Gateways provide community-designed interfaces to TeraGrid resources, extending the command
line experience to include access to datasets, community collaboration tools, visualization
capabilities, and more. TeraGrid provides resources and support for 35 such gateways with
additional gateways anticipated. Section D.2 above illustrates the transformative impact that
TeraGrid gateways have already made on computational science across multiple domains: of the 13
examples described there, nine science programs (SCEC, SIDGrid, NEES, NSTG, GridChem,
Robetta, GIScience, nanoHUB and LEAD) operate and develop gateways. They allow researchers,
educators and students in Geosciences, Social Sciences, Astronomy, Structural Engineering,
Neutron Science, Chemistry, Biosciences, and Nanotechnology to benefit from leading-edge
resources without having to master the complexities of programming, adapting, testing, and running
leading-edge applications.
The Science Gateways program works to identify common needs across projects and works with the
other TeraGrid Areas to prioritize meeting these needs. Goals for the TeraGrid Extension period
include a smoothly functioning, flexible, and effective gateway targeted-support program,
streamlined access to community accounts, and production use of attribute-based authentication.
D.3.5 Gateway Targeted Support Activities
The gateway targeted support program, perhaps the most successful part of the gateway program,
provides assistance to developers wishing to integrate TeraGrid resources into their gateways.
Targeted support is available to any researcher, and requests are submitted through the TeraGrid
allocation process. Because requests are diverse, a flexible team of staff members stands ready
to support approved requests. Requests can come from any discipline and can vary widely between
gateways. One gateway may be interested in adding fault tolerance to a complex, existing workflow.
Another may have not used any grid computing software previously and need help getting started. A
third may be interested in using sophisticated metascheduling techniques. Outreach will be
conducted to make sure that underrepresented communities are aware of the targeted support
program.
The TeraGrid Extension period targeted support projects will be chosen through the TeraGrid’s
planning process which starts with an articulation of objectives to reflect both the progress achieved
in PY5 and the need for a smooth transition to XD.
To illustrate the type of projects included in targeted support and to describe the work upon which
the TeraGrid Extension period activities will build, we describe here some of the targeted support
projects planned for PY5:
 Assist the GridChem gateway in the areas of common chemistry software access across RP sites, data management, improved workflows, visualization and scheduling.
 Assist PolarGrid with TeraGrid integration. This may include real-time processing of sensor data, support for parallel simulations, GIS integration, and EOT components.
 Prototype creation of an OSG cloud on TeraGrid resources via NIMBUS and work with OSG science communities to resolve issues.
 Augment SIDGrid with scheduling enhancements, improved security models for community accounts, data sharing capabilities and workflow upgrades. Lessons learned will be documented for other gateways and projects.
 Develop and enhance the simple gateway framework, SimpleGrid. Within this effort, we plan to augment the online training service for building new science gateways, develop a prototyping service to support virtualized access to TeraGrid, develop a streamlined packaging service for new gateway deployment, develop a user-level TeraGrid usage service within SimpleGrid based on the community account model and attribute-based security services, work with potential new communities to improve the usability and documentation of the proposed gateway support services, and conduct education and outreach work using the SimpleGrid online training service.
 Adapt the Earth System Sciences community to use the TeraGrid via a semantically enabled environment that includes modeling, simulated, and observed data holdings, and visualization and analysis for climate and related domains. We will build upon synergistic community efforts including the Earth System Grid (ESG), Earth System Curator (ESC), Earth System Modeling Framework (ESMF), the Community Climate System Model (CCSM) Climate Portal (developed at Purdue University), and NCAR's Science Gateway Framework (SGF) effort. We will also extend the Earth System Grid-Curator (ESGC) Science Gateway so that CCSM runs can be initiated on TeraGrid.
 Extend the Computational Infrastructure for Geodynamics (CIG) gateway to support running parameter sweeps through regions of the input parameter space on TeraGrid. For example, the SPECFEM3D code computes a simulation of surface ground motion at real-world seismological recording stations according to a whole-earth model of seismological wave propagation. Multiple parameter sweep runs produce 'synthetic seismograms' that are compared with measured ground motions.
The TeraGrid Extension period projects will be selected during the TeraGrid planning process.
Some groups who have expressed interest in the gateway program, and with whom we have not
yet worked extensively, include:
 Center for Genomic Sciences (CGS), Allegheny-Singer Research Institute/Allegheny General Hospital, is interested in using the TeraGrid for genome sequencing via a pyrosequencing platform from Roche. Computing would run on the TeraGrid rather than on local clusters that are now required by the Roche platform and seen as a barrier to entry for some users.
 The Center for Analytical Ultracentrifugation of Macromolecular Assemblies, University of Texas Health Science Center at San Antonio, runs a centrifuge and maintains analysis software. They would like to port the analysis software to the TeraGrid and incorporate access into a gateway.
 The director of Bioinformatics Software at the J. Craig Venter Institute is interested in developing a portal to National Institute of Allergy and Infectious Diseases (NIAID) Bioinformatics Resource Centers.
 San Diego State University (SDSU) is interested in developing a TeraGrid Gateway for a NASA proposal entitled "Spatial Decision Support System for Wildfire Emergency Response and Evacuation." The gateway would automate the data collection, data input formatting, GIS model processing, and rendering of model results on 2D maps and 3D globes and run the FARSITE (Fire Area Simulator) code from the US Forest Service.
 The Cyberinfrastructure for Phylogenetic Research (CIPRES) project is interested in incorporating TeraGrid resources into their portal in order to serve an increasing number of researchers.
 The minimally funded gateway component of the TeraGrid Pathways program could be expanded via a targeted support project.
 SDSU is also interested in developing a TeraGrid gateway to provide a Web-enhanced Geospatial Technology (WGT) Education program through the geospatial cyberinfrastructure and virtual globes. High school students at five schools would be involved in gateway development.
D.3.5.1 Gateway Support Services
In addition to supporting individual gateway projects, TeraGrid staff provide and develop general
services that benefit all projects, including gateways. These activities include help desk support
(answering user questions, routing user requests to appropriate gateway contacts, and tracking user
responses), documentation, providing relevant input for the TeraGrid Knowledgebase, SimpleGrid
for basic gateway development and teaching, gateway hosting services, a gateway software registry,
and security tools (including the Community Shell, credential management strategies, and attribute-based authentication).
While community accounts increase access, they also obscure the number of researchers using the
account and therefore using the TeraGrid. To capture this information automatically, in PY5 we are
implementing attribute-based authentication, through the use of GridShib SAML tools. This allows
gateways to send additional attributes via the credentials used to submit a job. These attributes are
stored in the TeraGrid central database, allowing TeraGrid to query the database for the number of
end users of each gateway using community accounts. Additional capabilities include the ability to
blacklist individual gateway users or IPs so the gateway can continue to operate in the event of a
security breach. TeraGrid can also provide per-user accounting information for gateways. GridShib
SAML tools and GridShib for Globus Toolkit have been released for the CTSS science gateway
capability kit. The release includes extensive documentation for gateway developers and resource
providers.
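The sketch below illustrates, in Python, the bookkeeping idea behind per-user accounting under community accounts: the gateway records which portal end user initiated each job and associates that identity with the submission. The field names and local audit-log format are hypothetical; production gateways convey such attributes through the GridShib SAML tools rather than ad hoc logging.

    # Illustrative sketch only: record which portal end user is behind each
    # community-account job so per-user accounting is possible. The log format
    # and field names are hypothetical.
    import csv
    import datetime
    import uuid

    AUDIT_LOG = "gateway_job_audit.csv"   # hypothetical local audit log

    def record_submission(portal_user: str, resource: str, project: str) -> str:
        """Assign a local job tag and log the end user behind the submission."""
        job_tag = str(uuid.uuid4())
        with open(AUDIT_LOG, "a", newline="") as f:
            csv.writer(f).writerow([datetime.datetime.utcnow().isoformat(),
                                    job_tag, portal_user, resource, project])
        return job_tag

    if __name__ == "__main__":
        tag = record_submission("jdoe@example.edu", "abe.ncsa.teragrid.org", "TG-ABC123")
        print("submitted community-account job with audit tag", tag)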
In the TeraGrid Extension period, we will standardize the implementation and documentation of
community accounts across the RPs. Maintaining and updating these standards will make the
integration of new systems into TeraGrid straightforward, which directly supports gateways in their
use of community accounts.
D.4 Supporting the User Community
The TeraGrid user community as a whole, including the users contributing to the transformative
science examples in §D2, depends on and benefits tremendously from the range of support services
that TeraGrid provides. In the 2008 TeraGrid user survey, “the helpfulness of TeraGrid user support
staff” (84%), and “the promptness of user support ticket resolution” (82%) received the highest
satisfaction ratings of all TeraGrid resources. Transformative science on the TeraGrid is possible
due to the resources and services working together in concert, which requires a coordinated user
support system comprised of centralized mechanisms for user access, the TeraGrid-wide allocations
management process, a comprehensive user information infrastructure, the production of user
information content, a frontline user support system and user training efforts. Each of these functions
supports TeraGrid’s focus on delivering deep, wide, and open cyberinfrastructure to address the
diverse user needs and requirements.
We propose to continue and further improve this user support system, which requires interlocking
activities from several project areas: user information and access environment (§D.4.1); user
authentication and allocations (§D.4.2); frontline user support (§D.4.3) and training (§D.4.4), as well
as the advanced support capabilities proposed in §D.3 above. The User Facing Projects
and Core Services (UFP) area oversees the activities described in sections §D.4.1 and §D.4.2
below; the User Services area coordinates the activities of Frontline User Support described in
section §D.4.3; multiple areas (User Services, Advanced User Support, and Education,
Outreach and Training) are working in unison to coordinate, develop and deliver HPC training on
topics requested by users. All of these activities will continue to be coordinated by the User
Interaction Council (UIC), which provides for day-to-day collaboration among the directors of the above-mentioned
areas, with the GIG Director of Science participating. Their strategic and operational perspectives
will be essential in ensuring continued user success as the resource mix changes and the TeraGrid
program transitions to the XD awardee(s).
D.4.1 User Information and Access Environment
User access to the resources at RP sites is supported through a coordinated environment of user
information and remote access capabilities. This objective ensures that users are provided with
current, accurate information from across the TeraGrid in a dynamic environment of resources and
services.
Building on a common Internet backend infrastructure, the UFP team maintains and updates the
TeraGrid web site, the TeraGrid User Portal (TGUP), and the KnowledgeBase. In 2008, these sites
delivered more than half a million web, portal, and KnowledgeBase hits each month.
The TGUP is the central user environment that allows users to access and use resources across RP
sites. The TGUP provides single-sign-on capability for RP resources, a multi-site file manager,
remote visualization, queue prediction services, and training events and resources. In 2009, the
portal plans to expand its interactive services by deploying job submission and metascheduling
capabilities. Furthermore, the TGUP plans to expand its customization features to give users a
personalized TeraGrid experience that caters to their requirements and scientific goals. This
includes presenting information in domain views, listing domain-related software, enabling user
forums, and allowing users to share information with other TeraGrid users in their field of science.
To help individual users as well as the providers of community gateways, UFP develops and
operates a suite of resource and software catalogs, system monitors, and TeraGrid’s central user
news service. Such services are essential to providing up-to-the-minute information about a dynamic
resource environment. UFP services, such as the Resource Description Repository, leverage
TeraGrid’s investment in Information Services wherever possible to minimize the RP effort needed to
integrate resources.
The team also produces and delivers TeraGrid-wide documentation from a central content
management system and maintains the Knowledgebase to answer frequently asked
questions. The UFP team also maintains processes to provide quality assurance for the user
information we deliver. This includes managing web pages, posting new and updated
documentation, working with the External Relations group and all relevant subject-matter experts,
and continually developing and updating Knowledgebase articles.
During the TeraGrid Extension period, we will continue to update and enhance the current set of
user access and information offerings, prioritizing based on user requests and the evolving TeraGrid
resource environment.
D.4.2 User Authentication and Allocations
The UFP team operates and manages the procedures and processes—adapting and updating its
services to evolving TeraGrid policies as necessary—for bringing users into the TeraGrid
environment and establishing their identity; and making allocations and authorization decisions for
use of TeraGrid resources. During this extension to TeraGrid, these procedures and processes will
need to support users through the changes to the TeraGrid resource portfolio resulting from the
transition to the XD awardee(s).
By providing a common access and authentication point, UFP supports TeraGrid's common user
environment, simplifies multi-site access and usage, and hides the complexity of working with
multiple RP sites through such capabilities as single sign-on and Shibboleth integration. The
integration of the community-adopted Shibboleth will allow TeraGrid to scale to greater numbers of
users with the same staffing levels and permit users to authenticate once to access both TeraGrid
and local campus resources. The central authorization and allocation mechanisms supported by
UFP make cross-site activities possible with minimal effort, make it easier for PIs to share allocations
with students and colleagues, eliminate duplication of effort among RPs, and reduce RP costs.
Through the TGUP, a user will create his or her TeraGrid identity and authenticate using either
TeraGrid- or campus-provided credentials. Once a TeraGrid identity is established, any eligible user
can then request allocations (as the PI on a TeraGrid project) or be authorized to use resources as
part of an existing project. In 2008, current UFP processes added 1,862 new users to the TeraGrid
community (148 more than in 2007), and the TGUP, web site, and Knowledgebase recorded
thousands of unique visitors each month.
The TeraGrid allocations processes are a crucial operational function within UFP. In particular, the
UFP area implements the TeraGrid policies for accepting, reviewing, and deciding requests for
Startup, Education, Research and ASTA allocations. These procedures include managing the
quarterly meetings of the TeraGrid Resource Allocations Committee (TRAC), and coordinating an
impartial, multidisciplinary panel of nearly 40 computational experts. To ensure appropriate and
efficient use of resources, the TRAC reviews Research and ASTA requests and recommends
allocation amounts for PIs who wish to use significant fractions of the available resources. In 2008,
this review process covered more than 300 requests for hundreds of millions of HPC core-hours and
about fifty ASTA projects. In addition to the quarterly TRAC process, the UFP team is responsible for
the ongoing processing of Startup and Education requests, Research project supplements and
TRAC appeals, as well as extensions, transfers, and advances. More than 750 Startup and
Education requests were submitted and processed in 2008.
During the TeraGrid Extension period, we will continue to develop improvements to the
authentication and allocations interfaces and processes. These will encompass enhancements to
the submission interface of POPS (the System for TeraGrid Allocation Requests) based on user
feedback and policy changes, further integration of POPS and TGUP, and reducing the time it takes
from a user’s first encounter with TeraGrid to his or her first access to resources.
D.4.3 Frontline User Support
We propose to continue and further improve the frontline user support structure that has made the
TeraGrid a successful enabler of breakthrough science. This will comprise the TeraGrid Operations
Center (TOC) at NCSA and the user services working group, which assembles user consulting staff
from all the RP sites under the leadership of the Area Director for User Support. Users will submit
problem reports to the TOC via email to help@teragrid.org, by web form from the TeraGrid User
Portal, or via phone (866.907.2383). Working 24/7, the TOC will create a trouble ticket for each
problem reported, and track its resolution until it is closed. The user will be automatically informed
that a ticket has been opened and advised of the next steps.
If a ticket cannot be resolved within one hour at the TOC itself, it is assigned to a member of the user
services working group, who begins by discussing the matter with the user. The consultant may
request the assistance of other members of the working group, advanced support staff, systems, or
vendor experts. The immediate goal is to ensure that the user can resume his or her scientific work
as soon as possible, even if addressing the root cause requires a longer-term effort. When a
proposed root-cause solution becomes available, we contact the affected users again and request
their participation in its testing. Strategies that are identified as benefiting other users are
incorporated into the documentation, Knowledgebase, and training materials to benefit all users.
TeraGrid frontline support will also continue to take a personal, proactive approach to preventing
issues from arising in the first place, and to improve the promptness and quality of ticket resolution.
This will be done by continuing the User Champions program, in which RP consultants are assigned
to each TRAC award by discussion in the user services working group, taking into account the
distribution of an allocation across RP sites and machines, and the affinity between the group and
the consultants based on expertise, previous history, and institutional proximity. The assigned
consultant contacts the user group as their champion within the TeraGrid, and seeks to learn about
their plans and issues.
We will continue to leverage the EOT area's Campus Champions program to fulfill this same contact
role with respect to users on their campuses, especially for Startup and Education grants. Campus
Champions are enrolled as members of the user services working group, and thus are being trained
to become "on site consultants" extending the reach of TeraGrid support.
We propose user engagement and sharing and maintaining best practices as the ongoing focus of
user support coordination. This will allow us to effectively assist the user community in the transition
to a new TeraGrid resource mix and organizational structure through the XD program.
D.4.3.1 User Engagement
The user support team will provide the TeraGrid with ongoing feedback by means of surveys as well
as day-to-day personal interaction. The 2011 TeraGrid user survey will be designed and
administered by a professional evaluator selected by the GIG. Topics to be included in the survey,
the population to be surveyed, and the analysis of the results will be iterated between the evaluator
and the TeraGrid ADs, working groups, and Forum, with feedback from the SAB, with the US area
director functioning as the process driver. The final report on the 2011 user survey will be complete
by March 15, 2011.
Personal interaction between users and the TeraGrid consultants will continue to be essential in
providing us with feedback on a day-to-day basis. This process will be coordinated in the user
services working group, via the User Champions and Campus Champions programs. The nature of
the problems encountered will inform the selection of Advanced Support for Projects activities
(§D.3.1.2). The Campus Champions programs will be employed to enlist appropriate users as
testers for proposed new TeraGrid resources and CTSS capabilities that specifically address these
users' priority needs and interests. In particular, we will support the Software Integration (§D.5.1),
Quality Assurance (§D.5.6) and Common User Environment (§D.5.7) teams' work. The program plan
for PY5 calls for the user support and UFP teams to formulate a plan to realize the potential of social
networking mechanisms, such as moderated forums, for user engagement. This capability will enter
production in the TeraGrid Extension period. In PY5 we will also create a repository of the user
suggestions obtained by various methods, and how they are followed up. In the TeraGrid Extension
period, our experiences will populate this repository.
D.4.3.2 Share and Maintain Best Practices for Ticket Resolution
In the user services working group, the US area director and coordinators will continue to focus on
helping the consultants at all the RPs to ensure that the time to suggesting a solution to the user is
minimized, and that progress in resolving a ticket is communicated to the user at least once a week.
The discussion of pending tickets and lessons learned from recently closed ones will continue to be
a standing item at every meeting of the working group. The working document outlining Ticket
Resolution Guidelines will continue to be refined based on the real life operational experiences
encountered, with the ever more complex user workload and TeraGrid resource menu. The
guidelines provide for lessons learned from each problem to be fed into the TeraGrid's
documentation, training, and user feedback processing systems. They show how to recognize user
problems that may require advanced support and how to help the user apply for advanced support.
D.4.4 Training
Our training goal is to prepare users to make effective use of the TeraGrid resources and services to
advance scientific discovery. Objectives include: regular assessment of users' needs and
requirements; development of HPC training materials that allow the research community to make
effective use of current and emerging TeraGrid resources and services; delivery of HPC training
content through live, synchronous and asynchronous mechanisms to reach current and potential
users of TeraGrid across the country; providing high quality reviewed HPC learning materials while
leveraging the work of others to avoid duplication of effort; and maintaining a calendar of HPC and
CI training events.
The EOT team in collaboration with AUS and User Services conducts an annual HPC training
survey, separate from the annual TeraGrid User Survey, to assess community needs for training in
more depth. There will also be surveys of participants during each training session. Survey results
are used to identify areas for improvement and to identify topics for new content development. The
training that is offered in response to the identified needs will focus on expanding the learning
resources and opportunities for current and potential members of the TeraGrid user community by
providing a broad range of live, synchronous and asynchronous training opportunities. The topics will
span the range from introductory to advanced HPC topics, with an emphasis on petascale
computing.
We will continue to develop and deliver new HPC training content to address novel community
needs. The development efforts will involve multiple working groups as appropriate including the
User Services, AUS, Science Gateways, and DVI teams. The training teams will build on the lessons
learned and successes from past efforts to make more training available through synchronous
delivery mechanisms to reach more users across the country. There will be an increased level of
effort in the TeraGrid Extension period directed towards accelerating the pace of making content
available via asynchronous tutorials. All training materials will be reviewed to ensure that they are of
the highest quality, before they are made available to the community for broad dissemination. The
training team will document the challenges, effective strategies, and lessons learned from the efforts
to date to share with the XD awardee(s).
D.5 Integrated Operations of TeraGrid
The integrated operations of TeraGrid encompass a range of activities spanning software
integration and support, operational responsibilities across the project and at the Resource Provider
sites, and efforts to maintain quality, usability and security of the distributed environment for the user
community. These are activities found at any computational facility, but in TeraGrid they are distributed
and coordinated across the breadth of the project in order to provide users with a coherent view of a
collection of resources beyond what any single facility could offer. Specific activities include the 24x7
TeraGrid Operations Center (TOC), the network interconnect between the RP sites, phone
and email user support and issue tracking, resource and service monitoring, user management and
authentication, production security and incident response, monitoring and instrumentation, and the
integration and maintenance of a common software state and consistent computing environment. In
the TeraGrid Extension period, TeraGrid will continue to maintain these activities and make
advancements as described in the subsequent section.
Our focus on the direct operational aspects of TeraGrid is important for the science community.
Although TeraGrid resources are specialized for particular tasks, users frequently make use of more
than one resource or migrate over time from one resource to another. A sense of common system
design, operation, and user support is very important for TeraGrid users. The network provides
capacity well beyond what users would have available at their home institutions and new security
services are rapidly bringing us to the point where users will be able to simply use certificates across
administrative domains using gateways, portals or grid applications.
D.5.1 Packaging and Maintaining CTSS Kits
Software components are a critical element of TeraGrid’s common user environment. Significant
effort is required to satisfy the critical user need for uniform interfaces in the face of great diversity of
hardware/OS platforms on TeraGrid and the ongoing discovery of bugs and security flaws. The
software packaging team generates: (1) rebuilt software components for TeraGrid resources to
address security vulnerabilities and functionality issues; (2) new builds of software components
across all TeraGrid resources to implement new CTSS kits; and (3) new builds of software
components to allow their deployment and use on new TeraGrid resources. This work is strictly
demand-driven. During this project period we will add Track 2d resource(s), XD visualization
resource(s), and new data archive systems. The packaging team reuses and contributes to the NSF
OCI Virtual Data Toolkit (VDT) production effort.
This team also responds to help desk tickets concerning existing CTSS capability kits and assists
both resource providers and software providers with debugging software issues, including but not
limited to defects.
D.5.2 Information Services
TeraGrid’s integrated information service (IIS) is the means by which TeraGrid resource providers
publicize availability of their services, including compute queues, software services, local HPC
software, data collections, and science gateways. By the end of TeraGrid’s PY5, most of the
descriptive data about TeraGrid that is (or formerly was) stored in a myriad of independently operated
databases will be accessible in one place via the IIS. The IIS combines distributed publishing with
centralized aggregation: each data provider publishes its own data independently of others, while
users see a coherent combined view of all data. The IIS is used throughout the TeraGrid system—in
user documentation, automated verification and validation systems, automatic resource selection
tools, and even in project plans—to provide up-to-date views of system capabilities and their status.
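The aggregation pattern is illustrated by the Python sketch below, in which each provider's self-published records are fetched and merged into one combined view; the endpoint URLs and record structure are hypothetical and do not reflect the actual IIS schema.

    # Illustrative sketch only: distributed publishing with centralized
    # aggregation. Endpoint URLs and record structure are hypothetical.
    import json
    import urllib.request

    RP_ENDPOINTS = [
        "http://info.rp-a.example.org/kits.json",   # hypothetical per-RP endpoints
        "http://info.rp-b.example.org/kits.json",
    ]

    def aggregate(endpoints):
        """Fetch each provider's published data and merge it by resource name."""
        combined = {}
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    for record in json.load(resp):
                        combined[record["resource"]] = record
            except OSError:
                continue   # an unreachable provider must not break the central index
        return combined

    if __name__ == "__main__":
        print(json.dumps(aggregate(RP_ENDPOINTS), indent=2))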
During this project period, the IIS will integrate new Track 2d resource(s), new XD visualization
resource(s), and possible new data archive systems. Several new capabilities will also be tracked by
IIS, including WAN file systems and advanced scheduling capabilities. We anticipate significant
growth in the use of the IIS—both by humans and by automated systems—that may require
capacity/scalability improvements for the central indices. Finally, the IIS will be prepared for
transition to the new XD CMS awardee.
D.5.3 Supporting Software Integration and Information Services
Several advanced user capabilities on TeraGrid rely on centralized services for their day-to-day
operation. These include: automatic resource selection, co-allocation, queue prediction, on-demand/urgent computation, the integrated information service (IIS), and our multi-platform software
build and test capability. We will maintain these centralized services and ensure their high availability
(99.5%) to the TeraGrid user community. High availability requires redundant servers in continuous
operation in distributed locations with a design that includes automatic, user-transparent failover. We
are able to provide this at a low cost using virtual machine (VM) hosting technology at multiple RP
and commercial locations and a dynamic DNS system operated by the TeraGrid operations team.
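A minimal sketch of the failover logic, assuming hypothetical replica hostnames, is shown below; in production the result of such a check feeds the dynamic DNS system rather than being printed.

    # Illustrative sketch only: health check behind user-transparent failover for
    # a replicated central service. Hostnames are hypothetical.
    import socket

    REPLICAS = [("info-primary.example.teragrid.org", 8080),
                ("info-backup.example.teragrid.org", 8080)]

    def first_healthy(replicas, timeout=3.0):
        """Return the first replica accepting TCP connections, or None."""
        for host, port in replicas:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return host
            except OSError:
                continue
        return None

    if __name__ == "__main__":
        active = first_healthy(REPLICAS)
        print("point the service DNS name at:", active or "no healthy replica")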
D.5.4 Networking
The goal of the TeraGrid network is to facilitate high-performance data movement between the
TeraGrid Resource Provider (RP) sites. As such, this network is exclusive to TeraGrid applications,
such as file access/transfer via Global File Systems, data archiving, and GridFTP. Users at non-TeraGrid institutions access TeraGrid resources through the site's normal research and education
networks.
The TeraGrid network connects all TeraGrid RP sites and resources at 10 Gb/s or more. The
backbone network is comprised of hub routers in Chicago, Denver and Los Angeles that are
maintained by the GIG. The three routers are connected with two 10 Gb/s links, one primary and one
backup. The configuration provides for redundancy in case of circuit failures. The RPs connect to one
of these hub routers, and maintain local site routers to connect to their local network and resources.
In the TeraGrid Extension period, the TeraGrid networking group will continue to operate the TeraGrid
network in its current configuration, which includes the maintenance contracts for the backbone
network hardware. The working group will continue to provide the same support it has for the first five
years of the project, which includes troubleshooting, performance monitoring, and tuning.
In addition, this project will fund connectivity for sites with computational resources not provided under
Track 2. These sites are LONI/LSU, SDSC, UI, Purdue, ORNL, and PSC. PSC’s funding will be for
three months of connectivity support for Pople in advance of their Track 2 system coming online.
D.5.5 Security
Incident response
Security of resources and data is a top priority for the TeraGrid partnership. The TeraGrid Incident
Response (IR) team will continue to operate, coordinating and tracking incident information at the
RP sites. The IR team has members from all sites and coordinates via weekly conference calls and
over secured email lists. The team develops and executes response plans for current threats and
coordinates reporting to NSF regarding security events. The GIG will fund sites providing resources
not funded by Track 2 to provide security for those resources, including day-to-day security
maintenance and incident response. Track 2 sites will provide for their security from their operational
awards.
User-Facing Security Services
The TeraGrid provides a single sign-on mechanism to its user community, giving them a standard
method for authenticating to any TeraGrid resource to which they have access. This service
depends on a set of core services, including a Java-based, PKI-enabled SSH application in the TGUP,
single sign-on across resources provided by the MyProxy service, and the TeraGrid-wide Kerberos service. The MyProxy and Kerberos services are deployed at both NCSA and PSC for fault
tolerance. All of these services will be supported and maintained during the TeraGrid Extension
period in order to continue to facilitate simplified authentication for TeraGrid users.
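For illustration, the sketch below shows how a user (or a script acting on the user's behalf) would retrieve a short-lived proxy credential from a MyProxy server with the standard myproxy-logon client; the server hostname is hypothetical, and the client prompts interactively for the MyProxy passphrase.

    # Illustrative sketch: retrieve a short-lived proxy credential for single
    # sign-on. The MyProxy server hostname is hypothetical.
    import subprocess

    MYPROXY_SERVER = "myproxy.example.teragrid.org"   # hypothetical hostname

    subprocess.run([
        "myproxy-logon",
        "-s", MYPROXY_SERVER,   # MyProxy server to contact
        "-l", "jdoe",           # account under which the credential is stored
        "-t", "12",             # request a 12-hour proxy lifetime
    ], check=True)

    print("proxy retrieved; GSI-enabled tools (GridFTP, gsissh) can now use it")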
Additional TeraGrid Extension period activities will include supporting the advanced access services
for science gateways (the community accounts and Science Gateway capability kit described in
§D.3.5.1) and supporting the Shibboleth capability integrated into the TeraGrid User Portal (§D.4.2).
We will also continue to advance these services by providing for integration with the instrumentation
work (§D.5.8.3) to better track and analyze usage and continue to expand the user base of these
services in preparation for XD.
D.5.6 Quality Assurance
As the benefits of using grid services have been recognized by both individual users and science gateways, the
increase in demand for grid services has resulted in unexpected behavior and unanticipated
instability. One outcome of a concerted effort by many TeraGrid staff to address these problems was
the recognition of the need to formalize and establish a quality assurance (QA) effort for TeraGrid
systems and services. A QA group was formed in late 2008 to address this issue. The near-term
task of the group was to develop a plan to improve the availability of grid services as quickly as
possible. Work toward this goal of improved grid service availability and
reliability will continue through PY5. Looking ahead, the QA working group will continue to work toward improving the quality of
all aspects of using TeraGrid grid services. In PY6, the team will collaborate with the TAIS group to
transition their work to XD to ensure that the XD environment and services are of the same high
quality as those that will have been established for the TeraGrid project.
D.5.7 Common User Environment
A goal of the TeraGrid is to allow users to move between systems with relative ease. This is difficult
to achieve since it requires a high degree of coordination between sites with diverse resources. This
diversity of resources is a strength of the TeraGrid and imposing unnecessary uniformity can be an
obstacle to scaling and to using each resource's specific abilities to the fullest. In 2008, the Common
User Environment (CUE) group was established as a forum for strengthening TeraGrid's efforts to
provide a common environment and to strike the right balance of commonality and diversity across
the project's resources. The group quickly undertook extensive gathering of user requirements,
identified a series of recommendations, and has begun creating implementation plans based
on those recommendations. In PY6, the CUE group will continue to work with TeraGrid operations
and the QA groups to establish and refine the common environment and evaluate the effectiveness
of elements for the user community.
D.5.8 Operational Services
D.5.8.1 TOC Services
The TeraGrid Operations Center (TOC) will continue to provide 24x7 help desk services for the user
community. The TOC is accessible via toll-free telephone, email and the web. As an initial global
point of contact and triage center for the TeraGrid community, the TOC solves problems, connects
users to groups and individuals for problem resolution and maintains the TeraGrid Ticket System
(TTS). The TTS is used both to ensure issues receive appropriate follow up and to collect data on
the types of issues the users are facing in order to better focus project support resources.
D.5.8.2 UFP Operational Infrastructure and RP Integration
The User-Facing Projects (UFP) team operates a suite of services for providing users access to
TeraGrid resources and information. This set of services is geographically distributed and
encompasses the:
 TeraGrid User Portal
 TeraGrid Web Site
 TeraGrid Wiki and Content Management System, critical for internal project communications
 POPS, the system for TeraGrid allocation requests
 TeraGrid Central Database (TGCDB) and Account Management Information Exchange
(AMIE) servers for accounting, allocations, and RP integration
 Resource Description Repository
 TeraGrid Knowledgebase
 TeraGrid allocations and accounting monitoring tools
 Suite of resource catalogs, monitors, and news applications
In the TeraGrid Extension period, these services will continue to be operated and coordinated as
production services by the UFP team. UFP strives for a better than 99% uptime for all of these
components to ensure a productive and satisfying user experience.
D.5.8.3 Operational Instrumentation (device tracking)
The TeraGrid developed and supports a suite of operational instrumentation software that is used to
monitor grid and network usage. In the TeraGrid Extension period, this instrumentation will continue
to be developed to provide better integration of the different instrument platforms to simplify reporting
and provide integrated data views for the user community. New resources will be incorporated,
including LONI’s final network connection and the Track 2c and 2d platforms.
In order to facilitate adoption in the XD program and benefit the broader community, the reporting
system will be released as an open source tool for use by other organizations utilizing the Globus
monitoring system.
D.5.8.4 Inca Monitoring
Inca provides monitoring of TeraGrid resources and services with the goal of identifying problems for
correction before they hamper users. The Inca team at SDSC, who developed the software, will
continue to maintain its deployment on TeraGrid, including writing and updating Inca reporters (test
scripts), configuring and deploying reporters to resources, archiving test results in a Postgres
database, and displaying and analyzing reporter data in Web status pages. The Inca team will work
with administrators to troubleshoot detected failures on their resources and make improvements to
existing tests and/or their configuration on resources. In addition, the team will implement new tests
identified by TeraGrid working groups or CTSS kit administrators and deploy them to TeraGrid
resources as part of the suite of Inca reporters. The team will modify Web status pages as CTSS
and other working group requirements change. SDSC will continue to upgrade the Inca deployment
on TeraGrid with new versions of Inca (as new features are often driven by TeraGrid) and optimize
performance as needed.
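For context, the sketch below shows the kind of simple pass/fail probe that an Inca reporter encapsulates; it does not use the actual Inca reporter API, and the client being checked is only an example.

    # Illustrative sketch only: a stand-alone pass/fail probe of the kind an Inca
    # reporter wraps. This is not the Inca reporter API.
    import shutil
    import subprocess
    import time

    def check_client(client: str = "globus-url-copy") -> dict:
        """Report whether a required grid client is installed and responds."""
        result = {"test": f"{client} availability", "timestamp": time.time()}
        path = shutil.which(client)
        if path is None:
            result.update(passed=False, detail=f"{client} not found in PATH")
            return result
        proc = subprocess.run([path, "-version"], capture_output=True, text=True)
        result.update(passed=(proc.returncode == 0),
                      detail=(proc.stdout or proc.stderr).strip())
        return result

    if __name__ == "__main__":
        print(check_client())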
D.5.9 RP Operations
In developing plans for this proposal, the TeraGrid team took a strategic view of how best to allocate
funds that are reduced relative to those available to the program to date. In considering current
resources at RP sites, it was clear that we could not simply continue "business as usual." Resources
to be continued must provide a clearly defined benefit to the user community either through direct
provision of important capabilities or by providing a resource for developing/enabling important new
capabilities. With this in mind we came to consensus on a subset of resources to continue to
support.
D.5.9.1 Cluster Compute Resources
While it is clear that the lack of new computational resources curbed the growth curve of use by the community, the deployment of Ranger and subsequently Kraken has spurred a new surge in requests and usage from the community. As shown in Figure 2, the growth in requests and awards of resources has already matched the currently available resources, and given that there is little growth in the available resources, even when the Track 2c system becomes available some time in 1H2010, user demand will continue to clearly outstrip the availability of new resources. Still, given the budget available, the ratio of impact to cost was considered in evaluating resources to operate during the extension period.
[Figure 2: Requested, Allocated and Available Resources for TeraGrid Large Resources (BigBen, Abe, Ranger, Kraken). Log-scale plot of Requested, Awarded, Available, and Projected Available NUs, in millions of NUs.]
During the TeraGrid Extension period, we will retain the four primary IA32-64 cluster resources
shown in Table 1. While these clusters will provide capacity (collectively approximately half of
Ranger), they will be focused on supporting four additional efforts:
Large scale interactive and on-demand (including science gateway) use. We have been given
clear indications from the user community, the Science Advisory Board, and review panels that more
effort is needed in this area. Often researchers need an interactive resource in order to be able to
effectively develop models and debug applications at scale. In some cases this will be in preparation
for longer-running execution on Ranger, Kraken, or other large-scale systems. In other cases it is the best mode of use to conduct science. Further, many science gateways need access to resources with short response times to provide a useful experience for the gateway users. We will make use of reservations and pre-emptive scheduling to satisfy the needs of such gateways.

System      Peak Performance   Memory    Nodes   Disk     Manufacturer
Abe         90 TF              14.4 TB   1200    400 TB   Dell
Lonestar    62 TF              11.6 TB   1460    107 TB   Dell
Steele      66 TF              15.7 TB   893     130 TB   Dell
QueenBee    51 TF              5.3 TB    668     192 TB   Dell
Table 1: TeraGrid IA32-64 Cluster Systems
Transition platform to Track 2 systems: These systems will provide a transition platform for those
coming from a university- or departmental-level resource and moving out into the larger national CI.
Typically such researchers are accustomed to using an Intel-based cluster and these resources will
provide a familiar platform with which to expand their usage and to work on scalability and related
issues. Researchers will not be restricted to taking this path and will have the option to move directly
to the Track 2 systems, but many have asked for this type of capability. By making use of these
platforms in this way, we also alleviate the pressure of smaller jobs on the larger systems that have
been optimized in their configuration and operational policies to favor highly scalable applications.
Metascheduling and Job Affinity: These resources will have the metascheduling CTSS Kit installed and will be allocated as a single resource. Given that there is some variation amongst these systems (e.g., Steele has a mixture of GigE and high-performance interconnects, while Lonestar is configured to provide more memory bandwidth per core), we will preferentially schedule jobs that need particular characteristics to the resources that provide them. This will maximize the efficiency and effectiveness of the collective resource.
Support for OSG Jobs: As noted in §D.X.Y, there have been efforts to further develop the relationship between TeraGrid and OSG. As part of our work during the TeraGrid Extension period, we will support the running of “traditional” OSG jobs (high-throughput, single-node execution) in addition to our efforts to support the less common parallel jobs from OSG users. At a minimum, these single-node jobs can backfill the schedule across the set of resources, but we also want to allow them “reasonable” priority, rather than the low priority that jobs with modest parallelism typically receive under the scheduling policies of large systems today. Making use of the job affinity scheduling already mentioned, we will schedule these jobs to appropriate resources (e.g., the GigE-connected portions of Steele).
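To make the scheduling ideas in this subsection concrete, the following Python sketch shows one way a metascheduler might route jobs across the four clusters: interactive and gateway work is steered to nodes held under a standing reservation (with pre-emption reclaiming them from backfill when needed), jobs are matched to clusters by affinity, and single-node OSG jobs backfill the GigE-connected portion of Steele. This is an illustrative sketch only, not the CTSS metascheduling implementation; the job description fields, function names, and the choice of which clusters hold gateway reservations are assumptions made for the example.

# Illustrative sketch only: affinity-aware routing of jobs across the four
# IA32-64 clusters.  Cluster characteristics follow the discussion above;
# which clusters hold standing gateway reservations is an assumption, and
# all job fields and function names are hypothetical.

CLUSTERS = {
    "Abe":      {"interconnect": "IB",    "high_mem_bw": False, "gateway_reservation": True},
    "Lonestar": {"interconnect": "IB",    "high_mem_bw": True,  "gateway_reservation": True},
    "Steele":   {"interconnect": "mixed", "high_mem_bw": False, "gateway_reservation": False},
    "QueenBee": {"interconnect": "IB",    "high_mem_bw": False, "gateway_reservation": False},
}

def route_job(job):
    """Return (cluster, policy) for a job submitted to the collective resource."""
    # 1. Interactive and gateway jobs go to a cluster holding a standing
    #    reservation so response times stay short; pre-emptive scheduling
    #    would reclaim those nodes from backfill work when needed.
    if job.get("gateway") or job.get("interactive"):
        for name, attrs in CLUSTERS.items():
            if attrs["gateway_reservation"]:
                return name, "reserved / pre-emptive"
    # 2. Single-node, high-throughput OSG jobs backfill the GigE-connected
    #    portion of Steele at a modest but non-starving priority.
    if job.get("source") == "OSG" and job.get("nodes", 1) == 1:
        return "Steele", "backfill (GigE partition)"
    # 3. Jobs that benefit from more memory bandwidth per core prefer Lonestar.
    if job.get("needs_mem_bandwidth"):
        return "Lonestar", "normal"
    # 4. Remaining tightly coupled jobs go to any cluster with a
    #    high-performance interconnect (first match in this sketch).
    for name, attrs in CLUSTERS.items():
        if attrs["interconnect"] == "IB":
            return name, "normal"
    return "Steele", "normal"

if __name__ == "__main__":
    for job in [{"gateway": True, "nodes": 4},
                {"source": "OSG", "nodes": 1},
                {"needs_mem_bandwidth": True, "nodes": 64},
                {"nodes": 256}]:
        print(job, "->", route_job(job))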
D.5.9.2 Unique Computing Resource
The NCSA Lincoln cluster provides a unique GPU-based computing resource at scale. With 192
compute nodes and 96 S1070 Tesla units, this system represents the first at-scale GPU-based
resource to be available to the academic research community. Initial allocation requests are
substantial and early applications work has shown it to be effective for a subset of important
applications (NAMD, WRF …). While it will not be easily used for a broad range of applications, it will
provide a powerful capability for a set of important applications, and it can be used to support porting
of applications to the hybrid machines that we expect to be more common in the future.
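To give a flavor of the porting effort involved in moving an application kernel onto Lincoln's Tesla hardware, the short Python/PyCUDA sketch below offloads a simple array-scaling loop to a GPU. It is not taken from NAMD or WRF and does not reflect any specific Lincoln configuration; the kernel, problem size, and launch parameters are hypothetical and serve only to show the general offload pattern (copy data to the device, run a CUDA kernel, copy results back).

# Minimal sketch of offloading a simple array operation to a GPU; the kernel
# and problem size are hypothetical and chosen only to show the pattern.
import numpy as np
import pycuda.autoinit          # initializes a CUDA context on the default device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;       // each thread scales one element
}
""")
scale = mod.get_function("scale")

n = 1 << 20
x = np.random.randn(n).astype(np.float32)
expected = 2.0 * x

# drv.InOut copies the array to the device before the launch and back afterward.
scale(drv.InOut(x), np.float32(2.0), np.int32(n),
      block=(256, 1, 1), grid=((n + 255) // 256, 1))

assert np.allclose(x, expected)
print("GPU scaling of", n, "elements verified")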
D.5.9.3 Supporting Virtual Machines
An emerging need is the use of virtual machines (VMs) to support scientific calculations. Since 2008,
IU has provided a VM hosting service, using their Quarry resource, that is rapidly gaining adoption
within TeraGrid; IU will continue this service through the extension period. The system currently
hosts 17 VM instances for 16 different projects supported and allocated by the TeraGrid. Projects
include science gateways (e.g. front-end services), TeraGrid Information Services, and data
collections (e.g. Flybase, a data collection for Drosophila genomics). Gateways are provided with
complete virtual operating systems (including root access) for their development and deployment
needs. OpenVZ is used for virtualization; this restricts guest operating systems to compatible (RHEL) kernels but incurs much less performance overhead since there is no hypervisor. The shared-kernel approach has also allowed IU's Lustre-WAN file system (in testing now, to be expanded in the extension period) to be mounted to provide additional disk space for images. This support for VMs will also facilitate support of OSG users. We
also note that VMs represent another potential usage modality for the four cluster resources noted
above, along with Quarry at IU.
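As a rough sketch of how a gateway front-end container might be provisioned on such a hosting service, the Python fragment below wraps standard OpenVZ vzctl commands. The container ID, OS template name, hostname, and disk quota are hypothetical placeholders, not the actual Quarry configuration, and running the commands requires root on an OpenVZ host.

# Rough sketch: provisioning an OpenVZ container for a gateway front-end by
# wrapping standard vzctl commands.  The container ID, template, hostname,
# and disk quota below are hypothetical placeholders, not Quarry's setup.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

def provision_gateway_container(ctid=101,
                                template="centos-5-x86_64",
                                hostname="gateway-dev.example.org",
                                diskspace="20G:22G"):   # barrier:limit
    # Create the container from an OS template; OpenVZ containers share the
    # host kernel, hence the RHEL-compatible restriction noted above.
    run(["vzctl", "create", str(ctid), "--ostemplate", template])
    # Set the hostname and disk quota, saving them to the container config.
    run(["vzctl", "set", str(ctid), "--hostname", hostname, "--save"])
    run(["vzctl", "set", str(ctid), "--diskspace", diskspace, "--save"])
    # Start the container; the gateway team then receives root access inside it.
    run(["vzctl", "start", str(ctid)])

if __name__ == "__main__":
    provision_gateway_container()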
D.5.9.4 Supporting the Track 2c Transition
PSC's Pople (a 768-processor Altix 4700 with 1.5 TB of shared memory) and Cobalt at NCSA represent the large shared-memory resources in the TeraGrid. They are in great demand and consistently oversubscribed at TRAC meetings. PSC has exploited the availability of shared memory
resources to attract new communities to the TeraGrid, including researchers in game theory and machine translation, parallel Matlab users, and others. The Track 2c system will deliver substantially more shared-memory resources to the national community. However, since the start of the Track 2c system will be somewhat delayed relative to the original schedule, funds are requested for a three-month period of Pople operation in the TeraGrid Extension period to assure continued production access to these valuable resources.
D.6 Education, Outreach, Collaboration, and Partnerships
Work in education, outreach, collaboration, and partnerships is driven by both community
requirements and the desire to advance the science, technology, engineering and mathematics
(STEM) fields of education and research. TeraGrid regularly assesses requirements in these areas
through the annual TeraGrid User Survey, an annual HPC training and education survey, surveys
completed at training and education events, and discussions with community members, the SAB,
and our external partners.
TeraGrid’s Education, Outreach, and Training (EOT) area seeks to engage and retain larger and more diverse communities in advancing scientific discovery, emphasizing communities under-represented by race, gender, disability, discipline, and institution type. EOT will continue to build
strong internal and external partnerships, leveraging opportunities to offer the best possible HPC
learning and workforce development programs, and increasing the number of well-prepared STEM
researchers and educators. EOT will continue to conduct formative and summative evaluations of all
programs and activities, allowing TeraGrid to improve the offerings to best address community
requirements, to identify best practices, and to identify transformative impact among the target
communities.
For the TeraGrid Extension, the TG Forum has determined an increased EOT and ER effort level is
necessary to further the goals of TeraGrid. The EOT and ER teams will work closely with the TG
Forum and the TG ADs to plan this increased level of work, including the relevant WBS elements,
scopes of work, and budgets, to be shared among the GIG and the RPs using the TeraGrid
integrated planning process. The plans will be vetted with the SAB prior to being finalized.
In the TeraGrid Extension period, we have increased the support level to involve undergraduate and graduate students. We will mentor students to encourage them to pursue STEM education and
careers, and they will be provided with travel support to attend the TeraGrid Conference, where they
can learn from and share with other students and conference attendees. The students will be
encouraged to submit papers and posters and to enter student competitions to showcase their
knowledge and skills.
All EOT activities will factor in the transition from TeraGrid to XD, and the EOT team will document
all activities for hand-off to the XD awardees.
D.6.1 Education
TeraGrid has established a strong foundation in learning and workforce development focused
around computational thinking, computational science, and quantitative reasoning skills, to motivate
and prepare larger and more diverse generations to pursue advanced studies and professional
careers in STEM fields. RPs have led, supported, and directly contributed to K-12, undergraduate,
and graduate education programs across the country. Activities focus on:
• Providing professional development for K-12 teachers and undergraduate faculty;
• Supporting curriculum development efforts by K-12 teachers and undergraduate faculty;
• Collecting and disseminating high-quality, reviewed curricular materials, resources, and activities for broad use; and
• Engaging students to excite, motivate, and retain them in STEM careers.
The education team will provide support to educators developing computational science and HPC
curriculum materials through local, regional, and national programs and through the 5 year SC07-
SC11 Education Program. Workshops, institutes and tutorials will be offered to support teachers and
faculty throughout the year. Computational science and HPC curricular materials will be reviewed
and disseminated through the Computational Science Education Reference Desk, a Pathways
project of the National Science Digital Library.
TeraGrid will provide students with internships, research experiences, professional development,
competitions, and numerous learning opportunities, to recruit, excite and motivate many more
students to pursue STEM education and STEM careers. Particular emphasis will be placed on
engaging under-represented students. The internships and research experiences will include
summer experiences and year-long involvement at RPs.
In the TeraGrid Extension, the education effort will include an increased support level for two
complementary components: undergraduate education materials development and student
competitions.
The first effort’s focus is working with faculty to develop undergraduate HPC materials including
modules, teacher activities, and student activities for four different disciplinary areas, to be identified
in an initial meeting of faculty and TeraGrid staff. The effort will build on expertise of TeraGrid AUS
and SGW teams. Following an application process, we will select faculty in consultation with the
SAB to ensure appropriate disciplinary representation, and participants will require institutional
commitments to support their efforts. The team will reconvene halfway through the project to present
materials, receive constructive feedback for improvements, and then pilot each other’s materials
during the second half of the year to demonstrate re-usability by others. Materials will be reviewed a
final time and then posted to HPC University for broad dissemination. We plan to add 0.2 FTE to coordinate this effort, along with a $10,000 faculty stipend.
To focus on engaging students from middle school through college in STEM challenges, the second
effort will build on the faculty work and on the “Computational Science Problem of the Week” effort
started in March 2009. The challenges are intended to empower students to unleash their minds to
solve challenging problems and to be recognized for their accomplishments. We will build on this
foundation of student excitement to engage national programs that foster student engagement
through local, regional, and national competitions such as the TeraGrid Conference competitions
and SC Education Program competitions. We will have an additional 0.2 FTE for a coordinator and 2
undergraduate students to develop challenge problems and review student submissions.
We are working with the National Science Olympiad (http://soinc.org/), which has for 25 years been
engaging over 5,300 teams of middle school and high school students from 48 states, to introduce
computational science challenges into their national effort. This is intended to excite, engage, and
empower students across the country to pursue STEM education and careers and to advance
science through the use of computational methods. We will also explore opportunities to work with
the ACM Student Programming Contest, the National College Bowl, and the Siemens Competition in
Math, Science & Technology as other possible venues to engage thousands more students across
the country. We will use emerging youth-oriented collaboration spaces (Facebook, MySpace, etc.) to
reach out to students where they live and communicate with one another in cyberspace.
The team will document the challenges, effective strategies, and lessons learned from these efforts
and share them publicly. The team will emphasize strategies for professional development,
curriculum development, dissemination of quality reviewed materials, student engagement, recruiting
in under-represented communities, and strategies for working with other organizations to sustain and
scale up successful education programs.
D.6.2 Outreach
TeraGrid has been conducting an aggressive outreach program to engage new communities in
using TeraGrid resources and services. The impact can be seen in the number of new DAC (now
Start-up and Education) accounts that have been established over the last few years. In 2007, there
were 736 requests for DACs of which 684 were approved. In 2008, there were 762 requests of which
703 were approved. There were 17 education accounts approved in the last quarter of 2008.
TeraGrid has also been working to increase the number of new users. In 2007 there were 1,714 new
users, and in 2008 there were 1,862 new users.
TeraGrid has been proactive about meeting people “where they live” on their campuses, at their
professional society meetings, and through sharing examples of successes achieved by their peers
in utilizing TeraGrid resources. Outreach programs include Campus Champions, Professional
Society Outreach, EOT Highlights, and EOT Newsletter. These activities focus on:
• Raising awareness of TeraGrid resources and services among administrators, researchers, and educators across the country;
• Building human capacity among larger and more diverse communities to broaden participation in the use of TeraGrid; and
• Expanding campus partnerships.
Based on the current interest level, we plan to rapidly expand the Campus Champions program. The
June 2008 launch of the Campus Champions program resulted in a groundswell of interest from
campuses across the country. What began as a start-up effort to recruit 12 campuses has now
reached 30 campuses with another 30+ in discussions about joining. The Campus Champions
representatives (Champions) have been providing great ideas for improving TeraGrid services for
both people new to TeraGrid as well as “old hands”. The TeraGrid User Survey shows that more
campus assistance would be valuable to users, and that more start-up assistance, documentation,
and training are needed for users and the Champions. We plan to continue to support this effort, and
because of the high interest level we plan to increase the effort level. We will invest an additional 0.5
FTE to coordinate the program and an additional 0.5 FTE to provide technical support, training, and
documentation that will directly benefit the Champions. We also will add an undergraduate student to
assist the professional staff working with the Champions. TeraGrid will work with the CI Days team
(OSG, Internet2, NLR, EDUCAUSE, and MSI-CIEC) to reach more campuses and to enlist more
Campus Champions.
We will also continue to raise awareness of TeraGrid through participation in professional society
meetings, emphasizing under-represented disciplines and under-represented people. TeraGrid will
present papers, panels, posters, workshops, tutorials, and exhibits to reach as many people as
possible to encourage them to utilize TeraGrid. TeraGrid will continue to host the TeraGrid
Conference in June and participate in the annual SC Conference.
Through Campus Champions, CI Days, and professional society outreach, TeraGrid will identify new
users and potential users that may benefit from support from the User Services and AUS teams to
become long-term users of TeraGrid. We will work with the Science Director, the SAB, and external
partners to identify these candidate users. A concerted effort will be made to reach out to areas of
the country that have traditionally been under-represented among TeraGrid users, including the
EPSCoR states.
We will document challenges, effective strategies, and lessons learned from current efforts to share
with the XD awardees and the public, emphasizing strategies for identifying additional outreach
opportunities, identifying and engaging new users, and nurturing strong campus partnerships to
broaden the TeraGrid and XD user bases.
D.6.3 Enhancing Diversity
Through both its education and outreach efforts, TeraGrid will continue to target under-represented
disciplines with the goal of enhancing the racial and ethnic diversity of the TeraGrid user community.
We will engage industry, international communities, and other organizations on activities of common
interest and provide community forums for sharing the impact of TeraGrid on society. We will
continue to work with organizations representing under-represented individuals, including
organizations in the Minority Serving Institution Cyberinfrastructure Empowerment Coalition (MSI-CIEC): the American Indian Higher Education Consortium (AIHEC), the Hispanic Association of
Colleges and Universities (HACU), and the National Association for Equal Opportunity (NAFEO). We
will continue to reach out to EPSCoR institutions via Campus Champions. TeraGrid will also
continue to engage larger numbers of students, with an emphasis on activities targeting under-represented students.
D.6.4 External Relations (ER)
To meet NSF, user, and public expectations, information about TeraGrid success stories—including
science highlights, news releases, and other news stories—will be made accessible via the TeraGrid
website and distributed to news outlets that reach the scientific user community and the public,
including iSGTW, HPCwire, and NSF through the Office of Cyberinfrastructure (OCI) and the Office
of Legislative and Public Affairs (OLPA). We also design and prepare materials for the TeraGrid
website, conferences, and other outreach activities.
While TeraGrid is yielding more and more success stories, the ER team cannot document all of them
due to lack of resources. Further, as we enter the TeraGrid Extension, considerable time and
attention is needed to document lessons learned to assist with the transition to XD. Therefore, we
will increase the effort level by an additional 0.75 FTE. In addition, two undergraduate
Communications students will assist with literature searches, interviewing users and staff, and
writing the information to be shared with the community. The team will use a variety of multimedia
venues to broadly disseminate the news, including podcasts, Facebook, and professional society
newsletters.
The ER working group will regularly share information, strategize plans, and coordinate activities to
communicate TeraGrid news and information. The ER team will continue to convey information
about TeraGrid to the national and international communities, via press releases, science impact
stories, EOT impact stories, news stories, and updates on TeraGrid resources and services. The
team will produce the Science Highlights publication to highlight science impact and will work with
the EOT team to produce the EOT Highlights publication. The ER team will continue to work closely
with the NSF OCI public information experts to ensure TeraGrid information is effectively
communicated.
The ER team will collaborate with the User Facing Projects team on the enhanced TeraGrid web
presence. The ER working group will utilize Web 2.0 and multimedia tools to dynamically disseminate TeraGrid news and information and to engage the 18- to 35-year-old demographic that uses online social networking tools and portal-based communication.
The ER team will continue to support TeraGrid involvement in professional society meetings,
including the annual TeraGrid and SC conferences, and help develop promotional pieces for use at
conferences and meetings. The team will document challenges and successful strategies in working
with the TeraGrid staff and the users to capture success stories, news and other information of value
for sharing with the community.
D.6.5 Collaborations and Partnerships
TeraGrid intends to remain a technology leader in the broader national and international
computational science community, regularly collaborating with universities and organizations – both
domestically and overseas – in advancing the state of the art in cyberinfrastructure. These
collaborations range from the Partnership for Advanced Computing in Europe (PRACE) and the
Distributed European Infrastructure for Supercomputing Applications (DEISA) to the Chinese
Academy of Science and the Universidad del CEMA in Buenos Aires.
During the TeraGrid Extension, we will collectively work to further develop current and identify new
domestic and international collaborations through TeraGrid users, participation in professional
society meetings, and recommendations from the SAB and elsewhere. In the US, for example,
TeraGrid will continue to extend its connections to OSG (§D.5.9.1). The TeraGrid and OSG
infrastructures both provide scientific users with access to a variety of resources using similar
infrastructures and services. TeraGrid users have access to NSF-funded HPC systems, but OSG
users normally only have access to less powerful, more widely distributed resources. Depending on
the application, some OSG users could benefit from using significantly more powerful, tightly
coupled clusters that are part of the pool of TeraGrid compute resources. Additionally, we know that
some TeraGrid users have components of their workflow that are better suited to a blend of OSG
and TeraGrid resources.
As an international example, the TeraGrid will continue to build on its partnership with DEISA and
advance the distributed, international use of both computational and data resources. DEISA has
adopted the TeraGrid’s Inca system for resource monitoring, and TeraGrid is collaborating on efforts
to have projects use both DEISA and TeraGrid resources, including the ability to co-schedule
resources across both organizations for large science users. Science applications serving as drivers
for these DEISA collaborations include climate research (with the Global Monitoring for Environment
and Security effort), the life sciences (with the Virtual Physiological Human project), and
astrophysics (with LIGO, GEO600, and the Sloan Digital Sky Survey).
D.7 Project Management and Leadership
D.7.1 Project and Financial Management
The Project Management Working Group (PM-WG) is responsible for building, tracking, and
reporting on the Integrated Project Plan (IPP), and for managing a change management process.
Central project management coordination provides tighter activity integration across the TeraGrid
partner sites. Building a single IPP for the project enhances cross-site collaboration and reduces
duplication of efforts. Tracking a single IPP provides for a more transparent view of progress.
Reporting against a single IPP significantly reduces the complexity of integrating many disparate
stand-alone RP reports. Managing a change process provides a visible and controlled method for
modifying the IPP.
Financial Management is the responsibility of the University of Chicago. Subaward management will
be straightforward since contracts are already in place from the current TeraGrid award.
D.7.2 Leadership
Ian Foster is Director of the Computation Institute, a joint institute of Argonne National Laboratory and The University of Chicago, and the Arthur Holly Compton Distinguished Service Professor of Computer Science at the University of Chicago. He has led computer science projects developing advanced distributed computing ("Grid") technologies, computational science efforts applying these tools to problems in areas ranging from the analysis of data from physics experiments to remote access to earthquake engineering facilities, and the Globus open source Grid software project. He has also been active in the Global Grid Forum, whose objective is to promote and develop Grid technologies and applications via the development and documentation of "best practices," implementation guidelines, and standards with an emphasis on rough consensus and running code.
Ian Foster: Center for Health Informatics (CHI), St. Johns Hospital (130309), $750k, 8/08-8/09; CIM-EARTH: A Community Integrated Model of Economic & Resource Trajectories for Humankind, MacArthur Foundation (08-92430-000), $350k, 7/08-6/10; ETF Grid Infrastructure Group: Providing System Management and Integration for the TeraGrid, OCI-0503697, $60M, 8/1/09-7/31/10; The caGrid Knowledge Center, NIH-OSU Res Fund (1X5290), $150k, 5/08-4/11; The caGrid Knowledge Center, NIH-OSU Res Fund (NCI28XS12), $64k, 9/08-05/09.
John Towns is Director of the Persistent Infrastructure Directorate at the National Center for
Supercomputing Applications (NCSA) at the University of Illinois. He is PI on the NCSA Resource
Provider/HPCOPS award for the TeraGrid, and serves as Chair of the TeraGrid Forum, which
provides overall leadership for the TeraGrid project. He has gained a broad view of computational science researchers and their needs through his key role in the policy development and implementation of the resource allocation processes of the TeraGrid and preceding NSF-funded
resources. He is co-PI on the Computational Chemistry Grid project led by the University of
Kentucky. His background is in computational astrophysics, making use of a variety of computational
architectures with a focus on application performance analysis. At NCSA, he provides leadership
and direction in the support of an array of computational science and engineering research projects
that use advanced resources. Towns plays significant roles in the deployment and operation of computational, data, and visualization resources, as well as in grid-related projects deploying technologies and services that support distributed computing infrastructure.
J. Towns: Leadership Class Scientific and Engineering Computing: Breaking Through the Limits,
OCI 07-25070, $208M, 10/07-10/12; NLANR/DAST, OCI 01-29681, $2.5M, 7/02-6/06; National
Computational Science Alliance, OCI 96-19019, $249.1M, 10/97-9/05; The TeraGrid:
Cyberinfrastructure for 21st Century Science and Engineering, SCI 01-22296 and SCI 03-32116,
$44.0M, 10/01- 9/05; Cyberinfrastructure in Support of Research: A New Imperative, OCI 04-38712,
$41.1M, 7/06-8/08; ETF Early Operations-NCSA, OCI 04-51538, $1.9M, 3/05-9/06; ETF Grid
Infrastructure Group (U of Chicago lead), OCI 05-03697, $14.1M, 9/05-2/10; TeraGrid Resource
Partner-NCSA, OCI 05-04064, $4.2M, 9/05-2/10; Empowering the TeraGrid Science and
Engineering Communities, OCI 05-25308, $17.8M, 10/07-9/08; Critical Services for
Cyberinfrastructure: Accounting, Authentication, Authorization and Accountability Services (U of
Chicago lead), OCI 07-42145, $479k, 10/07-9/08.
Matt Heinzel is the Deputy Director of the TeraGrid Grid Infrastructure Group (GIG) and Director of TeraGrid Operations at The University of Chicago Computation Institute. As Deputy Director, he is responsible for TeraGrid coordination, overall architecture, planning, software integration, and operations. The GIG manages operational processes and improvement projects. As Director of TeraGrid Operations, he manages a nationwide team that provides operational monitoring of all TeraGrid infrastructure services and also operates the TeraGrid help desk.
D.7.2.1 Other Senior Personnel
As described earlier in this proposal, the overall TeraGrid project is led by the TG
Forum membership which includes the RP and GIG PIs. This arrangement gives the RP and GIG
PIs equal decision-making influence in the project. Due to limitations on number of co-PIs on NSF
proposals, the RP PIs are included on this proposal as Senior Personnel.