TeraGrid Science Gateways, Virtual Organizations and Their Impact on Science
Nancy Wilkins-Diehr1, Dennis Gannon2, Gerhard Klimeck3, Scott Oster4, Sudhakar Pamidighantam5
1
San Diego Supercomputer Center, University of California at San Diego
2
Indiana University
3
Network for Computational Nanotechnology (NCN), Birck Nanotechnology Center, Purdue University
4
The Ohio State University
5
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
wilkinsn@sdsc.edu, gannon@cs.indiana.edu, gekco@purdue.edu, oster@bmi.osu.edu, spamidig@ncsa.uiuc.edu
ABSTRACT
Increasingly, scientists are designing new ways to present and share the tools they use to marshal the
digital resources needed to advance their work. The growth of the World Wide Web and increasingly
available digital data – through sensors, large-scale computation, and rapidly growing community codes –
continues to drive this transformation of science. Providing high-end resources such as those available
through the TeraGrid to scientists through community-designed interfaces enables far greater capabilities
while they work in familiar environments. This was the genesis of the Gateway program in 2004. This
paper describes the TeraGrid Science Gateway program and highlights four successful gateways: GridChem, Linked Environments for Atmospheric Discovery (LEAD), nanoHUB.org and the cancer
Biomedical Informatics Grid (caBIG).
Keywords: science gateways, Web portals, grid computing, cloud computing
1. Introduction
The TeraGrid project [ref TeraGrid] began in 2001 as the Distributed Terascale Facility (DTF).
Computers, visualization systems and data at four sites were linked through a dedicated 40-gigabit optical
network. Today the TeraGrid includes 25 platforms at 11 sites and provides access to over a
petaflop of computing power and petabytes of storage.
The TeraGrid has three primary focus areas – deep, wide and open. The goal of TeraGrid deep is to
support the most challenging computational science activities – those that couldn’t be achieved without
TeraGrid facilities. TeraGrid wide seeks to broaden the set of people who are able to make use of the
TeraGrid. The TeraGrid open component seeks compatibility with peer grids and information services
that allow development of programmatic interfaces to the TeraGrid.
This paper discusses the TeraGrid Science Gateways, a part of TeraGrid’s wide initiative. Our goal is to
motivate and explain the Gateway concept and describe services that have developed within TeraGrid to
support gateways. To illustrate these points we will describe four gateways - GridChem, Linked
Environments for Atmospheric Discovery (LEAD), nanoHUB.org and the cancer Biomedical Informatics
Grid (caBIG). The paper will conclude with lessons learned, recommendations for when gateways are
appropriate for a science community and some future directions the project will take.
2. TeraGrid Science Gateways
The TeraGrid Science Gateways program began in late 2004 with the recognition that scientists were
designing their own specialized user interfaces and tools to marshal digital resources for their work.
Much of this work was in response to the growth of the World Wide Web and increasingly available
digital data – both through sensors and large-scale computation. It had become clear to the designers of
the TeraGrid that providing scientists with high-end resources such as supercomputers and data archives,
through application-oriented interfaces designed by members of the science community, would offer
unprecedented capabilities to researchers not well versed in the arcane world of high-performance
computing. This was the genesis of the Gateway program.
Historically, access to high-end computing resources has been restricted to students who work directly
with a funded “principal investigator” who has been granted access to the resources. Gateways typically
operate through direct web access or through downloadable client programs. This greatly reduces the
barrier to entry and opens the door to exploration by a much wider set of researchers. The ramifications
of this access are profound. Not only can the best minds, regardless of location, be brought to bear on the
most challenging scientific problems but individuals with a completely new perspective can become
involved in problem solving. With an increasing set of problems requiring cross-disciplinary solutions,
gateways can clearly have a major impact on scientific discovery. Because students can also be involved
regardless of institutional affiliation, Gateways can increase participation among underrepresented groups
and contribute to workforce development.
One can characterize a Science Gateway as a framework of tools that allows the scientist to run
applications with little concern for where the computation actually takes place. This is closely related to
the concept of “cloud computing” in which applications are provided as web services that run on remote
resources in a manner that is not visible to the end-user. Furthermore a Science Gateway is usually more
than a collection of applications. Gateways often allow users to store, manage, catalog and share large
data collections or rapidly evolving novel applications that cannot be found anywhere else. The level of
cloud-like abstraction and virtualization varies from one Gateway to another, as different disciplines have
different expectations regarding the degree to which their users wish or need to deal with the underlying
TeraGrid resources. Usually a Gateway has a core team of supercomputing-savvy developers who build
and deploy the applications on the resources. These applications become the services that can be used by
the larger community. Much of what the TeraGrid Science Gateway team does is to provide tools and
services for these core scientific Gateway developers. Other science gateways focus more on the
usability of the services for a very broad user base and spend significant effort on the creation of friendly
graphical user interface technology and the smooth transition of data between various computational
resources. Delivering computing cycles rapidly and interactively, without wait times or any grid-specific
knowledge, has proven to be critical for these gateways.
3. TeraGrid Support for Gateways
Currently the TeraGrid provides back-end resources to 29 Gateways which span disciplines and
technology approaches. These gateways are developed independently by the various communities
specifically to meet defined needs. This is indeed what makes gateways so powerful. TeraGrid, however,
must develop scalable service solutions to meet the variety of needs that result from this decentralized
development.
Early in the Gateway program, TeraGrid worked in a focused way with ten individual gateways. Through
an in-depth survey, we developed an understanding of what common services were needed to meet the
needs of this very new user community in a scalable way. Commonalities emerged in the areas of Web
services, auditing, community accounts, scalable, stable, production-level middleware, flexible
allocations and CPU resource scheduling, and job meta-scheduling.
Some Gateway developers rely on the convenience provided via Web services and have requested such
capabilities from the TeraGrid. Version 4 of the Globus Toolkit [ref Globus], installed in production
across the TeraGrid in 2006, included Web service support for tasks such as job submission and data
transfer. Web services are used by approximately 25% of Gateways. Work continues to augment the
basic functionality provided by Globus with TeraGrid-specific information. For example, job submission
interfaces may be further developed to include data which could be used to answer questions such as
“Where can I run my 16-processor job that requires 20MB of memory the soonest? Where can I run an
18-hour, 8-processor job using Gaussian the soonest? Which sites support urgent computing for a 32-processor job?” and “Do any of these selections change if a 25GB input file needs to be transferred to
the site of interest?”
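As a sketch of how a gateway might use such augmented information, the following Python fragment ranks candidate sites by predicted queue wait. All field names, values and the ranking rule are invented for illustration and do not correspond to an actual TeraGrid information schema or service.

# Illustrative only: choose a site using hypothetical per-site metadata of the kind
# an augmented job-submission interface could expose (queue estimates, limits,
# installed software, urgent-computing support).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SiteInfo:
    name: str
    max_procs: int              # largest job the site's queues accept
    max_memory_mb: int          # per-job memory limit
    software: List[str]         # installed application packages
    est_wait_minutes: int       # predicted queue wait for a job of this shape
    supports_urgent: bool       # whether urgent-computing requests are honored

def best_site(sites: List[SiteInfo], procs: int, memory_mb: int,
              app: Optional[str] = None, urgent: bool = False) -> Optional[SiteInfo]:
    """Return the eligible site with the shortest predicted queue wait, or None."""
    eligible = [s for s in sites
                if s.max_procs >= procs
                and s.max_memory_mb >= memory_mb
                and (app is None or app in s.software)
                and (not urgent or s.supports_urgent)]
    return min(eligible, key=lambda s: s.est_wait_minutes, default=None)

sites = [
    SiteInfo("siteA", 1024, 32768, ["gaussian"], est_wait_minutes=45, supports_urgent=False),
    SiteInfo("siteB", 256, 16384, ["gaussian", "nwchem"], est_wait_minutes=10, supports_urgent=True),
]
# "Where can I run my 16-processor job that requires 20MB of memory the soonest?"
print(best_site(sites, procs=16, memory_mb=20, app="gaussian").name)  # siteB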
Since some gateways provide as well as consume Web services, we are actively working on a registry
where gateway developers can list and share services with each other as well as with potential TeraGrid
users. Today, researchers interested in making use of software on the TeraGrid check a catalog that lists
software available from the command line on all TeraGrid resource provider platforms. In the future, we
envision researchers and developers checking a registry of applications – programmatically or manually.
On the TeraGrid, Gateways are primarily enabled by a concept called community accounts. These
accounts allow the TeraGrid to delegate account management, accounting, certificate management and
user support directly to the Gateways. In order to delegate these tasks responsibly, however, TeraGrid
must provide management and accounting tools to Gateway developers and store additional attributes for
jobs submitted through community accounts. Some resource provider sites further secure these accounts
by using the community shell, commsh.
Early in the Gateway program, the Globus team developed an extension to Grid Resource Allocation and
Management (GRAM) called GRAM audit. This is a fundamental extension that allows Gateway
developers to retrieve the number of CPU hours consumed by a grid job after it has completed. With jobs
for many independent Gateway users being run from a single community account, such a capability is
quite important. Currently the TeraGrid team is developing GridShib for use by Gateway developers so
that attributes unique to a Gateway user can be stored when jobs are submitted to the TeraGrid. This will
provide the TeraGrid with a programmatic way to count the end users that Gateways bring to the TeraGrid,
and will also allow the TeraGrid to do per-user accounting for Gateways as needed.
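To illustrate the accounting problem these audit capabilities address, the following minimal Python sketch shows how a gateway running all jobs under one community account might attribute audited CPU hours back to its own end users. The data structures are invented stand-ins for whatever the gateway and the audit service actually record.

# Illustrative only: map audited per-job CPU hours back to individual gateway users.
from collections import defaultdict
from typing import Dict

# recorded by the gateway at submission time: grid job id -> gateway username
submitted_jobs: Dict[str, str] = {
    "job-001": "alice",
    "job-002": "bob",
    "job-003": "alice",
}

# retrieved after completion from a (hypothetical) audit source: job id -> CPU hours charged
audited_hours: Dict[str, float] = {"job-001": 12.5, "job-002": 3.0, "job-003": 40.0}

def per_user_usage(jobs: Dict[str, str], hours: Dict[str, float]) -> Dict[str, float]:
    """Sum charged CPU hours per gateway user across all community-account jobs."""
    usage: Dict[str, float] = defaultdict(float)
    for job_id, user in jobs.items():
        usage[user] += hours.get(job_id, 0.0)
    return dict(usage)

print(per_user_usage(submitted_jobs, audited_hours))  # {'alice': 52.5, 'bob': 3.0}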
Supporting a wide variety of gateways in an environment as extensive as the TeraGrid can be very
challenging and requires a real commitment to scalable infrastructure. For example in 2007 the Linked
Environments for Atmospheric Discovery (LEAD) gateway was used by students from X institutions in a
weather forecast competition. The nature of the competition led to concentrated bursts of job submissions
and data transfer requests. System stability during this period was not what we had hoped, and this led to an
extensive collaboration between Globus team members and TeraGrid staff to understand and address both
the hardware and software issues associated with handling the large load. In addition, when using shared
resources there are sometimes unscheduled outages. Programming methods used by Gateway developers
must be fault tolerant. As Gateway use of the TeraGrid expands, periods of high load will be even less
predictable and a very robust infrastructure must be deployed and failover models used by developers to
ensure stable services. Addressing these issues continues as a focus area.
Finally the TeraGrid has had to adapt its allocation and scheduling policies to meet the needs of
Gateways. Because usage cannot be planned as it can for a single investigator and his or her research
team, flexible allocation policies are needed. When requesting resources, principal investigators must be
able to describe in general terms how they expect their Gateway to be used, and the TeraGrid must be able
to react if a gateway is more successful than anticipated. Service interruptions must be avoided to ensure
the continuity and reliability that are so important to a successful Gateway. Gateways also tend to have
greater interactive needs than researchers working at the command line. Gateway developers have been
early testers in TeraGrid’s metascheduling efforts. Finally, the Special Priority and Urgent Computing
Environment (SPRUCE) has been developed within TeraGrid to meet urgent computing needs, where
simulations might be required immediately, for example in response to incoming sensor data.
This paper highlights several Gateways and includes treatment of their infrastructure, the scientific
capabilities available to the research community through the gateway and the impact they have had.
Featured Gateways include GridChem, LEAD, nanoHUB.org and the cancer Biomedical Informatics Grid
(caBIG).
4. Computational Chemistry Grid (GridChem)
Molecular sciences occupy a central position in research to understand material behavior. Most physical
phenomena from atmospheric and environmental effects to medicinal interactions within cells are
mediated through forces at the molecular and atomic level. Nano device modeling and material design
also involve atomic level detail. Chemical, biological and material modeling require access to unique and
coupled applications to produce accurate and detailed descriptions of atomic interactions. The insights
gained help researchers both to understand known phenomenology and to predict new phenomenology for
designing novel and appropriate materials and devices for a given task. Computational Chemistry has
evolved as a field of study now adopted across multiple disciplines to delineate interactions at a molecular
and atomic level and to describe system behavior at a higher scale. The need for integrative services to
drive molecular-level understanding of physical phenomena has now become acute as we enter an era of
rapid prototyping of nano scale actors, multi-discipline engagement with chemical sciences and diverse
and complex computational infrastructures. Automated computational molecular science services will
transform the many fields that depend on such information for routine research as well as advance fields
of societal significance such as safe and effective drug development and the development of sustainable
food and energy sources. A Science Gateway meets researchers’ needs for both integrative and
automated services.
The GridChem Science Gateway serves the computational chemistry community through support from
the Computational Chemistry Grid (CCG), funded under the NSF NMI Deployment grants starting in the
fall of 2004 [ref GridChem_1]. The goal of CCG and the GridChem Science Gateway is to bring national
high performance computational resources to practicing chemists through intuitively familiar native
interfaces. Currently the GridChem Science gateway integrates pre- and post- processing components
with the high performance applications and provides data management infrastructures.
CCG staff members include chemists, biologists and chemical physicists who design interfaces and
integrate applications, programming experts who layer user-friendly interfaces atop the domain
requirements, and high-performance computing experts who coordinate the usage of the integrated
services and resources by the community at large. A three-tier architecture, depicted in [Figure
GridChem] below, consists of a Client Portal, which presents an intuitive interface for data and for pre-
and post-processing components; a Grid Middleware Server, which coordinates communication between
the client and the deployed applications through web services; and applications deployed on the
high-performance resources. The software architecture is supported by a consulting portal for user issues
and an education and outreach effort that disseminates the features and usability of the gateway and
engages the community as its needs change.
GridChem Client Portal
The GridChem Client is a Java application that can either be launched from the GridChem web pages or
downloaded to the local desktop. This serves as the portal to the Computational Chemistry Grid
infrastructure and consists of authentication, graphical user interfaces for molecular building, input
generation for various applications, job submission and monitoring, and post-processing modules. The
authentication is hierarchical to serve the “Community User” as well as individual users with independent
allocations at various HPC resources. The “Community User” is an individual user whose CPU time on
both TeraGrid and CCG partner resources is completely managed by CCG. The MyProxy credential
repository is used to manage credential delegation. Proxy credentials are used for various services where
authentication is required. The Job Submission interface provides multiple ways of generating the input
files, a resource discovery module with dynamic information, and job-requirement definition options.
Multiple independent jobs can be created using this infrastructure, and jobs with diverse requirements can
be launched simultaneously. The client also provides a MyCCG module, which is central to job-centric
monitoring, post-processing and potential resubmission mechanisms. In addition to job-specific
monitoring, MyCCG provides status and usage data for individual users or for all members of a group
under a Principal Investigator.
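The hierarchical authentication described above can be summarized with a small sketch, written in Python rather than the Java of the actual client. The class, function and account names are invented, and the MyProxy interaction is reduced to a placeholder; this is a reading of the described behavior, not GridChem code.

# Illustrative only: community user versus individually allocated user at login time.
from dataclasses import dataclass

@dataclass
class Session:
    username: str
    proxy: bytes          # short-lived delegated credential obtained via MyProxy
    charged_account: str  # allocation against which jobs will be charged

def fetch_proxy_from_myproxy(username: str, passphrase: str) -> bytes:
    # Placeholder: the real client retrieves a delegated proxy credential from the
    # MyProxy repository; here we just return a dummy value.
    return b"dummy-proxy-credential"

def login(username: str, passphrase: str, has_own_allocation: bool) -> Session:
    proxy = fetch_proxy_from_myproxy(username, passphrase)
    if has_own_allocation:
        # Individual user: jobs are charged to that user's own HPC allocations.
        return Session(username, proxy, charged_account=username)
    # "Community User": CPU time on TeraGrid and CCG resources is managed by CCG.
    return Session(username, proxy, charged_account="ccg_community")

print(login("jdoe", "secret", has_own_allocation=False).charged_account)  # ccg_community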
GridChem Middleware Services
The GridChem Middleware Services provide the service architecture for the GridChem client. The
services use Globus-based Web Services to integrate applications and hardware. They provide application
input specification requirements with default input parameters, queue information, and a simple
metascheduling capability based on a “deadline prediction” module supported by the Network Weather
Service (NWS) [ref nws]. The Middleware Services allow MyCCG to monitor jobs, provide individual and
group usage data, and support dynamic data ingestion with metadata into information repositories.
Hardware Resources and Application Integration
Several popular applications are deployed and supported by HPC staff at resource provider sites. The
GridChem Science Gateway leverages such deployments and abstracts the information needed for their
discovery into the GridChem Web Services database. GridChem tests restart or checkpoint capabilities,
where available, and periodically updates software as needed. Application software access is controlled at
an individual user level for an individual resource.
[Figure GridChem] The CCG virtual organization for the GridChem Science Gateway: the Client Portal
and Grid Middleware services sit atop hardware resources drawn from the TeraGrid consortium, the Open
Science Grid and non-TeraGrid CCG systems, and are supported by development and deployment,
application integration and validation, dissemination, and user services (allocation management,
consulting and troubleshooting, help book, training materials, user surveys, presentations, the TG Student
Challenge, and outreach).
The GridChem Science Gateway is currently in production, used by a community of about 300
researchers who consume about 80,000 CPU hours per quarter. In the last year and a half, at least 15
publications resulting from this usage have appeared in various reputable journals [ref GridChemScience]. As we
deploy advanced features such as meta-scheduling services and integrate material and biological
modeling applications we expect a rapid growth in usage in the coming years. The resulting scientific data
can be mined by the community as a whole. The development of automated metadata collection services
for such an information archive will enhance these data mining capabilities.
5. Linked Environments for Atmospheric Discovery (LEAD)
The National Science Foundation (NSF), as part of the Large Information Technology Research (ITR)
program, started the Linked Environments for Atmospheric Discovery Project (LEAD) in September
2003. The goal of the project is to fundamentally change mesoscale meteorology (the study of severe
weather events like tornados and hurricanes) from the current practice of static observations and forecasts
to a science of prediction that responds dynamically and adaptively to rapidly changing weather.
Currently, we cannot predict tornados. However, we can spot conditions that may make them more likely
and, when a tornado is spotted, we can track it on radar. Current forecast models are very coarse and they
run on fixed time schedules. To make progress, we must change the basic paradigm. More specifically,
the LEAD cyberinfrastructure is designed to allow any meteorologist to grab the current weather data and
create a specialized high-resolution weather forecast on-demand. In addition, the system provides
scientists with the ability to create new prediction models using tools that allow the exploration of past
weather events and forecasts. Finally, as described in [ref casa-lead], a major goal of the project is to find
ways in which the weather forecast system can adaptively interact with the measuring instrumentation in
a “closed loop” system.
Weather forecasts are, in fact, rather complex workflows that take data from observational sensors, filter
and normalize it, then assimilate it into a coherent set of initial and boundary conditions, then feed that
data to a forecast simulation and finally to various post-processing steps. In all, a typical forecast
workflow may require 7 to 10 data processing or simulation steps, each of which requires moving several
gigabytes of data and the execution of a large program.
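The shape of such a workflow can be conveyed with a short sketch. The step names follow the prose above (ingest, filter/normalize, assimilate, simulate, post-process); real LEAD workflows are composed and executed by a workflow engine over Web services, with each step staging gigabytes of data and launching jobs on TeraGrid resources, so the Python pipeline below is purely illustrative.

# Illustrative only: a forecast workflow as an ordered pipeline of steps.
from typing import Callable, Dict, List

def ingest_observations(ctx: Dict) -> Dict:
    ctx["observations"] = "raw radar, surface and upper-air data"
    return ctx

def filter_and_normalize(ctx: Dict) -> Dict:
    ctx["clean_obs"] = f"quality-controlled {ctx['observations']}"
    return ctx

def assimilate(ctx: Dict) -> Dict:
    ctx["initial_conditions"] = "coherent initial and boundary conditions"
    return ctx

def run_forecast_model(ctx: Dict) -> Dict:
    ctx["forecast"] = f"high-resolution forecast over {ctx['region']}"
    return ctx

def postprocess(ctx: Dict) -> Dict:
    ctx["products"] = "plots, derived fields and verification statistics"
    return ctx

FORECAST_PIPELINE: List[Callable[[Dict], Dict]] = [
    ingest_observations, filter_and_normalize, assimilate, run_forecast_model, postprocess,
]

def run_workflow(region: str) -> Dict:
    """Run each step in order; in LEAD each step is a service invocation, not a local call."""
    ctx: Dict = {"region": region}
    for step in FORECAST_PIPELINE:
        ctx = step(ctx)
    return ctx

print(run_workflow("a region centered on Oklahoma")["forecast"])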
To accomplish these goals, the LEAD team built a Web-service-based cyberinfrastructure with a Science
Gateway Portal on the front end and the TeraGrid on the back end for computation, data analysis
and storage (see [Figure LEAD]). The Gateway is designed to support a wide variety of meteorology
users. On one end the core audience is the mesoscale researchers who are keenly interested in developing
the new techniques for making accurate on-demand forecasts. On the other end are the high school and
college students taking classes on weather modeling and prediction. Both groups have significant
requirements that impact the design of the system.
The meteorology researchers need the ability to
1. configure forecast workflow parameters including the geographic region, the mesh spacing and
various physics parameters and then launch the workflow on-demand.
2. modify the workflow itself, by replacing a standard data analysis or simulation component with
an experimental version. Or, if necessary, the scientist may wish to create a new workflow from
scratch. (In [ref eScienceWF], we describe the growing importance of workflow technology in
contemporary e-Science.)
3. treat each forecast workflow execution as a computational experiment, whose metadata is
automatically cataloged. This allows the scientist to return at a later time and discover and
compare all experiments with similar parameter settings, or to retrieve the complete provenance of
every data object (a sketch of this kind of catalog query appears after these lists).
4. push a critical forecast through as an “urgent project”. This involves preempting running
programs that are not critical in order to do a forecast about a storm that may be very damaging.
The meteorology/atmospheric science student needs
1. intuitive and easy to use tools for hands-on forecast creation and analysis.
2. mechanisms for students to work in groups and share experiments and data.
3. on-line education modules that direct the student through different experiments they can do with
the tools and data available through the gateway.
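Referring back to researcher requirement 3, the fragment below sketches the kind of experiment-catalog query it implies. The catalog here is a plain list of metadata records with invented field names; the real MyLEAD metadata service and its query interface are not reproduced.

# Illustrative only: find cataloged experiments with matching parameter settings.
from typing import Dict, List

catalog: List[Dict] = [
    {"experiment": "run-017", "mesh_km": 2.0, "region": "Oklahoma", "microphysics": "scheme-A"},
    {"experiment": "run-018", "mesh_km": 2.0, "region": "Oklahoma", "microphysics": "scheme-B"},
    {"experiment": "run-019", "mesh_km": 4.0, "region": "Kansas", "microphysics": "scheme-A"},
]

def find_similar(records: List[Dict], **settings) -> List[Dict]:
    """Return all cataloged experiments whose metadata match the given parameter settings."""
    return [rec for rec in records if all(rec.get(k) == v for k, v in settings.items())]

# All experiments run at 2 km mesh spacing over Oklahoma:
print([rec["experiment"] for rec in find_similar(catalog, mesh_km=2.0, region="Oklahoma")])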
The LEAD gateway is currently in use by the scientists on the LEAD research team. Most recently it has
played a role in the spring 2008 tornado forecast experiments through the NOAA National Center for
Environmental Prediction. In the spring of 2007 the LEAD gateway was used by student teams
participating in the National Weather Challenge competition.
[Figure LEAD]. The Web-Service architecture for the LEAD Science gateway is built from a set of core
persistent web services that communicate both directly and through an event notification bus. Key
components include a data subsystem (the MyLEAD Agent and User Metadata catalog) for cataloging the
user’s experimental results, a workflow execution engine that orchestrates the user’s data analysis
and forecast simulations, an application factory that manages the transient application services which
control individual job executions on the back-end TeraGrid computers, and a Fault Tolerance and Scheduling
system to handle failures in the workflow execution.
The LEAD Gateway is the result of a dedicated team of researchers including Suresh Marru, Beth Plale,
Marcus Christie, Dennis Gannon, Gopi Kandaswamy, Felix Terkhorn, Sangmi Lee Pallickara and
students Scott Jensen, Yiming Sun, Eran Chinthaka, Heejoon Chae, You-Wei Cheah, Chathura Herath, Yi
Huang, Thilina Gunarathne, Srinath Perera, Satoshi Shirasuna, Yogesh Simmhan and Alek Slominski.
6. nanoHUB.org
The NSF-funded NCN was founded in 2002 on the premise that computation is underutilized in vast
communities of experimentalists and educators. From seven years of experience (1995-2002) delivering
online simulations to about 1,000 annual users, we had seen that cyberinfrastructure can lower barriers
and lead to the pervasive use of simulation. The NCN now operates nanoHUB.org as an international
cyber-resource for nanotechnology: in the past year it experienced less than 3 days of downtime and served
over 62,000 users in 172 countries, 6,200 of whom ran over 300,000 simulations without ever submitting
grid certificates, mpirun commands or the like [ref nanoHUB-stats].
The primary nanoHUB target audience is not the small group of computational scientists who are experts
in computing technologies like HPC, MPI, grids, scheduling and visualization, but researchers and
educators who know nothing about these details. Since users may not even have permission to install
software, everything should happen through a web browser. Numerical “What if?” experiments must be
set up and data must be explored interactively without downloads. Any download or upload must happen
with a button click, without any grid awareness. These requirements precluded the use of typical
technology, and NCN created a new infrastructure. The impact of interactive simulation is evident in the
annualized user graph: in June 2005 we began to convert Web-form applications to full interactivity and
increased the annual simulation user numbers six-fold. Web forms were retired in April 2007.
The dissemination of the latest tools demands rapid deployment with minimal investment. This
imposed another critical constraint – most nano application developers know nothing about Web/grid
services and have no incentive to learn them. Therefore NCN developed Rappture, which manages
input/output for tools written in C, Fortran, Matlab, Perl, Tcl, Python, or Ruby, or defined workflow
combinations thereof, without any software rewrites, and creates graphical user interfaces automatically.
Transferring a code from a research team to a Web team might require years of development to deploy a
single tool. Instead we left the tools in the hands of the researchers and enabled them to develop and
deploy over 89 simulation tools in less than three years. Our nanoFORGE.org now supports over 200
software projects.
nanoHUB’s new middleware, Maxwell, is a scalable, stable, and testable production-level system that
manages the delivery of the nanoHUB applications transparently to the end user’s browser through virtual
machines, virtual clusters, and grid computing environments.
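To convey the Rappture pattern concretely, here is a minimal sketch of a tool wrapper in Python: the tool reads its inputs from a driver file handed to it by the automatically generated GUI, computes, and writes outputs back for rendering in the browser. The element paths and the toy computation are invented, and the Rappture calls shown (library, get, put, result) are our reading of the Python binding rather than an excerpt from any nanoHUB tool.

# Illustrative only: a Rappture-style tool wrapper (assumed API, invented paths).
import sys
import Rappture  # provided by the Rappture runtime on the hub

driver = Rappture.library(sys.argv[1])               # driver XML written by the generated GUI
temperature = float(driver.get('input.number(temperature).current'))

# Toy "simulation": thermal energy (3/2) k_B T in electron volts.
energy_ev = 1.5 * 8.617e-5 * temperature

driver.put('output.number(energy).about.label', 'Thermal energy (eV)')
driver.put('output.number(energy).current', str(energy_ev))
Rappture.result(driver)                              # hand the results back for display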
nanoHUB had to deliver online simulation “and more” to support the community with tutorials, research
seminars, and a collaboration environment. The community must be enabled to rate and comment on the
content in a Web 2.0 approach and enabled to upload content without administrator interference. With
over 970 “…and more” content items and over 60,000 annual users, nanoHUB has become a Web
publisher.
[Figure nanoHUB]: (a) Development of annual nanoHUB simulation users over time. A dramatic
increase in simulation users can be observed coincident with the introduction of interactive simulation
tools. (b) Interactive x-y simulation data window for the Quantum Dot Lab powered by NEMO 3D.
Users can compare data sets from different simulations (here, for example, quantum dot size variations) by
clicking on the data selector, zoom in and read off data by pointing at it, and download data with a single
click (green arrow). (c) Volume rendered 3D wavefunction of the 10th electron state in a pyramidal
quantum dot in the Quantum Dot Lab. Users can interactively rotate, change isosurfaces, insert cutplanes,
all hardware rendered on a remote server.
12 years of online simulation usage data document that the number of tool users is inversely proportional
to the tool execution time. Most tools should therefore deliver results in seconds, which implies that they
need to be computationally light-weight. Selected capabilities are shown in [Figure nanoHUB]. Once
users are satisfied with the general trends, they might consider running finer models that might take tens
of minutes, the lunch hour, or overnight. This is the opportunity for Grid computing to deliver the needed
computing power rapidly and reliably. nanoHUB currently delivers 6 tools that are powered by the
TeraGrid and the Open Science Grid. 145 users ran >2,800 jobs that consumed >7,500 CPU hours with a
total of >22,000 hours wall time.
nanoHUB impact can be demonstrated by metrics such as its use in over 40 undergraduate and graduate
classes at over 18 U.S. institutions in the 07/08 academic year [ref cite-nanoHUB]. 261 citations refer to
nanoHUB in the scientific literature with about 60% stemming from non-NCN affiliated authors. Even
applications that might be considered “educational” by some are used in research work published in high
quality journals. There are 21 papers that describe nanoHUB use in education. nanoHUB diminishes
the distinction between research and educational use. nanoHUB services cross-fertilize various nano
subdomains, and the site is broadening its impact to new communities.
The NCN is now assembling the overall software, consisting of a customized content management system,
Rappture, and Maxwell into the HUB0 package. HUB0 already powers 4 new HUBs and over a dozen
more are in the pipeline (see HUBzero.org).
7. cancer Biomedical Informatics Grid (caBIG)
In February 2004, the National Cancer Institute (NCI) launched the caBIG™ (cancer Biomedical
Informatics Grid™)[ref cabig-1] initiative to speed research discoveries and improve patient outcomes by
linking researchers, physicians, and patients throughout the cancer community. Envisioned as a World
Wide Web for cancer research, the program involves a diverse collection of nearly 1,000
individuals and over 80 organizations including federal health agencies, industry partners, and more than
50 NCI-designated comprehensive cancer centers. Driven by the complexity of cancer and recent
advances in biomedical technologies capable of generating a wealth of relevant data, the program
encourages and facilitates the multi-institutional sharing and integration of diverse data sets, enabled by a
grid-based platform of semantically interoperable services, applications, and data sources [ref cabig-2].
The underlying grid infrastructure of caBIG™ is caGrid [ref cabig-3], a Globus 4-based middleware
that provides the implementation of the required core services, toolkits and wizards for the
development and deployment of community-provided services, and programming interfaces for building
client applications. Since its first prototype release in late 2004, caGrid has undergone a number of
community-driven evolutions to its current form that allows for new services and applications, which
leverage the wealth of data and metadata in caBIG™, to be rapidly created and used by members of the
community without any experience in grid middleware.
Once the foundation for semantic interoperability was established and a number of data sources were
developed, attention to the processing of vast amounts of data (some of it large) became more of a priority
to the caBIG™ community. While a plethora of software and infrastructure exists to address this
problem, there are a few impediments to leveraging them in the caBIG™ environment. The first is that
the majority of the participants in the program do not generally have access to the High Performance
Computing (HPC) hardware necessary for such compute or data intensive analysis, and the program’s
voluntary, community-driven, and federated nature precludes such centralized large-scale resources.
Secondly, even given access to such resources, the users who would utilize them generally
do not have the experience necessary to leverage the tools of the trade commonly associated with such
access (such as low level grid programming or executable-oriented batch scheduling). That is, a large
emphasis of caGrid is making the grid more accessible, allowing the developer community to focus on
creating science-oriented applications aimed at cancer researchers. Confounding this issue is the fact that
services in the caBIG™ environment present an information model view of the data over which they
operate. This information model provides the foundation for semantic annotations drawn from a shared
controlled terminology which, when exposed as metadata of a grid service, enable the powerful
programmatic discovery and inference capabilities key to the caGrid approach. This model is in stark
contrast to the traditionally file- and job-oriented views provided by HPC infrastructure.
The TeraGrid Science Gateway program provides a convenient paradigm to address both these issues.
Firstly, it offers an organization and corresponding policies by which communities of science-oriented
users can gain access to the wealth of HPC infrastructure provided by the TeraGrid, under the umbrella of a
common goal (as opposed to each researcher being responsible for obtaining TeraGrid allocations).
Secondly, it provides a pattern by which such infrastructure can be abstracted away from the presentation
of said resources to the consuming researcher. The most common manifestation of this pattern is the use
of a web application, or portal, which provides researchers a science-oriented interface, enabled by the
backend HPC resources of TeraGrid. However, in caBIG™ the researcher community uses applications
(which may be web portals or desktop applications) that leverage the grid of caBIG™ in the form of a
variety of grid services. In this way, a caGrid Science Gateway must be a grid service which can bridge
the TeraGrid and caBIG grids by virtualizing access to TeraGrid resources into a scientifically-oriented
caGrid service. The users of the gateway service may be completely unaware of its underlying
implementation’s use of TeraGrid, in much the same way as a traditional Science Gateway portal user
may not be aware of or concerned with such issues as TeraGrid accounts, allocations, or resource-providing
service interfaces (such as the Grid Resource Allocation and Management (GRAM) service, for example).
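As a rough sketch of this bridging pattern (not the actual caGrid gateway, which is a Globus-based Java grid service with strongly-typed, semantically annotated interfaces), the fragment below shows a domain-level operation whose implementation hides the TeraGrid job submission from the caller; every class and method name is invented.

# Illustrative only: a domain-oriented service that internally runs its analysis on TeraGrid.
from dataclasses import dataclass
from typing import List

@dataclass
class ExpressionMatrix:              # domain object the caller understands
    genes: List[str]
    samples: List[str]
    values: List[List[float]]

@dataclass
class ClusterResult:                 # domain object handed back to the caller
    dendrogram: str

class ClusterAnalysisGateway:
    """Domain-level operation; all TeraGrid details stay inside the implementation."""

    def cluster(self, data: ExpressionMatrix) -> ClusterResult:
        job_id = self._submit_to_teragrid(data)   # staging, job submission, community account
        return self._collect_result(job_id)       # fetch output and map it to a domain object

    def _submit_to_teragrid(self, data: ExpressionMatrix) -> str:
        # Placeholder for staging the matrix and submitting the analysis job.
        return "tg-job-0001"

    def _collect_result(self, job_id: str) -> ClusterResult:
        # Placeholder for polling the job and parsing its output files.
        return ClusterResult(dendrogram=f"(result of {job_id})")

matrix = ExpressionMatrix(genes=["g1", "g2"], samples=["s1"], values=[[0.4], [1.2]])
print(ClusterAnalysisGateway().cluster(matrix).dendrogram)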
In our initial reference implementation, we provided a gateway grid service which performed hierarchical
cluster analysis of genomic data [ref cabig-4]. This routine was selected as it was an existing analysis
routine available to the caBIG™ community with only modest computational needs; we aimed to
concentrate our efforts in understanding the policies, processes, and technologies necessary to reproduce
this gateway approach for numerous other routines and problem domains of caBIG™. We were able to
successfully demonstrate the utility of this approach, by utilizing the created gateway service in the
existing geWorkbench desktop application to perform an analysis and visualize the results. Beyond the
actual software created from this effort, a set of best practices and guidance documents were created to
better inform future such efforts.
The initial prototype implementation of a caGrid Science Gateway service was extremely well received
by the caBIG™ community and laid the groundwork, in terms of process and implementation pattern, for a
large number of additional such gateway services to be developed for other problem domains. However,
in order to achieve this goal, additional tooling will likely need to be created to simplify the process further.
This investment is likely worth the effort, as allowing caBIG™ application developers to continue to
work against a common environment of strongly-typed, semantically described grid services, yet harness
the computing power of TeraGrid is a powerful notion given the large quantities of data and analysis
routines expected to be developed in caBIG™.
8. Conclusions
A number of lessons have been learned in the first phase of Science Gateway construction, testing and use
and we list several of the most significant ones here. While the impact on a scientific field can be
tremendous, one major lesson is that the effort required to build a gateway from scratch can also be very large. It
requires a substantial team of both cyberinfrastructure specialists and domain scientists to design and
construct the core software and services. Consequently, the TeraGrid Gateway program has focused on
gathering and deploying common Gateway building blocks, such as portal frameworks, security tools
(certificate repositories and auditing tools), common services, re-usable workflow engines, application
“wrapping” and deployment tools and directory services. A reasonable goal for the Gateway program
should be to provide any team of dedicated domain experts with a toolkit, training documentation, and
technical assistance that would make it possible for them to build and populate a gateway in a week.
While we are not there yet, we feel it is within reach. The HUBzero framework that powers
nanoHUB.org is currently being deployed for 5 other HUBs and will be available as open source in 2009.
Traditionally, computational scientists have been very patient and persistent, willing to deal with complex
hardware and software systems that may have high failure rates. Because supercomputers still use batch
queues, the user must sometimes wait for hours to see if the latest changes to a runtime script will result
in a successful execution without any intermediate feedback. On the other hand, users of Science
Gateways expect nothing less than the same ease of use, on-demand service and reliability that they have
grown to expect from commercial portals that cater to on-line financial transactions or information
retrieval. The key emphasis here must be placed on true usability. A gateway that requires
system-administrator-level knowledge to perform even simple operations cannot spread science into new user
bases. User-friendly and stable gateways therefore have the opportunity to open HPC modeling
capabilities to a broad user base that would have never been reached by UNIX-based TeraGrid access.
TeraGrid has had to devote a substantial effort to making its grid infrastructure reliable and scalable. This
has been especially challenging because a grid of supercomputers is not the same as a commercial data
center that has been designed from the ground up to support very large numbers of interactive remote
users.
Finally, as the examples in this paper illustrate, the most important challenge of building a good science
gateway is to understand where the gateway can truly help solve problems in the targeted domain. It is
frequently the case that the greatest benefit to the users is to provide easy access to data and the tools that can
search, filter and mine it. However, to make a truly significant impact, the science gateway must provide
more capability than a user can find on their desktop machine. Yet the gateway must be as easy to operate
as any desktop software without any grid or HPC-specific knowledge - the gateway must be an extension
of the desktop to a more powerful and connected domain. In the future we expect gateways will become
integrated with social networking technologies that can aid students and enable a community of scientists
to share tools and ideas.
Gateway usage can be monitored relatively easily in terms of the number of users, simulation runs, CPU
time consumed, papers published and so on; see, for example, [ref nanoHUB-stats] or [ref GridChemScience].
However, true impact on science and education should be formally measured through classroom usage,
citations in the literature, and user case studies, metrics that are more difficult to gather.
9. References
[ref TeraGrid] www.teragrid.org
[ref Globus] www.globus.org
[ref GridChem_1] Rion Dooley, Kent Milfeld, Chona Guiang, Sudhakar Pamidighantam, Gabrielle Allen,
From Proposal to Production: Lessons Learned Developing the Computational Chemistry Grid
Cyberinfrastructure, Journal of Grid Computing, 2006, 4, 195 –208.
[ref nws] Supported by the NSF NMI Program under Award #04-38312, Oct. 2005- Oct. 2007,
http://nws.cs.ucsb.edu/ewiki/
[ref GridChemScience] https://www.gridchem.org/papers/index.shtml
[ref casa-lead] Beth Plale, Dennis Gannon, Jerry Brotzge, Kelvin Droegemeier, Jim Kurose, David
McLaughlin, Robert Wilhelmson, Sara Graves, Mohan Ramamurthy, Richard D. Clark, Sepi Yalda,
Daniel A. Reed, Everette Joseph, V. Chandrasekar, CASA and LEAD: Adaptive Cyberinfrastructure for
Real-Time Multiscale Weather Forecasting, November 2006, IEEE Computer, Volume 39 Issue 11.
[ref eScienceWF] Yolanda Gil, Ewa Deelman, Mark Ellisman, Thomas Fahringer, Geoffrey Fox, Dennis
Gannon, Carole Goble, Miron Livny, Luc Moreau, Jim Myers, Examining the Challenges of Scientific
Workflows, December 2007, IEEE Computer, Volume 40, No. 12. pp. 24-32
[ref nanoHUB-stats] Almost all nanoHUB user statistics are openly available at
http://nanoHUB.org/usage .
[ref cite-nanoHUB] Gerhard Klimeck, Michael McLennan, Sean B. Brophy, George B. Adams III, Mark
S. Lundstrom, "nanoHUB.org: Advancing Education and Research in Nanotechnology", accepted in
Computing in Science and Engineering (CiSE), Special Issue on Education.
[ref cabig-1] http://cabig.cancer.gov/
[ref cabig-2] Cyberinfrastructure: Empowering a "Third Way" in Biomedical Research. Kenneth H.
Buetow (6 May 2005) Science 308 (5723), 821. [DOI: 10.1126/science.1112120]
[ref cabig-3] Scott Oster, Stephen Langella, Shannon L. Hastings, David W. Ervin, Ravi Madduri, Joshua
Phillips, Tahsin M. Kurc, Frank Siebenlist, Peter A. Covitz, Krishnakant Shanbhag, Ian Foster, Joel H.
Saltz, "caGrid 1.0: An Enterprise Grid Infrastructure for Biomedical Research", Journal of the
American Medical Informatics Association, 2007 15 : pp. 138-149.
[ref cabig-4] “Integrating caGrid and TeraGrid”, Christine Hung,
http://www.tacc.utexas.edu/tg08/index.php?m_b_c=accepted