TeraGrid Science Gateways, Virtual Organizations and Their Impact on Science

Nancy Wilkins-Diehr1, Dennis Gannon2, Gerhard Klimeck3, Scott Oster4, Sudhakar Pamidighantam5
1 San Diego Supercomputer Center, University of California at San Diego
2 Indiana University
3 Network for Computational Nanotechnology (NCN), Birck Nanotechnology Center, Purdue University
4 The Ohio State University
5 National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
wilkinsn@sdsc.edu, gannon@cs.indiana.edu, gekco@purdue.edu, oster@bmi.osu.edu, spamidig@ncsa.uiuc.edu

ABSTRACT
Increasingly, scientists are designing new ways to present and share the tools they use to marshal the digital resources needed to advance their work. The growth of the World Wide Web and the increasing availability of digital data – through sensors, large-scale computation, and rapidly growing community codes – continue to drive this transformation of science. Providing high-end resources such as those available through the TeraGrid to scientists through community-designed interfaces enables far greater capabilities while letting researchers work in familiar environments. This was the genesis of the Gateway program in 2004. This paper describes the TeraGrid Science Gateway program and highlights four successful gateways: GridChem, Linked Environments for Atmospheric Discovery (LEAD), nanoHUB.org and the cancer Biomedical Informatics Grid (caBIG).

Keywords: science gateways, Web portals, grid computing, cloud computing

1. Introduction
The TeraGrid project [ref TeraGrid] began in 2001 as the Distributed Terascale Facility (DTF). Computers, visualization systems and data at four sites were linked through a dedicated 40-gigabit optical network. Today the TeraGrid includes 25 platforms at 11 sites and provides access to over a petaflop of computing power and petabytes of storage. The TeraGrid has three primary focus areas – deep, wide and open. The goal of TeraGrid deep is to support the most challenging computational science activities – those that could not be achieved without TeraGrid facilities. TeraGrid wide seeks to broaden the set of people who are able to make use of the TeraGrid. The TeraGrid open component seeks compatibility with peer grids and information services that allow development of programmatic interfaces to the TeraGrid. This paper discusses the TeraGrid Science Gateways, part of TeraGrid's wide initiative. Our goal is to motivate and explain the Gateway concept and to describe the services that have been developed within TeraGrid to support gateways. To illustrate these points we describe four gateways: GridChem, Linked Environments for Atmospheric Discovery (LEAD), nanoHUB.org and the cancer Biomedical Informatics Grid (caBIG). The paper concludes with lessons learned, recommendations for when gateways are appropriate for a science community, and some future directions the project will take.

2. TeraGrid Science Gateways
The TeraGrid Science Gateways program began in late 2004 with the recognition that scientists were designing their own specialized user interfaces and tools to marshal digital resources for their work. Much of this work was in response to the growth of the World Wide Web and increasingly available digital data, both through sensors and large-scale computation.
It had become clear to the designers of the TeraGrid that providing scientists with high-end resources such as supercomputers and data archives through application-oriented interfaces designed by members of the science community would offer unprecedented capabilities to researchers not well versed in the arcane world of high performance computing. This was the genesis of the Gateway program. Historically, access to high-end computing resources has been restricted to those who work directly with a funded "principal investigator" who has been granted access to the resources. Gateways typically operate through direct web access or through downloadable client programs. This greatly reduces the barrier to entry and opens the door to exploration by a much wider set of researchers. The ramifications of this access are profound. Not only can the best minds, regardless of location, be brought to bear on the most challenging scientific problems, but individuals with a completely new perspective can become involved in problem solving. With an increasing set of problems requiring cross-disciplinary solutions, gateways can clearly have a major impact on scientific discovery. Because students can also be involved regardless of institutional affiliation, gateways can increase participation among underrepresented groups and contribute to workforce development.

One can characterize a Science Gateway as a framework of tools that allows the scientist to run applications with little concern for where the computation actually takes place. This is closely related to the concept of "cloud computing," in which applications are provided as web services that run on remote resources in a manner that is not visible to the end user. Furthermore, a Science Gateway is usually more than a collection of applications. Gateways often allow users to store, manage, catalog and share large data collections or rapidly evolving novel applications that cannot be found anywhere else. The level of cloud-like abstraction and virtualization varies from one Gateway to another, as different disciplines have different expectations regarding the degree to which their users wish or need to deal with the underlying TeraGrid resources. Usually a Gateway has a core team of supercomputing-savvy developers who build and deploy the applications on the resources. These applications become the services that can be used by the larger community. Much of what the TeraGrid Science Gateway team does is provide tools and services for these core scientific Gateway developers. Other science gateways focus more on the usability of the services for a very broad user base and spend significant effort on friendly graphical user interface technology and the smooth movement of data between computational resources. For these gateways, delivering computing cycles rapidly and interactively, without wait times or any grid-specific knowledge on the user's part, has proven to be critical.

3. TeraGrid Support for Gateways
Currently the TeraGrid provides back-end resources to 29 gateways, which span disciplines and technology approaches. These gateways are developed independently by the various communities specifically to meet defined needs. This is indeed what makes gateways so powerful. TeraGrid, however, must develop scalable service solutions to meet the variety of needs that result from this decentralized development. Early in the Gateway program, TeraGrid worked in a focused way with ten individual gateways.
Through an in-depth survey, we developed an understanding of what common services were needed to meet the needs of this very new user community in a scalable way. Commonalities emerged in the areas of Web services; auditing; community accounts; scalable, stable, production-level middleware; flexible allocations and CPU resource scheduling; and job metascheduling.

Some Gateway developers rely on the convenience provided by Web services and have requested such capabilities from the TeraGrid. Version 4 of the Globus Toolkit [ref Globus], installed in production across the TeraGrid in 2006, included Web service support for tasks such as job submission and data transfer. Web services are used by approximately 25% of gateways. Work continues to augment the basic functionality provided by Globus with TeraGrid-specific information. For example, job submission interfaces may be further developed to include data that could be used to answer questions such as "Where can I run my 16-processor job that requires 20 MB of memory the soonest? Where can I run an 18-hour, 8-processor job using Gaussian the soonest? Which sites support urgent computing for a 32-processor job?" and "Do any of these selections change if a 25 GB input file needs to be transferred to the site of interest?" Since some gateways provide as well as consume Web services, we are actively working on a registry where gateway developers can list and share services with each other as well as with potential TeraGrid users. Today, researchers interested in making use of software on the TeraGrid check a catalog that lists software available from the command line on all TeraGrid resource provider platforms. In the future, we envision researchers and developers checking a registry of applications, either programmatically or manually.

On the TeraGrid, gateways are primarily enabled by a concept called community accounts. These accounts allow the TeraGrid to delegate account management, accounting, certificate management and user support directly to the gateways. In order to delegate these tasks responsibly, however, TeraGrid must provide management and accounting tools to Gateway developers and store additional attributes for jobs submitted through community accounts. Some resource provider sites further secure these accounts by using the community shell, commsh. Early in the Gateway program, the Globus team developed an extension to Grid Resource Allocation and Management (GRAM) called GRAM audit. This is a fundamental extension that allows Gateway developers to retrieve the number of CPU hours consumed by a grid job after it has completed. With jobs for many independent Gateway users being run from a single community account, such a capability is quite important. Currently the TeraGrid team is developing GridShib for use by Gateway developers so that attributes unique to a Gateway user can be stored when jobs are submitted to the TeraGrid. This will provide the TeraGrid with a programmatic way to count the end users reaching the TeraGrid through gateways, and will also allow the TeraGrid to do per-user accounting for gateways as needed.

Supporting a wide variety of gateways in an environment as extensive as the TeraGrid can be very challenging and requires a real commitment to scalable infrastructure. For example, in 2007 the Linked Environments for Atmospheric Discovery (LEAD) gateway was used by students from X institutions in a weather forecast competition. The nature of the competition led to concentrated bursts of job submissions and data transfer requests.
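Bursts like these mean that gateway submission code has to tolerate transient failures such as busy gatekeepers and timed-out requests. The short Python sketch below shows the retry-with-backoff pattern many gateways adopt; submit_job() is a hypothetical stand-in for whatever middleware client a gateway actually uses (a GRAM client, for example), not a real TeraGrid or Globus API.

    # Minimal sketch of fault-tolerant job submission with retry and backoff.
    # submit_job() is a hypothetical placeholder, not a real TeraGrid/Globus call.
    import random
    import time

    class TransientError(Exception):
        """Raised by submit_job() for recoverable failures (busy gatekeepers, timeouts)."""

    def submit_job(resource, job_description):
        # Placeholder: a real gateway would call its grid middleware client here.
        if random.random() < 0.3:
            raise TransientError("gatekeeper busy")
        return "job-%06d" % random.randint(0, 999999)

    def submit_with_retry(resource, job_description, attempts=5, base_delay=2.0):
        """Retry transient failures with exponential backoff before giving up."""
        for attempt in range(attempts):
            try:
                return submit_job(resource, job_description)
            except TransientError:
                if attempt == attempts - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)  # wait 2 s, 4 s, 8 s, ...

    job_id = submit_with_retry("example.teragrid.org",
                               {"executable": "gaussian", "processors": 8})
    print("submitted", job_id)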
System stability during the competition was not what we had hoped for, and the experience led to an extensive collaboration between Globus team members and TeraGrid staff to understand and address both the hardware and software issues associated with handling the large load. In addition, when using shared resources there are sometimes unscheduled outages. Programming methods used by Gateway developers must be fault-tolerant. As Gateway use of the TeraGrid expands, periods of high load will be even less predictable; a very robust infrastructure must be deployed, and failover models used by developers, to ensure stable services. Addressing these issues continues to be a focus area.

Finally, the TeraGrid has had to adapt its allocation and scheduling policies to meet the needs of gateways. Because usage cannot be planned as it can for a single investigator and his or her research team, flexible allocation policies are needed. When requesting resources, principal investigators must be able to describe in general terms how they expect their Gateway to be used, and the TeraGrid must be able to react if a gateway is more successful than anticipated. Service interruptions must be avoided to ensure the continuity and reliability that are so important to a successful Gateway. Gateways also tend to have greater interactive needs than researchers working at the command line. Gateway developers have been early testers in TeraGrid's metascheduling efforts. Finally, the Special Priority and Urgent Computing Environment (SPRUCE) has been developed within TeraGrid to meet urgent computing needs, where simulations might be required immediately, for example in response to incoming sensor data.

This paper highlights several gateways and includes treatment of their infrastructure, the scientific capabilities available to the research community through each gateway, and the impact they have had. The featured gateways are GridChem, LEAD, nanoHUB.org and the cancer Biomedical Informatics Grid (caBIG).

4. Computational Chemistry Grid (GridChem)
Molecular sciences occupy a central position in research to understand material behavior. Most physical phenomena, from atmospheric and environmental effects to medicinal interactions within cells, are mediated through forces at the molecular and atomic level. Nanodevice modeling and material design also involve atomic-level detail. Chemical, biological and material modeling require access to unique and coupled applications to produce accurate and detailed descriptions of atomic interactions. The insights gained help researchers both to understand known phenomenology and to predict new phenomenology for designing novel and appropriate materials and devices for a given task. Computational chemistry has evolved as a field of study now adopted across multiple disciplines to delineate interactions at the molecular and atomic level and to describe system behavior at a higher scale. The need for integrative services to drive molecular-level understanding of physical phenomena has become acute as we enter an era of rapid prototyping of nanoscale actors, multi-disciplinary engagement with the chemical sciences, and diverse and complex computational infrastructures. Automated computational molecular science services will transform the many fields that depend on such information for routine research, as well as advance fields of societal significance such as safe and effective drug development and the development of sustainable food and energy sources.
A Science Gateway meets researchers' needs for both integrative and automated services. The GridChem Science Gateway serves the computational chemistry community through support from the Computational Chemistry Grid (CCG), funded under the NSF NMI Deployment grants starting in the fall of 2004 [ref GridChem_1]. The goal of CCG and the GridChem Science Gateway is to bring national high performance computing resources to practicing chemists through intuitively familiar native interfaces. Currently the GridChem Science Gateway integrates pre- and post-processing components with the high performance applications and provides data management infrastructure. CCG staff members include chemists, biologists and chemical physicists who design interfaces and integrate applications; programming experts who layer user-friendly interfaces atop the domain requirements; and high performance computing experts who coordinate the use of the integrated services and resources by the community at large. The three-tier architecture depicted in [Figure GridChem] below consists of a client portal, which presents an intuitive interface for data and for pre- and post-processing components; a grid middleware server, which coordinates communication between the client and the deployed applications through Web services; and the applications themselves, deployed on the high performance resources. The software architecture is supported by a consulting portal for user issues and by an education and outreach effort that disseminates its features and engages the community as its needs change.

GridChem Client Portal. The GridChem Client is a Java application that can either be launched from the GridChem web pages or downloaded to the local desktop. It serves as the portal to the Computational Chemistry Grid infrastructure and consists of authentication, graphical user interfaces for molecular building, input generation for various applications, job submission and monitoring, and post-processing modules. Authentication is hierarchical to serve the "Community User" as well as individual users with independent allocations at various HPC resources. The "Community User" is an individual user whose CPU time on both TeraGrid and CCG partnership resources is managed entirely by CCG. The MyProxy credential repository is used to manage credential delegation, and proxy credentials are used for the various services where authentication is required. The job submission interface provides multiple ways of generating input files, a resource discovery module with dynamic information, and job requirement definition options. Multiple independent jobs can be created using this infrastructure, and jobs with diverse requirements can be launched simultaneously. The client also provides a MyCCG module, which is central to job-centric monitoring, post-processing and potential resubmission mechanisms. In addition to job-specific monitoring, MyCCG provides status and usage data for individual users or for all members of a group under a principal investigator.

GridChem Middleware Services. The GridChem Middleware Services provide the service architecture for the GridChem client. The services use Globus-based Web services to integrate applications and hardware. They provide application input specification requirements with default input parameters, queue information, and a simple metascheduling capability based on a "deadline prediction" module supported by the Network Weather Service (NWS) [ref nws].
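The deadline-prediction idea can be illustrated with a short sketch. The Python code below is not the GridChem implementation; it simply assumes that, for each candidate resource, estimates of queue wait and run time are available (the kind of information a prediction service such as NWS can supply) and selects the resource expected to finish soonest within a deadline.

    # Illustrative sketch of deadline-based resource selection (not GridChem code).
    # Assumes per-resource estimates of queue wait and run time are available,
    # e.g. from a prediction service such as NWS.
    from dataclasses import dataclass

    @dataclass
    class ResourceEstimate:
        name: str
        predicted_queue_wait_s: float   # expected time the job sits in the queue
        predicted_run_time_s: float     # expected execution time once started

    def pick_resource(estimates, deadline_s):
        """Return the resource expected to finish soonest, if any meets the deadline."""
        best = min(estimates,
                   key=lambda e: e.predicted_queue_wait_s + e.predicted_run_time_s)
        finish = best.predicted_queue_wait_s + best.predicted_run_time_s
        return best if finish <= deadline_s else None

    estimates = [
        ResourceEstimate("site-a", predicted_queue_wait_s=3600, predicted_run_time_s=7200),
        ResourceEstimate("site-b", predicted_queue_wait_s=600, predicted_run_time_s=9000),
    ]
    choice = pick_resource(estimates, deadline_s=4 * 3600)
    print(choice.name if choice else "no resource meets the deadline")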
The Middleware Services also allow MyCCG to monitor jobs, provide individual and group usage data, and support dynamic data ingestion, with metadata, into information repositories.

Hardware Resources and Application Integration. Several popular applications are deployed and supported by HPC staff at resource provider sites. The GridChem Science Gateway leverages such deployments and abstracts the information needed for their discovery into the GridChem Web Services database. GridChem tests restart or checkpoint capabilities, where available, and periodically updates software as needed. Application software access is controlled at the individual user level for an individual resource.

[Figure GridChem] The CCG virtual organization (VO) for the GridChem Science Gateway, encompassing development and deployment, user services (client portal, allocation management, consulting and troubleshooting), outreach and dissemination (help book, training materials, user surveys, presentations, TG student challenge), grid middleware services, application integration and validation, and hardware resources (non-TG CCG consortium sites, the TeraGrid and the Open Science Grid).

The GridChem Science Gateway is currently in production, used by a community of about 300 researchers who consume about 80,000 CPU hours per quarter. In the last year and a half, at least 15 publications in well-regarded journals have resulted from this usage [ref GridChemScience]. As we deploy advanced features such as metascheduling services and integrate materials and biological modeling applications, we expect rapid growth in usage in the coming years. The resulting scientific data can be mined by the community as a whole, and the development of automated metadata collection services for such an information archive will enhance these data mining capabilities.

5. Linked Environments for Atmospheric Discovery (LEAD)
The National Science Foundation (NSF), as part of the Large Information Technology Research (ITR) program, started the Linked Environments for Atmospheric Discovery (LEAD) project in September 2003. The goal of the project is to fundamentally change mesoscale meteorology (the study of severe weather events like tornadoes and hurricanes) from the current practice of static observations and forecasts to a science of prediction that responds dynamically and adaptively to rapidly changing weather. Currently, we cannot predict tornadoes. However, we can spot conditions that may make them more likely and, when a tornado is spotted, we can track it on radar. Current forecast models are very coarse and they run on fixed time schedules. To make progress, we must change the basic paradigm. More specifically, the LEAD cyberinfrastructure is designed to allow any meteorologist to grab the current weather data and create a specialized high-resolution weather forecast on demand. In addition, the system provides scientists with the ability to create new prediction models using tools that allow the exploration of past weather events and forecasts. Finally, as described in [ref casa-lead], a major goal of the project is to find ways in which the weather forecast system can adaptively interact with the measuring instrumentation in a "closed loop" system. Weather forecasts are, in fact, rather complex workflows that take data from observational sensors, filter and normalize it, assimilate it into a coherent set of initial and boundary conditions, feed that data to a forecast simulation, and finally run various post-processing steps.
In all, a typical forecast workflow may require 7 to 10 data processing or simulation steps, each of which involves moving several gigabytes of data and executing a large program. To accomplish these goals, the LEAD team built a Web-service-based cyberinfrastructure with a Science Gateway portal on the front end and the TeraGrid on the back end for computation, data analysis and storage (see [Figure LEAD]). The Gateway is designed to support a wide variety of meteorology users. On one end, the core audience is mesoscale researchers who are keenly interested in developing new techniques for making accurate on-demand forecasts. On the other end are high school and college students taking classes on weather modeling and prediction. Both groups have significant requirements that impact the design of the system.

The meteorology researchers need the ability to 1. configure forecast workflow parameters, including the geographic region, the mesh spacing and various physics parameters, and then launch the workflow on demand; 2. modify the workflow itself by replacing a standard data analysis or simulation component with an experimental version or, if necessary, by creating a new workflow from scratch (in [ref eScienceWF], we describe the growing importance of workflow technology in contemporary e-Science); 3. treat each forecast workflow execution as a computational experiment whose metadata is automatically cataloged, which allows the scientist to return at a later time to discover and compare all experiments with similar parameter settings, or to retrieve the complete provenance of every data object; and 4. push a critical forecast through as an "urgent project", preempting running programs that are not critical in order to produce a forecast of a storm that may be very damaging.

The meteorology or atmospheric science student needs 1. intuitive, easy-to-use tools for hands-on forecast creation and analysis; 2. mechanisms for students to work in groups and share experiments and data; and 3. on-line education modules that direct the student through the different experiments they can do with the tools and data available through the gateway.

The LEAD gateway is currently in use by the scientists on the LEAD research team. Most recently it has played a role in the spring 2008 tornado forecast experiments through the NOAA National Centers for Environmental Prediction. In the spring of 2007 the LEAD gateway was used by student teams participating in the National Weather Challenge competition.

[Figure LEAD] The Web-service architecture for the LEAD Science Gateway is built from a set of core persistent web services that communicate both directly and through an event notification bus. Key components include a data subsystem (the MyLEAD agent and user metadata catalog) for cataloging the user's experimental results, a workflow execution engine that orchestrates the user's data analysis and forecast simulations, an application factory that manages transient application services which control individual job executions on the back-end TeraGrid computers, and a fault tolerance and scheduling system to handle failures in the workflow execution.
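As a rough illustration of what such a forecast workflow looks like to the orchestration layer, the Python sketch below chains data ingestion, assimilation, forecast and post-processing stages and publishes an event after each one. The stage names and the notify() function are invented for illustration; in LEAD these stages are Web services coordinated by the workflow engine and the notification bus described above.

    # Rough sketch of a staged forecast workflow with event notification.
    # Stage names and notify() are illustrative only; the real LEAD system uses
    # Web services orchestrated by a workflow engine and an event bus.
    def notify(event, detail):
        print("EVENT %-22s %s" % (event, detail))

    def ingest_observations(region):
        notify("ingest.complete", region)
        return {"region": region, "observations": "raw radar and surface data"}

    def assimilate(data):
        notify("assimilation.complete", data["region"])
        return {"region": data["region"], "initial_conditions": "analysis fields"}

    def run_forecast(conditions, hours):
        notify("forecast.complete", "%s, %d h" % (conditions["region"], hours))
        return {"region": conditions["region"], "fields": "forecast model output"}

    def postprocess(forecast):
        notify("postprocess.complete", forecast["region"])
        return "plots and derived products for " + forecast["region"]

    def forecast_workflow(region, hours=18):
        """Each stage consumes the previous stage's products; metadata for every
        intermediate product would be cataloged (in LEAD, by the MyLEAD subsystem)."""
        data = ingest_observations(region)
        conditions = assimilate(data)
        forecast = run_forecast(conditions, hours)
        return postprocess(forecast)

    print(forecast_workflow("Oklahoma panhandle"))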
The LEAD Gateway is the result of a dedicated team of researchers including Suresh Marru, Beth Plale, Marcus Christie, Dennis Gannon, Gopi Kandaswamy, Felix Terkhorn, Sangmi Lee Pallickara and students Scott Jensen, Yiming Sun, Eran Chinthaka, Heejoon Chae, You-Wei Cheah, Chathura Herath, Yi Huang, Thilina Gunarathne, Srinath Perera, Satoshi Shirasuna, Yogesh Simmhan and Alek Slominski.

6. nanoHUB.org
The NSF-funded Network for Computational Nanotechnology (NCN) was founded in 2002 on the premise that computation is underutilized in vast communities of experimentalists and educators. From seven years of experience (1995-2002) delivering online simulations to about 1,000 annual users, we had seen that cyberinfrastructure can lower barriers and lead to the pervasive use of simulation. In the past year, NCN has operated nanoHUB.org as an international cyber-resource for nanotechnology with less than three days of downtime and over 62,000 users in 172 countries, 6,200 of whom ran over 300,000 simulations without ever submitting grid certificates, mpirun commands or the like [ref nanoHUB-stats]. The primary nanoHUB target audience is not the small group of computational scientists who are experts in computing technologies such as HPC, MPI, grids, scheduling and visualization, but researchers and educators who know nothing about these details. Since users may not even have permission to install software, everything should happen through a web browser. Numerical "what if?" experiments must be set up, and data must be explored, interactively and without downloads. Any download or upload must happen with a button click, without any grid awareness. These requirements precluded the use of typical technology, and NCN created a new infrastructure. The impact of interactive simulation is evident in the annualized user graph: in June 2005 we began to convert Web-form applications to full interactivity, and annual simulation user numbers increased six-fold. Web forms were retired in April 2007.

The dissemination of the latest tools demands rapid deployment with minimal investment. This imposed another critical constraint – most nano application developers know nothing about Web or grid services and have no incentive to learn about them. NCN therefore developed Rappture, which manages input and output for codes written in C, Fortran, Matlab, Perl, Tcl, Python, or Ruby, or defined workflow combinations thereof, without any software rewrites, and creates graphical user interfaces automatically (a minimal sketch of a Rappture-wrapped tool appears below). Transferring a code from a research team to a Web team might require years of development to deploy a single tool. Instead we left the tools in the hands of the researchers and enabled them to develop and deploy over 89 simulation tools in less than three years. Our nanoFORGE.org now supports over 200 software projects. nanoHUB's new middleware, Maxwell, is a scalable, stable, and testable production-level system that manages the delivery of the nanoHUB applications transparently to the end user's browser through virtual machines, virtual clusters, and grid computing environments.

nanoHUB also had to deliver online simulation "and more" to support the community with tutorials, research seminars, and a collaboration environment. The community must be able to rate and comment on the content in a Web 2.0 approach and to upload content without administrator intervention. With over 970 "…and more" content items and over 60,000 annual users, nanoHUB has become a Web publisher.
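To give a flavor of how a tool is wrapped, the Python fragment below sketches the shape of a Rappture-based wrapper: the tool reads its inputs from a driver XML file produced by the automatically generated GUI, runs its calculation, writes outputs back, and hands the file to Rappture for rendering. The element paths and the toy calculation are illustrative assumptions; consult the Rappture documentation for the exact interface.

    # Sketch of a Rappture-wrapped tool in Python (element paths are illustrative).
    # The graphical interface is generated automatically from the tool's XML
    # description; this wrapper only reads inputs, computes, and writes outputs.
    import sys
    import Rappture

    io = Rappture.library(sys.argv[1])      # driver file produced by the GUI

    T = float(io.get('input.number(temperature).current'))
    Ef = float(io.get('input.number(Ef).current'))

    # ... the real science code (or a call to a compiled executable) goes here ...
    result = Ef / max(T, 1.0e-6)            # stand-in for the actual calculation

    io.put('output.number(ratio).about.label', 'Ef/T ratio')
    io.put('output.number(ratio).current', str(result))

    Rappture.result(io)                     # hand results back for rendering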
[Figure nanoHUB]: (a) Development of annual nanoHUB simulation users over time. A dramatic increase in simulation users can be observed coincident with the introduction of interactive simulation tools. (b) Interactive x-y simulation data window for the Quantum Dot Lab, powered by NEMO 3D. Users can compare data sets from different simulations (here, quantum dot size variations) by clicking on the data selector, zoom in, read off data by pointing at it, and download data with a single click (green arrow). (c) Volume-rendered 3D wavefunction of the 10th electron state in a pyramidal quantum dot in the Quantum Dot Lab. Users can interactively rotate the view, change isosurfaces and insert cutplanes, all hardware-rendered on a remote server.

Twelve years of online simulation usage data document that the number of tool users is inversely proportional to the tool execution time. Most tools should therefore deliver results in seconds, which implies that they need to be computationally lightweight. Selected capabilities are shown in [Figure nanoHUB]. Once users are satisfied with the general trends, they might consider running finer models that take tens of minutes, the lunch hour, or overnight. This is the opportunity for grid computing to deliver the needed computing power rapidly and reliably. nanoHUB currently delivers 6 tools that are powered by the TeraGrid and the Open Science Grid; 145 users ran more than 2,800 such jobs, consuming more than 7,500 CPU hours and more than 22,000 hours of wall time.

nanoHUB impact can be demonstrated by metrics such as use in over 40 undergraduate and graduate classes at over 18 U.S. institutions in the 2007/08 academic year [ref cite-nanoHUB]. 261 citations refer to nanoHUB in the scientific literature, with about 60% stemming from authors not affiliated with NCN. Even applications that might be considered "educational" by some are used in research work published in high-quality journals, and 21 papers describe nanoHUB use in education. nanoHUB diminishes the distinction between research and educational use. nanoHUB services cross-fertilize various nano subdomains, and the site is broadening its impact to new communities. The NCN is now assembling the overall software, consisting of a customized content management system, Rappture, and Maxwell, into the HUBzero package. HUBzero already powers 4 new HUBs, and over a dozen more are in the pipeline (see HUBzero.org).

7. cancer Biomedical Informatics Grid (caBIG)
In February 2004, the National Cancer Institute (NCI) launched the caBIG™ (cancer Biomedical Informatics Grid™) initiative [ref cabig-1] to speed research discoveries and improve patient outcomes by linking researchers, physicians, and patients throughout the cancer community. Envisioned as a World Wide Web for cancer research, the program brings together a diverse collection of nearly 1,000 individuals and over 80 organizations, including federal health agencies, industry partners, and more than 50 NCI-designated comprehensive cancer centers. Driven by the complexity of cancer and by recent advances in biomedical technologies capable of generating a wealth of relevant data, the program encourages and facilitates the multi-institutional sharing and integration of diverse data sets, enabled by a grid-based platform of semantically interoperable services, applications, and data sources [ref cabig-2].
The underlying grid infrastructure of caBIG™ is caGrid [ref cabig-3], a Globus Toolkit 4-based middleware which provides the implementation of the required core services; toolkits and wizards for the development and deployment of community-provided services; and programming interfaces for building client applications. Since its first prototype release in late 2004, caGrid has undergone a number of community-driven evolutions to its current form, which allows new services and applications that leverage the wealth of data and metadata in caBIG™ to be rapidly created and used by members of the community without any experience in grid middleware.

Once the foundation for semantic interoperability was established and a number of data sources were developed, attention to the processing of vast amounts of data (some of it very large) became more of a priority for the caBIG™ community. While a plethora of software and infrastructure exists to address this problem, there are a few impediments to leveraging it in the caBIG™ environment. The first is that the majority of the participants in the program do not generally have access to the High Performance Computing (HPC) hardware necessary for such compute- or data-intensive analysis, and the program's voluntary, community-driven, and federated nature precludes such centralized large-scale resources. Secondly, even given access to such resources, the community of users that would utilize them generally does not have the experience necessary to leverage the tools of the trade commonly associated with such access (such as low-level grid programming or executable-oriented batch scheduling). That is, a large emphasis of caGrid is making the grid more accessible, allowing the developer community to focus on creating science-oriented applications aimed at cancer researchers. Compounding this issue is the fact that services in the caBIG™ environment present an information-model view of the data over which they operate. This information model provides the foundation for semantic annotations drawn from a shared controlled terminology which, when exposed as metadata of a grid service, enable the powerful programmatic discovery and inference capabilities key to the caGrid approach. This model is in stark contrast to the traditionally file- and job-oriented views provided by HPC infrastructure.

The TeraGrid Science Gateway program provides a convenient paradigm for addressing both of these issues. First, it offers an organization and corresponding policies by which communities of science-oriented users can gain access to the wealth of HPC infrastructure provided by TeraGrid under the umbrella of a common goal (as opposed to each researcher being responsible for obtaining TeraGrid allocations). Second, it provides a pattern by which such infrastructure can be abstracted away from the presentation of the resources to the consuming researcher. The most common manifestation of this pattern is the use of a web application, or portal, which provides researchers a science-oriented interface enabled by the back-end HPC resources of TeraGrid. In caBIG™, however, the researcher community uses applications (which may be web portals or desktop applications) that leverage the caBIG™ grid in the form of a variety of grid services. A caGrid Science Gateway must therefore be a grid service that bridges the TeraGrid and caBIG™ grids by virtualizing access to TeraGrid resources behind a scientifically oriented caGrid service.
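The essence of this bridging pattern can be sketched as follows. All names here are invented for illustration, and the real gateway is a strongly typed caGrid (Globus 4) grid service rather than Python, but the division of responsibilities is the same: callers supply domain objects such as a gene-expression matrix, while the service hides every TeraGrid-facing detail (accounts, allocations, job descriptions and file staging).

    # Conceptual sketch of a caGrid-style gateway service bridging to TeraGrid.
    # All names are invented; TeraGridBackend stands in for the community-account
    # job submission and file-staging machinery.
    class TeraGridBackend:
        """Hides accounts, allocations, job descriptions and data staging."""
        def run(self, executable, inputs):
            # A real backend would stage files, submit under a community account,
            # poll for completion, and fetch results.
            return {"clusters": [["geneA", "geneC"], ["geneB"]]}

    class ClusterAnalysisGatewayService:
        """Domain-oriented interface: callers see expression data, not jobs."""
        def __init__(self, backend):
            self.backend = backend

        def hierarchical_clustering(self, expression_matrix, linkage="average"):
            staged = {"matrix": expression_matrix, "linkage": linkage}
            result = self.backend.run("hier_cluster", staged)
            return result["clusters"]

    service = ClusterAnalysisGatewayService(TeraGridBackend())
    clusters = service.hierarchical_clustering(
        {"geneA": [1.0, 2.0], "geneB": [5.0, 4.0], "geneC": [1.1, 2.1]})
    print(clusters)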
The users of the gateway service may be completely unaware that its implementation uses TeraGrid, in much the same way that a traditional Science Gateway portal user need not be aware of, or concerned about, such issues as TeraGrid accounts, allocations, or resource-providing service interfaces (such as the Grid Resource Allocation and Management (GRAM) service, for example). In our initial reference implementation, we provided a gateway grid service which performed hierarchical cluster analysis of genomic data [ref cabig-4]. This routine was selected because it was an existing analysis routine, available to the caBIG™ community, with only modest computational needs; we aimed to concentrate our efforts on understanding the policies, processes, and technologies necessary to reproduce this gateway approach for numerous other routines and problem domains of caBIG™. We were able to demonstrate the utility of this approach by using the gateway service from the existing geWorkbench desktop application to perform an analysis and visualize the results. Beyond the actual software created in this effort, a set of best practices and guidance documents was created to better inform future efforts.

The initial prototype implementation of a caGrid Science Gateway service was extremely well received by the caBIG™ community and laid the groundwork, in terms of process and implementation pattern, for a large number of additional gateway services to be developed for other problem domains. To achieve this goal, however, additional tooling will likely have to be created to simplify the process further. This investment is likely worth the effort: allowing caBIG™ application developers to continue to work against a common environment of strongly typed, semantically described grid services while harnessing the computing power of TeraGrid is a powerful notion, given the large quantities of data and analysis routines expected to be developed in caBIG™.

8. Conclusions
A number of lessons have been learned in the first phase of Science Gateway construction, testing and use, and we list several of the most significant ones here. While the impact on a scientific field can be tremendous, one major lesson is that the effort required to build a gateway from scratch can also be very large. It requires a substantial team of both cyberinfrastructure specialists and domain scientists to design and construct the core software and services. Consequently, the TeraGrid Gateway program has focused on gathering and deploying common Gateway building blocks, such as portal frameworks, security tools (certificate repositories and auditing tools), common services, re-usable workflow engines, application "wrapping" and deployment tools, and directory services. A reasonable goal for the Gateway program should be to provide any team of dedicated domain experts with a toolkit, training documentation, and technical assistance that would make it possible for them to build and populate a gateway in a week. While we are not there yet, we feel it is within reach. The HUBzero framework that powers nanoHUB.org is currently being deployed for 5 other HUBs and will be available as open source in 2009.

Traditionally, computational scientists have been very patient and persistent, willing to deal with complex hardware and software systems that may have high failure rates.
Because supercomputers still use batch queues, the user must sometimes wait for hours, without any intermediate feedback, to see whether the latest changes to a runtime script will result in a successful execution. Users of Science Gateways, on the other hand, expect nothing less than the ease of use, on-demand service and reliability that they have grown to expect from commercial portals that cater to on-line financial transactions or information retrieval. The key emphasis must be placed on true usability. A gateway that requires system administrator knowledge for even simple operations cannot spread science into new user bases. User-friendly and stable gateways therefore have the opportunity to open HPC modeling capabilities to a broad user base that would never have been reached by UNIX-based TeraGrid access. TeraGrid has had to devote a substantial effort to making its grid infrastructure reliable and scalable. This has been especially challenging because a grid of supercomputers is not the same as a commercial data center that has been designed from the ground up to support very large numbers of interactive remote users.

Finally, as the examples in this paper illustrate, the most important challenge in building a good science gateway is to understand where the gateway can truly help solve problems in the targeted domain. It is frequently the case that the greatest benefit to the users is easy access to data and to the tools that can search, filter and mine it. However, to make a truly significant impact, the science gateway must provide more capability than a user can find on their desktop machine. Yet the gateway must be as easy to operate as any desktop software, without any grid- or HPC-specific knowledge: the gateway must be an extension of the desktop to a more powerful and connected domain. In the future we expect gateways to become integrated with social networking technologies that can aid students and enable a community of scientists to share tools and ideas. Gateway usage can be monitored relatively easily in terms of number of users, simulation runs, CPU time consumed, papers published, and so on; see for example [ref nanoHUB-stats] or [ref GridChemScience]. However, true impact on science and education should be formally measured in classroom usage, citations in the literature, and user case studies, metrics that are more difficult to gather.

9. References
[ref TeraGrid] www.teragrid.org
[ref Globus] www.globus.org
[ref GridChem_1] Rion Dooley, Kent Milfeld, Chona Guiang, Sudhakar Pamidighantam, Gabrielle Allen, "From Proposal to Production: Lessons Learned Developing the Computational Chemistry Grid Cyberinfrastructure", Journal of Grid Computing, 2006, 4, 195-208.
[ref nws] Network Weather Service, supported by the NSF NMI Program under Award #04-38312, Oct. 2005 - Oct. 2007, http://nws.cs.ucsb.edu/ewiki/
[ref GridChemScience] https://www.gridchem.org/papers/index.shtml
[ref casa-lead] Beth Plale, Dennis Gannon, Jerry Brotzge, Kelvin Droegemeier, Jim Kurose, David McLaughlin, Robert Wilhelmson, Sara Graves, Mohan Ramamurthy, Richard D. Clark, Sepi Yalda, Daniel A. Reed, Everette Joseph, V. Chandrasekar, "CASA and LEAD: Adaptive Cyberinfrastructure for Real-Time Multiscale Weather Forecasting", IEEE Computer, Volume 39, Issue 11, November 2006.
[ref eScienceWF] Yolanda Gil, Ewa Deelman, Mark Ellisman, Thomas Fahringer, Geoffrey Fox, Dennis Gannon, Carole Goble, Miron Livny, Luc Moreau, Jim Myers, "Examining the Challenges of Scientific Workflows", IEEE Computer, Volume 40, No. 12, pp. 24-32, December 2007.
[ref nanoHUB-stats] Almost all nanoHUB usage statistics are openly available at http://nanoHUB.org/usage.
[ref cite-nanoHUB] Gerhard Klimeck, Michael McLennan, Sean B. Brophy, George B. Adams III, Mark S. Lundstrom, "nanoHUB.org: Advancing Education and Research in Nanotechnology", accepted for publication in Computing in Science and Engineering (CiSE), special issue on education.
[ref cabig-1] http://cabig.cancer.gov/
[ref cabig-2] Kenneth H. Buetow, "Cyberinfrastructure: Empowering a 'Third Way' in Biomedical Research", Science 308 (5723), 821, 6 May 2005. DOI: 10.1126/science.1112120.
[ref cabig-3] Scott Oster, Stephen Langella, Shannon L. Hastings, David W. Ervin, Ravi Madduri, Joshua Phillips, Tahsin M. Kurc, Frank Siebenlist, Peter A. Covitz, Krishnakant Shanbhag, Ian Foster, Joel H. Saltz, "caGrid 1.0: An Enterprise Grid Infrastructure for Biomedical Research", Journal of the American Medical Informatics Association, 2007, 15, pp. 138-149.
[ref cabig-4] Christine Hung, "Integrating caGrid and TeraGrid", http://www.tacc.utexas.edu/tg08/index.php?m_b_c=accepted