Some notes:
1 – Need a name for the proposal and system.
2 – For obvious reasons, I have emphasized the systems aspects. Your expertise is needed to discuss the application domain, the recommender system, and much more.
3 – Security still needs to be addressed in detail. We can employ GSI to do the right thing, but we need a careful statement of how this can be done without requiring expertise from end users.

Summary

Intellectual Merit

Broader Impact

Building a Community of Expertise for Computational Chemistry

Introduction

Despite the advent of inexpensive high performance computing clusters, domain scientists struggle to make effective use of computing hardware. For a few tens of thousands of dollars, a researcher may purchase a modest cluster of commodity hardware for performing simulation. Despite the low acquisition price, the effective use of a cluster requires a significant degree of computing expertise, much greater than that required to employ a single machine. One must become a skilled system administrator to install and configure each machine, gain expertise in distributed file and batch systems, and acquire very detailed technical knowledge of simulation software. These requirements create a large barrier to participation in computational science and impose a significant overhead in effort even on those willing to pay the price. Our objective is to make participation in computational science on privately owned resources accessible to domain scientists with minimal expertise in system management. To accomplish this, we propose to build a distributed system that simplifies and directs the installation, configuration, and operation of clusters for computational science. The key property of this portal is that it allows scientific expertise to be shared among a community of researchers. New users can be guided in their selection of software and parameters by the expertise of existing users.
As new hardware and software are brought online, they may be automatically validated against trustworthy results computed elsewhere. As new software and new techniques are developed, users are apprised of these developments and may choose to have them deployed automatically. We propose to construct such a community of expertise for computational chemistry. The portal and its initial users will be located at the University of Notre Dame; however, the community will be open to interested researchers worldwide. The initial users will be X, Y, and Z. The applications will be A, B, and C. Although this research will primarily serve the computational chemistry community, the system design and experience gained should be applicable to other domains.

Example Use Case

Researcher X purchases a 16-node computing cluster for simulation work. X knows a little about Unix, so he is able to install the cluster, compile a simulation, and perform a little simulation work. However, after reinstalling several broken nodes, getting stuck on a failed compile, and struggling to configure the inputs to simulation Z, real work has ground to a halt. Researcher X learns of our system and gains new hope. He downloads a small kernel of software from the project web site and deploys it on each machine in the cluster. He then logs into the portal web page and sees a list of his own machines, a list of available simulation techniques, and a list of researchers performing similar work. (?) Using an interactive recommendation system, X discovers that software S is appropriate for technique Z. X selects “install and validate S” and then takes a lunch break. Upon returning, he finds that the portal reports that the software has been installed and the validation suite has succeeded. Researcher X then uses the portal to queue large batches of simulation work to be run on his own machines.
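The “install and validate” step in this use case can be thought of as a small transaction: a package becomes the active version only if its validation suite succeeds; otherwise the machine is left in its prior working state. The sketch below is purely illustrative — the function, version names, and validation callback are hypothetical, not part of any planned interface; the real kernel would drive tools such as Condor and GSI on the user's behalf.

```python
# Illustrative sketch (all names hypothetical) of a transactional
# install-and-validate step as described in the use case.

def install_and_validate(installed, candidate, validate):
    """Attempt to make `candidate` the active version.

    installed -- currently working version, or None on a fresh machine
    candidate -- version the portal proposes to install
    validate  -- callable(version) -> bool; runs the validation suite
    Returns (active_version, outcome).
    """
    if validate(candidate):
        return candidate, "installed"
    # Back out: the machine stays operable with the prior version,
    # and both the user and the developer can be notified.
    return installed, "rolled back"

# Example: suppose only version S-1.0 passes the validation suite.
passes_suite = {"S-1.0"}
validate = lambda version: version in passes_suite

active, outcome = install_and_validate(None, "S-1.0", validate)       # fresh install
active2, outcome2 = install_and_validate("S-1.0", "S-2.0", validate)  # failed upgrade
```

The point of the design is that failure is a first-class outcome: a failed validation never leaves the cluster in a broken state, which is what makes unattended operation acceptable to a cautious researcher.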
After several months of successful work, the portal notifies X that a new version of simulation S is available and offers to install it. X wants the new features, and so he assents, but indicates that the installation should not occur until after the current batch of work is complete. When the queued work is complete, the system installs the new version of S, but discovers that the new version fails its validation tests. The system automatically backs out the new version and informs both X and the developer. X’s machines are left in an operable state with the old version, and X can continue to work happily.

System Architecture

Figure XX shows the architecture of this system. It has four key components: a kernel of easily deployed software, a portal for directing the use of each cluster, a library of simulation software, and a recommender system for guiding the selection and execution of simulations. This architecture allows the system to assist the user in several distinct activities.

Assistance with System Deployment. This project will develop a rapidly deployable software kernel for converting stock computers into members of a community of expertise. A major deliverable will be a small, self-contained, self-installing software package that the investigator may deploy on any machines to which he or she has access. This easily installed software will unpack the necessary components onto the machine and make itself known to the portal. Once installed, the investigator may contact the portal via a web browser and direct the organization and application of his or her own machines. Drawing on the experience of others, the investigator may direct the portal to suggest configurations, install software, and run simulations. It is vital to note that the individual does not cede control of his or her machines to the central system. Unlike systems such as Seti@Home (citation) or BOINC (citation), the central server does not dictate what problems are to be attacked.
Unlike a grid computing system (citations), the individual is not sending work to a remote site to be completed. Rather, the individual investigator retains complete control of how his resources are used: nothing is done to a machine without the explicit consent of its owner. The central service is a point of contact that simplifies installation, recommends configurations, and shares expertise among multiple researchers. The components of the rapidly deployable kernel will be drawn primarily from the NSF National Middleware Initiative (NMI) toolkit for grid computing. The kernel will contain components of the Condor system for distributed computing, the Globus Grid Security Infrastructure (GSI), and other components as necessary. These powerful but complex tools will be employed in the kernel but will generally remain hidden from the end user behind an installation facade. Dr. Thain has extensive experience using these tools to build distributed systems in many settings. (citation) Most recently, the BAD-FS system for data-intensive distributed computing was based on a rapidly deployable software kernel that was employed in a multi-institutional system engaging 160 independent nodes. (citation) By taking the first step of deploying a software kernel, the investigator joins a community of researchers. The collective expertise of this community may then be applied in several distinct ways:

Assistance with Software Selection. Describe the recommender system here...

Assistance with Validation. The deployment of new hardware or software for simulation requires validation: it must be verified that the system ensemble produces correct results. However, validation of computing systems can be frustratingly subtle. Clearly, simulation results are affected by processor architecture and application software.
However, they can also be influenced by changes in system libraries, varying installation methods, lossy network devices, defective memory chips, and faulty storage devices. Thus, any conservative scientific activity requires that every computing installation be validated independently at installation time as well as after even the most benign system configuration changes. Re-validation may also be required after simulation software is upgraded or patches are applied. Naturally, the process of validating every minor change can be both time consuming and technically complex. However, given the system's existing ability to configure and execute code on remote machines, validation is drastically simplified. The user may enter the portal, select the desired software suite, select “validate,” and then return when the process is complete. By decreasing the effort needed to perform validation, this system both decreases the expense of system changes and raises the trustworthiness of systems and the results they produce.

Assistance with Confirmation. ((Perhaps we can allow users to request other sites to confirm interesting results by reproducing them?))

Assistance with Software Management. Working software is in a constant state of flux. New hardware, new operating systems, new features, and new bugs ensure that simulation software is constantly developing. (For example, PROTOMOL has had X versions in Y years...) This constant evolution is both a curse and a blessing. On one hand, users wish to keep up to date with software so that they are able to harness new platforms and employ new features. On the other hand, users are reluctant to upgrade software for fear of encountering compatibility problems or new simulation errors. The changing landscape of computing is also a burden for developers, who may not have the resources to test software on all the platforms that interest their users.
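As a concrete sketch of what the validation step described above might check: assuming the portal publishes reference inputs together with observables computed at a trusted site, a local run can be compared against those values within a numerical tolerance that absorbs benign floating-point variation across platforms. All names and numbers below are illustrative placeholders, not real chemistry data or a committed interface.

```python
import math

# Hypothetical validation check: compare locally computed observables
# against trusted reference results, within a relative tolerance.

def failed_cases(local, trusted, rel_tol=1e-6):
    """Return the names of validation cases whose local result deviates
    from the trusted reference by more than the relative tolerance."""
    return [name for name, expected in sorted(trusted.items())
            if not math.isclose(local.get(name, float("nan")),
                                expected, rel_tol=rel_tol)]

# Illustrative reference and local values (placeholders, not real data).
trusted = {"case_a_energy": -152.067100, "case_b_energy": -8.610000}
good    = {"case_a_energy": -152.067100, "case_b_energy": -8.610000}
bad     = {"case_a_energy": -152.067100, "case_b_energy": -8.590000}
```

Here `failed_cases(good, trusted)` is empty, while `failed_cases(bad, trusted)` flags the deviating case. A real suite would also record the platform, library versions, and installation method alongside the result, so the community matrices described below can be built from the same data.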
The collected expertise of the community can be applied to the problem of software management. As new versions of software are entered into the system, a selection of users will be asked whether they wish to test the new software. If so, the system will download, install, validate, and benchmark the software in an unattended mode. The results are made known to the requesting user. More importantly, the results are made known to the community. Both early adopters and mainstream users may query the system for compatibility, performance, and validation matrices showing how well various software packages perform on different platforms. This information may be used to guide upgrade decisions as well as future hardware purchases. Conversely, the authors of such software may take advantage of the expertise generated by the combined users of the system. We fully expect that the mix of hardware and operating systems within the community will change over the course of the project. A diverse pool of hardware and software forms an ideal test lab for developers. Given the explicit permission of machine owners, the system may automatically deploy, test, and validate new versions of software on multiple platforms. Developers may be apprised of both compatibility and correctness errors on a wide variety of systems at once, thus improving productivity and software quality.

Related Work

BioSimGrid – Different community, centralized control.
NCBI – Different community, centralized resources, static (and small) software base.
Grid in general – Makes resources accessible to outsiders.
P2P in general – Donates resources to central project.
ROCKS – Allows for the configuration of clusters and system software, but not the simulation software per se.
Relation to Other Projects

GEMS

Plan of Work

(Thrusts: System Deployment, Recommender System, Validation and Confirmation, Software Management.)
2006 – Software Kernel; Simple Portal; Static w/ Kernel
2007 – Remote Software Deployment; Manual Additions
2008 – Automatic Config of New Nodes; Automatic Upgrade/Backout
2009 – Resource Sharing; Integrate with Software Eng. Methods
2010 –

Personnel

Dr. Izaguirre
Dr. Thain
Postdoc Programmer
Grad 1 –
Grad 2 –
Grad 3 –

Users and Applications