Building a Community of Expertise for Computational Chemistry

Some notes:
1 – Need a name for the proposal and system.
2 – For obvious reasons, I have emphasized the systems aspects. Your expertise is
needed to discuss the application domain, the recommender system, and much more.
3 – Security still needs to be addressed in detail. We can employ GSI to do the right
thing, but we need to have a careful statement of how this can be done without requiring
expertise from end users.
Summary
Intellectual Merit
Broader Impact
Building a Community of Expertise for Computational Chemistry
Introduction
Despite the advent of inexpensive high performance computing clusters, domain scientists struggle to make effective use of computing hardware. For a few tens of thousands of dollars, a researcher may purchase a modest cluster of commodity hardware for performing simulations. Despite the low acquisition price, the effective use of a cluster requires a significant degree of expertise in computing, much greater than that required to employ a single machine. One must become a skilled system administrator to install and configure each machine, gain expertise in distributed file and batch systems, and acquire very detailed technical knowledge of simulation software. These requirements pose a very large barrier to participation in computational science and impose a significant overhead in effort even on those willing to pay the price.
Our objective is to make participation in computational science on privately-owned
resources accessible to domain scientists with minimal expertise in system management.
To accomplish this, we propose to build a distributed system that simplifies and directs
the installation, configuration, and operation of clusters for computational science. The
key property of this system is that it allows for the sharing of scientific expertise among a
community of researchers. New users can be guided in their selection of software and
parameters by the expertise of existing users. As new hardware and software are brought
online, they may be automatically validated against trustworthy results computed
elsewhere. As new software and new techniques are developed, users are apprised of
these new developments and may choose to have them deployed automatically.
We propose to construct such a community of expertise for computational chemistry.
The portal and its initial users will be located at the University of Notre Dame; however, the community will be open to interested researchers worldwide. The initial users will be
X, Y, and Z. The applications will be A, B, and C. Although this research will serve
primarily the computational chemistry community, the system design and experience
gained should be applicable to other domains.
Example Use Case
Researcher X purchases a 16-node computing cluster for simulation work. X knows a
little bit about Unix, so he is able to install the cluster, compile a simulation, and perform some initial simulation work. However, after reinstalling several broken nodes, getting stuck
on a failed compile, and struggling to configure the inputs to simulation Z, real work has
ground to a halt. Researcher X learns of our system and gains new hope. He downloads a
small kernel of software from the project web site, and deploys it on each machine in the
cluster. He then logs into the portal web page and sees a list of his own machines, a list
of available simulation techniques, and a list of researchers performing similar work. (?)
Using an interactive recommendation system, X discovers that software S is appropriate
for technique Z. X selects “install and validate S”, and then takes a lunch break. Upon returning, X finds that the portal reports the software has been installed and the validation suite has succeeded. Researcher X then uses the portal to queue large amounts of simulation work to be run on his own machines. After several months of successful work, the portal notifies X that a new version of simulation S is available and offers to install it. X wants the new features, and so he assents, but indicates that the installation should not occur until the current batch of work is complete. When the queued work is complete, the system installs the new version of S, but discovers that the new version fails its validation tests. The system automatically backs out the new version and informs both X and the developer. X’s machines are left in an operable state with the old version, and X can continue to work happily.
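The deferred upgrade-and-rollback behavior at the end of this scenario can be sketched in code. The Cluster interface below and all of its methods are assumptions made for exposition, not an existing API:

    class Cluster:
        """Stand-in for a portal-managed cluster; every method is assumed."""
        def wait_until_queue_empty(self):
            pass                                  # block until queued jobs finish
        def installed_version(self, pkg):
            return "1.0"                          # pretend version for the sketch
        def install(self, pkg, version):
            print("installing", pkg, version)
        def run_validation_suite(self, pkg):
            return False                          # simulate a failed validation
        def notify(self, who, msg):
            print(who, "<-", msg)

    def deferred_upgrade(cluster, pkg, new_version):
        cluster.wait_until_queue_empty()          # honor "after the current batch"
        old = cluster.installed_version(pkg)
        cluster.install(pkg, new_version)
        if cluster.run_validation_suite(pkg):
            cluster.notify("owner", "upgrade to %s validated" % new_version)
        else:
            cluster.install(pkg, old)             # back out the failed new version
            cluster.notify("owner", "validation failed; %s restored" % old)
            cluster.notify("developer", "%s %s fails validation" % (pkg, new_version))

    deferred_upgrade(Cluster(), "S", "2.0")

The essential property is that validation gates every change: a failed upgrade leaves the machines in their last known-good state.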
System Architecture
Figure XX shows the architecture of this system. It has four key components: a kernel of
easily deployed software, a portal for directing the use of each cluster, a library of
simulation software, and a recommender system for guiding the selection and execution
of simulations. This architecture allows the system to assist the user in several distinct
activities.
Assistance with System Deployment. This project will develop a rapidly deployable
software kernel for converting stock computers into members of a community of
expertise. A major deliverable will be a small, self-contained, self-installing software
package that the investigator may deploy on any machines to which he or she has access.
This easily-installed software will unpack the necessary components onto the machine
and make itself known to the portal. Once installed, the investigator may contact the
portal via a web browser, and then direct the organization and application of his/her own
machines. Drawing on the experience of others, the investigator may direct the portal to
suggest configurations, install software, and run simulations.
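As a rough illustration of the kernel's "make itself known to the portal" step, registration might look like the following Python sketch. The portal URL and the message format are hypothetical assumptions, not a defined interface:

    import json
    import platform
    import socket
    import urllib.request

    PORTAL_URL = "https://portal.example.edu/register"   # hypothetical endpoint

    def register_with_portal():
        # Describe this machine so the portal can list it for its owner.
        info = {
            "hostname": socket.gethostname(),
            "os": platform.system(),
            "arch": platform.machine(),
        }
        req = urllib.request.Request(
            PORTAL_URL,
            data=json.dumps(info).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as response:
            return response.status                       # 200 on success

In a real deployment this exchange would be authenticated (for example, with GSI credentials installed by the kernel), but that detail is omitted here.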
It is vital to note that the individual does not cede control of his or her machines to the
central system. Unlike systems such as SETI@home (citation) or BOINC (citation), the
central server does not dictate what problems are to be attacked. Unlike a grid computing
system (citations), the individual is not sending work to a remote site to be completed.
Rather, the individual investigator retains complete control of how his resources are to be
used: nothing is done to a machine without the explicit consent of its owner. The central
service is a point of contact that simplifies installation, recommends configuration, and
shares expertise among multiple researchers.
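This consent model can be made concrete: each machine carries an owner-written policy, and the kernel refuses any portal-directed action the policy does not explicitly allow. A minimal sketch, with policy fields that are purely illustrative:

    # Owner-editable policy; the default answer to any unlisted action is "no".
    OWNER_POLICY = {
        "install": True,        # owner permits portal-directed installs
        "auto_upgrade": False,  # upgrades require explicit approval
        "remote_jobs": False,   # no outside work may run on these machines
    }

    def permitted(action):
        """Allow an action only if the owner has explicitly consented to it."""
        return OWNER_POLICY.get(action, False)

    assert permitted("install")
    assert not permitted("remote_jobs")   # the central server cannot dictate work

The salient design choice is default deny: an action absent from the policy is refused.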
The components of the rapidly deployable kernel will be drawn primarily from the NSF
National Middleware Initiative (NMI) toolkit for grid computing. The kernel will contain
components of the Condor system for distributed computing, the Globus Grid Security
Infrastructure (GSI), and other components as necessary. These powerful but complex tools will be employed in the kernel but will generally remain hidden from the end user behind an installation facade. Dr. Thain has extensive experience using these tools to
build distributed systems in many settings. (citation) Most recently, the BAD-FS system
for data-intensive distributed computing was based on a rapidly-deployable software
kernel that was employed on a multi-institutional system engaging 160 independent
nodes. (citation)
By taking the first step of deploying a software kernel, the investigator has joined a
community of researchers. The collective expertise of this community may then be applied in several distinct ways:
Assistance with Software Selection. Describe the recommender system here...
Assistance with Validation. The deployment of new hardware or software for simulation requires validation: it must be verified that the system ensemble produces correct results. However, validation of computing systems can be frustratingly subtle.
Clearly, simulation results are affected by processor architecture and application software. However, they can also be influenced by changes in system libraries, varying installation methods, lossy network devices, defective memory chips, and faulty storage devices. Thus, any conservative scientific activity requires that every computing installation be validated independently at installation as well as after even the most benign system configuration changes. Re-validation may also be required after simulation software is upgraded or patches are applied.
Naturally, the process of validation on every minor change can be both time-consuming and technically complex. However, given the extant ability of the system to configure
and execute code on remote machines, validation is drastically simplified. The user may
enter the portal, select the desired software suite, select “validate” and then return when
the process is complete. By decreasing the effort to perform validation, this system both
decreases the expense of system changes and raises the trustworthiness of systems and
the results that they produce.
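In the simplest case, such a validation suite re-runs reference problems and compares the output against results the community already trusts, within a stated tolerance. A minimal sketch, assuming scalar numerical results (the case names and tolerance are illustrative):

    def validate(run_case, trusted_results, rel_tol=1e-6):
        """run_case(name) -> float; trusted_results maps name -> trusted value.
        Returns the cases whose results drift beyond the tolerance."""
        failures = []
        for name, trusted in trusted_results.items():
            observed = run_case(name)
            if abs(observed - trusted) > rel_tol * abs(trusted):
                failures.append((name, observed, trusted))
        return failures

    # Toy usage: an installation that disagrees on one reference case.
    print(validate(lambda name: {"case1": 1.0, "case2": 2.5}[name],
                   {"case1": 1.0, "case2": 2.0}))   # -> [('case2', 2.5, 2.0)]

Real molecular simulations are often stochastic and may require statistical comparisons rather than a fixed tolerance; the sketch only fixes the shape of the check.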
Assistance with Confirmation. ((Perhaps we can allow users to request other sites to
confirm interesting results by reproducing them?))
Assistance with Software Management. Working software is in a constant state of flux.
New hardware, new operating systems, new features, and new bugs ensure that
simulation software is constantly developing. (For example, PROTOMOL has X versions in Y years...) This constant evolution is both a curse and a blessing. On the one hand, users wish to keep up to date with software so that they can harness new platforms and employ new features. On the other hand, users are reluctant to upgrade software for fear of encountering compatibility problems or new simulation errors. The changing landscape of computing is also a burden for developers, who may not have the resources to test their software on every platform in which their users are interested.
The collected expertise of the community can be applied to the problem of software
management. As new versions of software are entered into the system, a selection of
users will be asked if they wish to test the new software. If so, the system will download,
install, validate, and benchmark the software in an unattended mode. The results are
made known to the requesting user. More importantly, the results are made known to the
community. Both early adopters and mainstream users may query the system for
compatibility, performance, and validation matrices showing how well various software
packages perform on different platforms. This may be used to guide upgrade decisions as
well as future hardware purchases.
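The matrices mentioned above might be assembled from individual unattended test reports along the following lines. The record layout and the package name "SimS" are illustrative assumptions:

    # One record per unattended test run reported back to the portal.
    RESULTS = [
        # (package, version, platform,     validated, runtime_sec)
        ("SimS",    "2.1",   "linux-x86",  True,      512.0),
        ("SimS",    "2.1",   "linux-ia64", False,     None),   # failed validation
        ("SimS",    "2.0",   "linux-x86",  True,      547.0),
    ]

    def validation_matrix(package):
        """Map each (version, platform) pair to its validation status."""
        return {(ver, plat): ok
                for pkg, ver, plat, ok, _ in RESULTS
                if pkg == package}

    print(validation_matrix("SimS"))
    # {('2.1', 'linux-x86'): True, ('2.1', 'linux-ia64'): False,
    #  ('2.0', 'linux-x86'): True}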
Conversely, the authors of such software may take advantage of the expertise generated
by the combined users of the system. We fully expect that the mix of hardware and
operating systems within the community will change over the course of the project. A
diverse pool of hardware and software forms an ideal test lab for developers. Given the
explicit permission of machine owners, the system may automatically deploy, test, and validate new versions of software on multiple platforms. Developers may be apprised of
both compatibility and correctness errors on a wide variety of systems at once, thus
improving productivity and software quality.
Related Work
BioSimGrid – Different community, centralized control.
NCBI – Different community, centralized resources, static (and small) software base.
Grid in general – Makes resources accessible to outsiders.
P2P in general – Donates resources to central project.
ROCKS – Allows for the configuration of clusters and system software, but not the application software per se.
Relation to Other Projects
GEMS
Plan of Work
The plan of work spans four tracks: System Deployment, Recommender System, Validation and Confirmation, and Software Management. Milestones by year:
2006 – Software Kernel; Simple Portal; Static w/ Kernel
2007 – Remote Software Deployment; Manual Additions
2008 – Automatic Config of New Nodes; Automatic Upgrade/Backout
2009 – Resource Sharing; Integrate with Software Eng. Methods
2010 –
Personnel
Dr. Izaguirre
Dr. Thain
Postdoc
Programmer
Grad 1 –
Grad 2 –
Grad 3 –
Users and Applications