MDSimAid: A Recommender for Optimizing Electrostatic Algorithms

MDSimAid: A Recommender for Optimizing
Electrostatics Algorithms for Molecular Dynamics
Abstract
Using molecular dynamics (MD) to model the behavior of biological molecules is
an important but difficult task. Users need to understand the relevant theory in molecular
biology or biochemistry, the numerical methods involved, and the operation of
standard MD packages. We present a concrete example, a program written in Python
called MDSimAid, that helps users prepare molecular systems for MD simulations
with the goal of achieving the desired accuracy in the shortest time possible. MDSimAid
incorporates ideas from recommender systems and adaptive software to simplify the tasks
of setting up a computer model for MD and of selecting the optimal algorithms and
parameter sets based on real-time analysis and data mining techniques. MDSimAid
prepares input files for CHARMM, a standard MD package, and interacts with
ProtoMol, an object-oriented framework for MD, to select the best algorithm for handling
full electrostatics and to recommend the parameter set with which that algorithm attains
the desired accuracy in the least time.
Introduction
Molecular modeling has become a popular field, and researchers want to take
advantage of modern computing power to study more complex phenomena, such as
protein motions, that conventional methods such as nuclear magnetic
resonance (NMR) and X-ray crystallography cannot capture because they provide only static views.
However, performing molecular simulations can be complicated and often
requires time and experience to master. Scientists need to learn to
work with the MD package and understand the keywords, structure and format of the
configuration files, and many other minor but necessary details. This can be a daunting task,
especially when the documentation is difficult to navigate, hard to
search, and full of domain-specific terminology.
With the growing need to solve the N-body problem in MD with a time
complexity better than $O(N^2)$, computer scientists and numerical analysts have
developed faster but unfortunately more complicated algorithms, such as Ewald [], particle-mesh Ewald (PME) [], and multigrid (MG) []. A thorough understanding of these
algorithms and of the relationships among their parameters comes only with time
and experience, yet this knowledge is essential for selecting the algorithm and its
parameters. The accuracy and running time of a simulation depend greatly on
the choice of algorithm and parameter set. This presents a challenge to non-experts, and guessing the parameter sets by trial and error can be a time-consuming
process.
Therefore, a recommender system would be helpful in this field, allowing users to
bypass the time-consuming preparation process and concentrate on the results of their
simulations. Most recommender systems developed so far select the best
algorithm for solving a particular mathematical
problem. Their focus tends to fall on finding the most effective way to perform numerical
analysis in homogeneous or heterogeneous parallel computing environments. However,
no recommender system has been developed for the field of molecular
modeling. As a result, we created MDSimAid to provide users with a friendly
interface to MD so that they no longer have to write their own configuration files to
prepare their molecular systems, and to select the best algorithm with optimal parameter
values so that users do not have to experiment with different combinations of
parameters. MDSimAid (i) helps set up the molecular systems with minimal effort from users, (ii)
analyzes performance data of the available algorithms and parameter sets, (iii) recommends
the best algorithm with optimal parameter values for the evaluation of electrostatics, and
(iv) assesses the recommendation provided to attain the desired accuracy and minimize the CPU
time required for the target simulations.
The Problem
The main idea behind MDSimAid is to choose, on behalf of users, the best algorithm and the
optimal parameter set for handling full electrostatic interactions. The evaluation of
full electrostatic forces is an important aspect of MD. These forces have long-range
effects and are critical in understanding molecular behavior. This is particularly true
because most biological molecules, such as proteins, DNA and membranes, are charged
systems and are naturally dominated by long-range interactions. These long-range
interactions play important roles in the stability \cite{YaHo92} and functionality of
biomolecules \cite{GiHo87}. They are also one of the dominant factors in determining
the conformations of proteins \cite{Shor92}. Without proper handling of electrostatic
forces, structures may begin to fall apart and the proteins will lose their
function. Therefore, it is important to consider the full electrostatic forces during
simulations in order to achieve stable trajectories. However, simulating electrostatic
interactions is difficult: it is computationally intensive and becomes a computational
bottleneck. It is even more difficult under periodic boundary conditions, and it
scales unfavorably as the number of atoms increases. With the simplest
implementation, in which every pair of atoms is evaluated directly, the computation time grows as $O(N^2)$.
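The quadratic cost of the simplest implementation can be seen in a minimal sketch of a direct pairwise Coulomb sum. This is an illustration only, not code from MDSimAid or ProtoMol: the function names, the open (vacuum) boundary, and the unit Coulomb constant are all assumptions made for the example.

```python
import math

def direct_coulomb_energy(positions, charges, coulomb_constant=1.0):
    """Sum k * q_i * q_j / r_ij over all unique atom pairs.

    Assumes open (vacuum) boundaries, no cutoff, and a caller-supplied
    Coulomb constant in whatever unit system the caller uses.
    """
    n = len(positions)
    energy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            energy += coulomb_constant * charges[i] * charges[j] / r
    return energy

def pair_count(n):
    """Number of pair interactions the double loop above evaluates."""
    return n * (n - 1) // 2
```

Because the loop visits N(N-1)/2 pairs, doubling the number of atoms roughly quadruples the work, which is the scaling that Ewald, PME and MG are designed to avoid.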
With more complicated algorithms, altering a single parameter can change the
outcome of the computations dramatically. The general trends of how CPU time and
accuracy are affected by varying some of the parameters in the Ewald, PME and MG
methods have been reviewed in the literature \cite{Pete95, DeHo98, STHa02}. However,
most of these analyses do not suggest a way to determine the specific values to be
used for a particular simulation system. They often supply only general equations
with unknown constants to show the relationships among the parameters. The
development of those equations and the identification of the constants are usually
implementation dependent. Hence, without the actual code on which the analysis was
performed, it may be difficult to repeat the analysis or make use of the information.
Understanding how varying a parameter affects a simulation is important,
but the most practical issue for users doing molecular simulation is the
numeric value they should actually use.
Moreover, each method can have multiple adjustable parameters that affect
its performance, which makes it even more difficult to find the optimal combination
of all the parameters for a given method. Determining the values by simple trial and
error can be formidable and inefficient. Therefore, MDSimAid is designed to
compare the methods for fast evaluation of electrostatic forces and to find the optimal
parameters by tuning their values. This provides a real-time analysis and a good
estimate of the time required by the simulations, based on the actual molecular system and
the computer architecture on which the molecular dynamics is performed.
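The selection rule described here, the fastest configuration that still meets the accuracy target, can be sketched as follows. The record layout, the function name and the numbers in `example_trials` are hypothetical and are not measurements from this study; in practice each record would come from timing a short run on the actual system and machine.

```python
def pick_fastest_within_tolerance(trials, target_error):
    """Return the trial with the smallest time among those whose
    relative error does not exceed target_error, or None if none qualify.

    Each trial is a dict with (assumed) keys 'method', 'params',
    'seconds' and 'rel_error'.
    """
    acceptable = [t for t in trials if t["rel_error"] <= target_error]
    if not acceptable:
        return None
    return min(acceptable, key=lambda t: t["seconds"])

# Made-up placeholder values for illustration only.
example_trials = [
    {"method": "PME", "params": {"order": 4, "spacing": 1.0},
     "seconds": 0.8, "rel_error": 5e-5},
    {"method": "PME", "params": {"order": 8, "spacing": 0.5},
     "seconds": 2.1, "rel_error": 4e-5},
    {"method": "MG", "params": {"cutoff": 8.0, "levels": 2},
     "seconds": 0.5, "rel_error": 3e-4},
]

best = pick_fastest_within_tolerance(example_trials, target_error=1e-4)
```

With these placeholder values, the MG entry is the fastest overall but is rejected at a target of 1e-4 because its error is too large, so the cheaper of the two PME entries is returned.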
The Overview of MDSimAid
MDSimAid is written in Python. Python was chosen because it is
not only a scripting language but also an object-oriented language. It has
advanced data structures that are not supported by some other scripting languages, and it
supports multiple programming paradigms, such as the procedural and modular paradigms.
This gives programs the flexibility to be extended and allows modules to be added to or
removed from a program without the need to recompile or re-link the entire program. At
the same time, Python is platform independent, and its syntax resembles pseudo-code, which
makes it highly readable and easy to understand even for beginning programmers. Many
programs related to molecular
dynamics and similar in nature to MDSimAid have also been written in Python. For example, the Python Molecule
Viewer (PMV) \cite{Sann99}, developed at the Scripps Research Institute, is an attempt to
integrate computation and visualization for molecular simulations \cite{SDCO98}.
Although MDSimAid does not handle visualization, the two programs are similar in
that each acts as the user interface, handling the users' commands and other I/O
while passing information to external programs for other functions. MDSimAid
handles parameter optimization while calling CHARMM \cite{BBOS83} to prepare the
molecules and \ProtoMol \cite{MaIz01} to carry out the simulation; PMV handles
molecule representation while using Amber \cite{BeCa89} and AutoDock
\cite{MGHH98} to perform molecular dynamics and docking calculations, respectively,
but it does not carry out parameter and algorithm optimization.
\begin{figure}
\centerline{\includegraphics[width=8cm]{mdsimaidgui.eps}}
\caption{A snapshot of MDSimAid}
\label{fig:gui}
\end{figure}
Our approach in designing MDSimAid is a combination of ideas from adaptive
software systems and recommender systems. Examples of adaptive
software include SALSA and ATLAS. SALSA is ...
ATLAS, which stands for Automatically Tuned Linear Algebra Software
\cite{WhPD00}, exploits the fact that any given operation can typically be
performed in many ways: it automatically optimizes the linear algebra routines available on a
given computer architecture. It uses empirical timings to choose the best method
for the architecture and thus can adapt to a new computer architecture in a matter of
hours, rather than requiring months or even years of experts' time, as would normally
be required with traditional methods. An example of a recommender system is
PYTHIA \cite{WHRJ96}, a knowledge-based system that selects an optimal
software and hardware combination to numerically solve a partial differential equation
(PDE) problem under the accuracy and time constraints imposed by the user.
Combining the characteristics of adaptive software and recommender systems,
MDSimAid starts by gathering information from its users and compares that
information with its stored knowledge and rules to recommend an initial parameter set. It then
automatically adjusts parameters to tune the algorithms at run time so that the molecular
simulations run more efficiently. Furthermore, MDSimAid is intended to be a user-friendly
interface that makes setting up molecular simulations as simple as possible. Users
need only a few clicks and minimal input to begin the process of
preparing the files for simulation and searching for the optimal method and parameters.
This is especially helpful to beginners in molecular dynamics and eliminates the need for
users to create or edit configuration files conforming to the appropriate format for
molecular simulations. A snapshot of the graphical user interface of MDSimAid is
shown in figure \ref{fig:gui}.
\begin{figure}[h]
\centerline{\includegraphics[width=8cm]{mdsimaid.eps}}
\caption{The design of MDSimAid}
\label{fig:mdsimaid}
\end{figure}
The basic algorithm of MDSimAid follows the generic simulation protocol
outlined in figure~\ref{fig:mdsimaid}. It mainly uses \ProtoMol to choose
the optimal method and parameters, supplemented by some functions
from CHARMM, not available in \ProtoMol, to prepare the molecular systems.
MDSimAid formats the files obtained by the users from the Protein Data Bank (PDB), a
worldwide repository for 3-D structures of biological molecules, to make the
files compatible with CHARMM. After reading the residue topology file (which
describes the properties of each amino acid residue, nucleotide and solvent
molecule), the parameter file (the numeric values needed to generate the geometries of
the molecules described), the protein sequence, the coordinates of the atoms and other
parameters, MDSimAid builds the protein using CHARMM by adding the missing
atoms according to the topology file and generates the corresponding Protein Structure
File (PSF). The PSF is specific to each protein and contains every bond, bond angle,
torsion angle, and improper torsion angle as well as the information needed to represent the
connectivity of the atoms in the protein molecule. The protein structure must
be created one segment at a time; combining all the segments gives the
entire structure.
With the initial coordinates from the PDB file and the protein structure from the PSF,
MDSimAid continues to use CHARMM to follow the generic protocol: minimizing
the energy of the protein molecules, heating the system to the desired temperature and
equilibrating the system in order to relieve any distorting forces. The
temperature is adjusted by scaling the velocities of the atoms accordingly. Once the
system is equilibrated, MDSimAid uses \ProtoMol, given the boundary condition
and the accuracy desired by the users, to choose the method that requires the shortest time for
evaluating the electrostatic forces while still achieving the target accuracy. The
methods implemented in \ProtoMol for periodic boundary conditions include the Ewald
method, the PME method and the MG method; the methods for vacuum include the
direct method, the cutoff method and the MG method. However, the cutoff method is not
considered because of its inaccuracy in computing forces and energies and its inability
to represent realistic molecular behavior.
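The temperature adjustment by velocity scaling mentioned above can be illustrated with a minimal sketch. It assumes the textbook relation between kinetic energy and temperature with 3N degrees of freedom (no constraints, no removal of center-of-mass motion) and a caller-supplied Boltzmann constant in consistent units. The function names are hypothetical, and this is not a description of CHARMM's actual heating routine.

```python
import math

def instantaneous_temperature(masses, velocities, k_b):
    """T = 2 * KE / (N_dof * k_B), with N_dof = 3N assumed."""
    kinetic = 0.5 * sum(
        m * sum(c * c for c in v) for m, v in zip(masses, velocities)
    )
    dof = 3 * len(masses)
    return 2.0 * kinetic / (dof * k_b)

def rescale_velocities(masses, velocities, target_temperature, k_b):
    """Scale every velocity by sqrt(T_target / T_current).

    Kinetic energy is quadratic in velocity, so this factor moves the
    instantaneous temperature exactly onto the target.
    """
    current = instantaneous_temperature(masses, velocities, k_b)
    factor = math.sqrt(target_temperature / current)
    return [tuple(factor * c for c in v) for v in velocities]
```

During heating, a routine of this kind would typically be applied repeatedly while the target temperature is raised in small steps; the step size and frequency are choices the text does not specify.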
The Empirical Studies
A series of TIP3P water models [] with the number of atoms, N, ranging from
10 to $10^6$ is used to test the performance of the algorithms. The results are used to
build the rules that guide MDSimAid in choosing the optimal parameters for running
molecular simulations using ProtoMol. (Water systems are chosen because a large
amount of research has already been done on simulations of water molecules, which allows
us to compare our results with published statistics.) Based on the boundary condition,
different configuration files for ProtoMol are generated to compare different methods
of evaluating electrostatic forces. With no boundary condition (i.e. in vacuum),
the relative error of MG is computed against the result of the direct method.
For periodic boundary conditions, Ewald is set to compute with the highest accuracy
that the implementation in ProtoMol allows, and the results of
PME and MG are compared against it to find their relative errors in the potential
energy.
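The comparison against a reference method can be written down compactly. The sketch below assumes the usual definition of relative error, |E_approx - E_ref| / |E_ref|; the text does not state the exact formula used, so both the definition and the function name should be read as assumptions.

```python
def relative_error(approximate, reference):
    """Relative error of an approximate energy against a reference value.

    The reference would be the direct method in vacuum, or a
    high-accuracy Ewald run under periodic boundary conditions.
    """
    if reference == 0.0:
        raise ValueError("reference energy must be non-zero")
    return abs(approximate - reference) / abs(reference)
```

For instance, an approximate potential energy of -100.01 against a reference of -100.0 (arbitrary units) gives a relative error of about 1e-4.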
The Determination of the Parameters
PME Method
From the results of varying the interpolation order, cutoff distance and grid size
when using PME, the general relationships between these parameters and N for ProtoMol
have been examined. Based on published results [], the B-spline interpolation function is
used, and interpolation orders 4, 6 and 8 are tested in all cases. The timing results show
that as the interpolation order increases, the CPU time per MD step also increases.
However, contrary to the results of Essmann et al. [], our relative error
measurements show that a higher interpolation order does not have a significant
effect on the relative error. Therefore, it is unwise to choose a high interpolation
order, since the additional CPU time is not repaid by a gain in accuracy.
The cutoff distance is adjusted so that the corresponding β, calculated internally
by ProtoMol from the desired accuracy and the system size N, is similar to the β value
used in Ewald, in order to achieve the target accuracy. The relationship between the cutoff
distance and N is shown in Figure . In addition to the above variations, three different
grid sizes, with spacings of 0.5Å, 1.0Å and 2.0Å, are tested. The
1.0Å and 2.0Å spacings differ only minutely in timing and relative error, whereas the
0.5Å spacing requires a longer time yet yields a similar relative error.
Furthermore, the cost in CPU time of achieving
higher accuracy is not as significant as in the other methods. Figure shows that
reducing the relative error from $10^{-4}$ to $10^{-6}$ does not require a substantial increase
in time. Therefore, PME seems to be an attractive choice of method for simulations
under periodic boundary conditions.
Multigrid Method
For MG, different combinations of cutoff distance, grid size and number of levels
of grids are tested. A set of cutoff distances, 6, 8, 10 and 12Å, and four different
levels of grids are chosen. Since the MG method is designed to use approximation to
save time, it is not logical to use MG when high accuracy is desired.
Therefore, to take full advantage of MG, the goal is to find the optimal
set of parameters that still achieves a relative error of $10^{-4}$.
Increasing the cutoff distance or the grid size is expected to increase the CPU
time required, and our simulation results confirm this. For example, in Figure with
80000 atoms, as the cutoff distance increases from 8Å to 10Å to 12Å and as the finest
grid size increases from $24 \times 24 \times 24$ Å$^3$ to $48 \times 48 \times 48$ Å$^3$, the time per MD step
increases regardless of the number of levels used. The relative error, however, does not
behave so predictably: it does not necessarily decrease for all N as the
cutoff distance increases. An example is shown in Figure . When the finest grid size is
set to 24, at two levels the 8Å cutoff gives the best accuracy, followed by 12Å and then
10Å. At three levels, the order of accuracy is what one would
expect, with 12Å the most accurate and 8Å the least. Moreover, when
the finest grid size is increased to 48, the relative error increases. Therefore, it is only
beneficial to increase the cutoff distance or grid size if the gain in accuracy
outweighs the cost in CPU time.
Because of the complexity of the inter-relationships among the parameters in MG,
there is no clear picture of how the number of levels used in the evaluation affects
accuracy and time. It can behave differently depending on the choice of the cutoff
distance and grid size used for the simulation, as discussed above. Therefore, it is
advantageous to have a tool like MDSimAid that can tune the parameters based on
real-time analysis.
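Because the MG parameters interact in ways that are hard to predict, one straightforward strategy is to enumerate every combination and measure each. The sketch below builds such a grid. The cutoff distances and finest grid sizes are taken from the text; the specific level counts are an assumption, since the text says only that four different levels were used, and the function and key names are hypothetical.

```python
from itertools import product

CUTOFFS_ANGSTROM = (6.0, 8.0, 10.0, 12.0)  # cutoff distances listed in the text
GRID_LEVELS = (1, 2, 3, 4)                 # ASSUMED values; the text gives only the count
FINEST_GRID_SIZES = (24, 48)               # finest grid sizes from the 80000-atom example

def mg_parameter_grid(cutoffs=CUTOFFS_ANGSTROM,
                      levels=GRID_LEVELS,
                      grids=FINEST_GRID_SIZES):
    """Return every (cutoff, levels, finest grid) combination as a dict."""
    return [
        {"cutoff": c, "levels": l, "finest_grid": g}
        for c, l, g in product(cutoffs, levels, grids)
    ]

candidates = mg_parameter_grid()
```

Each candidate would then be timed and checked for relative error on the actual system, and only those meeting the accuracy target kept. An exhaustive grid is the simplest choice and is affordable here because the parameter space is small.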
The Performance Evaluation Results
The MG method, being an $O(N)$ algorithm, shows better timing than
the direct method for all N in vacuum (Figure ). For simulations with periodic boundary
conditions, both MG and PME perform better than Ewald in all cases tested. This
result contradicts the analysis by Petersen [], which shows that there exists a
critical number N* such that Ewald is faster than PME for atom numbers N < N*.
The disagreement may be explained by differences in implementation and
in the methods of measuring CPU time.
After searching for the optimal combination of parameters for PME and MG
under periodic boundary conditions, the CPU time and relative error measurements
produce Figure , which shows that MG outperforms PME for all N < $10^6$ at
moderate accuracy (a relative error of $10^{-4}$ at best). When higher accuracy
(a relative error of $10^{-5}$ at best) is required, however, MG is superior to PME only for
systems of roughly 6000 or more atoms (Figure ).