MDSimAid: A Recommender for Optimizing Electrostatics Algorithms for Molecular Dynamics

Abstract

Using molecular dynamics (MD) to model the behavior of biological molecules is an important but difficult task. One needs to understand the theory from molecular biology or biochemistry, the numerical methods involved, and the operation of standard MD packages. We present a concrete example, a program written in Python called MDSimAid, that helps users prepare molecular systems for MD simulations with the goal of achieving the highest accuracy in the shortest time possible. MDSimAid incorporates ideas from recommender systems and adaptive software to simplify the tasks of setting up a computer model for MD and of selecting the optimal algorithms and parameter sets based on real-time analysis and data mining techniques. MDSimAid prepares files for input into CHARMM, a standard MD package, and interacts with ProtoMol, an object-oriented framework for MD, to select the best algorithm for handling full electrostatics. It then recommends the parameter set with which that algorithm attains the desired accuracy in the shortest time.

Introduction

Molecular modeling has become a popular field, and researchers want to take advantage of technology and computing power to answer more complex questions, such as how proteins move, instead of relying on conventional methods such as nuclear magnetic resonance (NMR) and X-ray crystallography, which provide only static views. However, performing molecular simulations can be complicated and often requires time and experience to master. Scientists often need to learn to work with the MD package and understand the keywords, structure and format of the configuration files, along with all the other minor but necessary details. This can be a daunting task, especially when most documentation is difficult to navigate, hard to search, and full of domain-specific terminology.
With the growing need to solve the N-body problem in MD with a time complexity better than $O(N^2)$, computer scientists and numerical analysts have developed better but unfortunately more complicated algorithms, such as Ewald [], particle-mesh Ewald (PME) [], and multigrid (MG) []. A thorough understanding of these algorithms and of the relationships among all their parameters comes only with time and experience, yet this knowledge is essential for selecting the algorithm and its parameters. The accuracy and running time of a simulation depend greatly on the choice of algorithm and parameter set. This presents a challenge to non-experts, and guessing the parameter set by trial and error can be time-consuming. A recommender system will therefore be helpful in this field, so that users can bypass the time-consuming preparation process, concentrate on the results of their simulations, and further their research.

Most recommender systems developed so far select the best algorithm for solving a particular mathematical problem. The focus tends to fall on finding the most effective way to perform numerical analysis in homogeneous or heterogeneous parallel computing environments. However, no recommender system has been developed for the field of molecular modeling. As a result, we decided to create MDSimAid. It provides a friendly interface to MD, so that users no longer have to write their own configuration files to prepare their molecular systems, and it selects the best algorithm with optimal parameter values, so that users do not have to experiment with different combinations of parameters.
MDSimAid (i) helps set up the molecular systems with minimal effort from the user, (ii) analyzes performance data for the available algorithms and parameter sets, (iii) recommends the best algorithm with optimal parameter values for the evaluation of electrostatics, and (iv) assesses the recommendation provided, so as to attain the desired accuracy and minimize the CPU time required for the target simulations.

The Problem

The main idea behind MDSimAid is to choose the best algorithm, with the optimal parameter set, for handling full electrostatic interactions. The evaluation of full electrostatic forces is an important aspect of MD. These forces have long-range effects and are critical to understanding molecular behavior. This is particularly true because most biological molecules, such as proteins, DNA and membranes, are charged systems and are naturally dominated by long-range interactions. These long-range interactions play important roles in the stability \cite{YaHo92} and functionality of biomolecules \cite{GiHo87}. They are also one of the dominant factors in determining the conformations of proteins \cite{Shor92}. Without proper handling of electrostatic forces, structures may begin to fall apart and the proteins will lose their functionality. It is therefore important to consider the full electrostatic forces during simulations in order to achieve stable trajectories. However, simulating electrostatic interactions is difficult: it is computationally intensive and leads to computational bottlenecks. It is even more difficult under periodic boundary conditions, and it scales unfavorably as the number of atoms increases. With the simplest implementation, the computation time is $O(N^2)$. With more complicated algorithms, altering a single parameter can change the outcome of the computation dramatically.
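The quadratic cost of the simplest implementation can be seen in a minimal sketch of direct pairwise summation in vacuum. This is an illustration only, not code from MDSimAid or ProtoMol: the function names are hypothetical, and reduced units with the Coulomb constant set to 1 are assumed.

```python
import math


def direct_coulomb_energy(positions, charges):
    """Sum q_i * q_j / r_ij over all unique pairs of particles.

    Assumes reduced units (Coulomb constant = 1) and no periodic
    boundary conditions. The nested loop visits N*(N-1)/2 pairs,
    which is the source of the O(N^2) cost.
    """
    n = len(positions)
    energy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            energy += charges[i] * charges[j] / r
    return energy


def pair_count(n):
    """Number of pair interactions evaluated for n particles."""
    return n * (n - 1) // 2
```

Doubling the number of atoms roughly quadruples `pair_count`, which is why faster methods such as Ewald, PME and MG become necessary for large systems.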
The general trends of how CPU time and accuracy are affected by varying some of the parameters in the Ewald, PME and MG methods have been reviewed in the literature \cite{Pete95, DeHo98, STHa02}. However, most of these analyses do not suggest a way to determine the specific values to use for a particular simulation system. They often supply only general equations, with unknown constants, that show the relationships among the parameters. The development of those equations and the identification of the constants are usually implementation dependent. Hence, without the actual code on which the analysis was performed, it may be difficult to repeat the analysis or make use of the information. Understanding how varying a parameter affects a simulation is important, but the most practical issue for users doing molecular simulation is the optimal numeric value they should use. Moreover, each method can have several adjustable parameters that affect its performance, and finding the optimal combination of all the parameters for a given method is harder still. Determining the values by simple trial and error can be formidable and inefficient. MDSimAid is therefore designed to compare the methods for fast evaluation of electrostatic forces and to find the optimal parameters by tuning their values. This provides a real-time analysis and a good estimate of the time required by the simulations, based on the actual molecular system and the computer architecture on which the molecular dynamics is performed.

The Overview of MDSimAid

MDSimAid is written in Python. Python was chosen because it is not only a scripting language but also an object-oriented language. It has advanced data structures that are not supported by other scripting languages, and it supports multiple programming paradigms, such as the procedural and modular paradigms.
This gives programs the flexibility to be extended and allows modules to be added to or removed from a program without the need to recompile or re-link the entire program. At the same time, Python is platform independent, and its pseudo-code-like syntax is highly readable and easy to understand even for beginners in programming. Many programs related to molecular dynamics and similar in nature to MDSimAid have also been written in Python. For example, the Python Molecule Viewer (PMV) \cite{Sann99}, developed at the Scripps Research Institute, is an attempt to integrate computation and visualization for molecular simulations \cite{SDCO98}. Although MDSimAid does not handle visualization, the two programs are similar in that they act as the user interface and handle the users' commands and other I/O while passing information to external programs for other functions. MDSimAid handles parameter optimization, calling CHARMM \cite{BBOS83} to prepare the molecules and \ProtoMol \cite{MaIz01} to carry out the simulation. PMV handles molecule representation, using Amber \cite{BeCa89} and AutoDock \cite{MGHH98} to perform molecular dynamics and docking calculations respectively, but it does not carry out parameter and algorithm optimization.

\begin{figure}
\centerline{\includegraphics[width=8cm]{mdsimaidgui.eps}}
\caption{A snapshot of MDSimAid}
\label{fig:gui}
\end{figure}

Our approach in designing MDSimAid combines ideas from adaptive software systems and recommender systems. Examples of adaptive software include SALSA and ATLAS. SALSA is ... ATLAS, which stands for Automatically Tuned Linear Algebra Software \cite{WhPD00}, exploits the fact that any given operation can typically be performed in many ways: it automatically optimizes the linear algebra routines available on a given computer architecture.
It uses empirical timings to choose the best method for the architecture, and thus it can adapt to a new computer architecture in a matter of hours, rather than requiring months or even years of experts' time, as traditional methods would normally require. An example of a recommender system is PYTHIA \cite{WHRJ96}, a knowledge-based system that selects an optimal software and hardware combination to numerically solve a partial differential equation (PDE) problem under the accuracy and time constraints imposed by the user. Combining the characteristics of adaptive software and recommender systems, MDSimAid starts by gathering information from its users and compares that information to its knowledge, or rules, to recommend an initial parameter set. It then automatically adjusts the parameters to tune the algorithms at run time so that the molecular simulations run more efficiently.

Furthermore, MDSimAid is intended to be a user-friendly interface that makes setting up molecular simulations as simple as possible. Users only need a few clicks and minimal input to begin preparing the files for simulation and searching for the optimal method and parameters. This is especially helpful to beginners in molecular dynamics, and it eliminates the need for users to create or edit configuration files conforming to the format required for molecular simulations. A snapshot of the graphical user interface of MDSimAid is shown in figure \ref{fig:gui}.

\begin{figure}[h]
\centerline{\includegraphics[width=8cm]{mdsimaid.eps}}
\caption{The design of MDSimAid}
\label{fig:mdsimaid}
\end{figure}

The basic algorithm of MDSimAid follows the generic simulation protocol outlined in figure~\ref{fig:mdsimaid}. It mainly uses \ProtoMol to choose the optimal method and parameters, supplemented by some functions from CHARMM that are not incorporated in \ProtoMol to prepare the molecular systems.
MDSimAid formats the files obtained by the users from the Protein Data Bank (PDB), a worldwide repository for 3-D structures of biological molecules, in order to make the files compatible with CHARMM. After reading the residue topology file (information that describes the properties of each amino acid residue, nucleotide and solvent molecule), the parameter file (the numeric values needed to generate the geometries of the molecules described), the protein sequence, the coordinates of the atoms and other parameters, MDSimAid builds the protein using CHARMM by adding the missing atoms according to the topology file and generates the corresponding Protein Structure File (PSF). The PSF is specific to each protein; it contains every bond, bond angle, torsion angle, and improper torsion angle, as well as the information needed to represent the connectivity of the atoms in the protein molecule. The protein structure has to be created one segment at a time, and the segments are then combined to give the entire structure.

With the initial coordinates from the PDB file and the protein structure from the PSF, MDSimAid continues to use CHARMM to follow the generic protocol: it minimizes the energy of the protein molecules, heats the system to the desired temperature, and equilibrates the system in order to relieve any distorting forces. The temperature is adjusted by scaling the velocities of the atoms accordingly. Once the system is equilibrated, MDSimAid uses \ProtoMol, based on the boundary condition and the accuracy desired by the users, to choose the method that requires the shortest time for fast evaluation of electrostatic forces while still achieving the target accuracy. The methods implemented in \ProtoMol for periodic boundary conditions include the Ewald method, the PME method and the MG method; the methods for vacuum include the direct method, the cutoff method and the MG method.
However, the cutoff method is not considered because it computes forces and energies inaccurately and cannot represent realistic molecular behavior.

The Empirical Studies

A series of TIP3P water models [] with the number of atoms, N, ranging from 10 to $10^6$ are used to test the performance of the algorithms. The results are used to build the rules that guide MDSimAid in choosing the optimal parameters for running molecular simulations with ProtoMol. (Water systems are chosen because a large amount of research has already been done on simulations of water molecules. This allows us to compare our results to the published statistics.) Based on the boundary condition, different configuration files for ProtoMol are generated, each using a different method for evaluating the electrostatic forces. For no boundary condition (i.e. in vacuum), the relative error of MG is computed with respect to the result of the direct method. For periodic boundary conditions, Ewald is set to compute with the highest accuracy that the implementation in ProtoMol allows, and its result is compared to those of PME and MG to find their relative errors in evaluating the potential energy.

The Determination of the Parameters

PME Method

By varying the interpolation order, cutoff distance and grid size when using PME, the general relationships between these parameters and N in ProtoMol have been examined. Based on published results [], the B-spline interpolation function is used, and interpolation orders 4, 6 and 8 are tested in all cases. The timing results show that as the interpolation order increases, the CPU time per MD step also increases. However, contrary to the results of Essmann et al. in [], our relative error measurements show that a higher interpolation order does not seem to have a significant effect on the relative error. It is therefore unwise to choose a high interpolation order when the extra CPU time is not compensated by a gain in accuracy.
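The text does not spell out the formula for the relative error used in these comparisons. A minimal sketch, assuming it is the absolute difference in potential energy divided by the magnitude of the reference energy (the direct method in vacuum, high-accuracy Ewald under periodic boundary conditions), could look as follows. The function name and the zero-reference check are illustrative assumptions.

```python
def relative_error(approx_energy, reference_energy):
    """Relative error of an approximate potential energy.

    Assumed definition: |E_approx - E_ref| / |E_ref|, where E_ref
    comes from the reference method (direct summation in vacuum,
    high-accuracy Ewald under periodic boundary conditions).
    """
    if reference_energy == 0.0:
        raise ValueError("reference energy must be non-zero")
    return abs(approx_energy - reference_energy) / abs(reference_energy)
```

For instance, an approximate energy of -100.01 against a reference of -100.0 corresponds to a relative error of about $10^{-4}$ under this definition.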
As for the cutoff distance, it is adjusted so that the corresponding β, calculated internally by ProtoMol from the desired accuracy and the system size N, is similar to the β value in Ewald, in order to achieve the target accuracy. The relationship between the cutoff distance and N is shown in Figure . Together with the above variations, three different grid sizes with spacings of 0.5Å, 1.0Å and 2.0Å are also tested. The 1.0Å and 2.0Å spacings differ only minutely in timing and relative error, whereas the 0.5Å spacing requires a longer time yet returns a similar relative error to the other two grid sizes. Furthermore, the cost in CPU time of achieving higher accuracy is not as significant as in other methods. Figure shows that reducing the relative error from $10^{-4}$ to $10^{-6}$ does not require a substantial increase in time. PME therefore seems to be an attractive choice of method for simulations under periodic boundary conditions.

Multigrid Method

For MG, different combinations of cutoff distance, grid size and number of grid levels are tested. A set of cutoff distances, 6, 8, 10 and 12Å, and four different numbers of grid levels are chosen. Since the MG method is designed to save time through approximation, it is not logical to use MG when high accuracy is desired. Therefore, to take full advantage of MG, the goal is to find the optimal set of parameters that still achieves a relative error of $10^{-4}$. Increasing the cutoff distance or grid size is expected to increase the CPU time required, and our simulation results confirm this. For example, in Figure with 80000 atoms, as the cutoff distance increases from 8Å to 10Å to 12Å and as the finest grid size increases from $24 \times 24 \times 24$ Å$^3$ to $48 \times 48 \times 48$ Å$^3$, the time per MD step increases regardless of the number of levels used. But this is not the case when the relative error is measured.
The relative error does not necessarily decrease for all N as the cutoff distance increases. An example is shown in Figure . When the finest grid size is set to 24, the 8Å cutoff gives the best accuracy at two levels, followed by 12Å and then 10Å. At three levels, however, the order is what one would expect, with 12Å the most accurate and 8Å the least. Moreover, when the finest grid size is increased to 48, the relative error increases. It is therefore beneficial to increase the cutoff distance or grid size only if the gain in accuracy outweighs the cost in CPU time. Because of the complexity of the inter-relationships among the parameters in MG, there is no clear picture of how the number of levels used affects accuracy and time. The behavior can differ depending on the choice of cutoff distance and grid size, as discussed above. It is therefore advantageous to have a tool like MDSimAid that can tune the parameters based on real-time analysis.

The Performance Evaluation Results

The MG method, being an $O(N)$ algorithm, is faster than the direct method for all N in vacuum (Figure ). For simulations with periodic boundary conditions, both MG and PME perform better than Ewald in all cases tested. This result contradicts the analysis by Petersen [], which shows that there exists a critical number N* such that Ewald is faster than PME for atom numbers N < N*. The disagreement may be due to differences in implementation and in the methods of measuring CPU time.
After searching for the optimal combination of parameters for PME and MG in periodic boundary conditions, the results in CPU time and relative error measurements produce Figure which shows that MG performs better than PME for all $N < 10^6$ with moderate accuracy ($10^{-4}$ relative error at best), but it is superior to PME only for systems of roughly 6000 or more atoms when higher accuracy ($10^{-5}$ relative error at best) is required (Figure ).
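The selection rule used throughout, choosing the fastest method and parameter set that still meets the target accuracy, can be sketched as follows. This is an illustrative sketch rather than MDSimAid's actual code: the `Trial` class, the `recommend` function and the parameter names are hypothetical, and the numbers in `EXAMPLE_TRIALS` are made-up placeholders, not measurements from this study.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    """One timed run of a method with a given parameter set (hypothetical)."""
    method: str
    seconds_per_step: float
    relative_error: float
    params: dict = field(default_factory=dict)


def recommend(trials, target_error):
    """Return the fastest trial whose relative error meets the target.

    Returns None when no trial is accurate enough.
    """
    feasible = [t for t in trials if t.relative_error <= target_error]
    if not feasible:
        return None
    return min(feasible, key=lambda t: t.seconds_per_step)


# Placeholder values for illustration only; NOT results from the paper.
EXAMPLE_TRIALS = [
    Trial("Ewald", 2.0, 1e-7),
    Trial("PME", 0.5, 1e-6, {"order": 4, "grid_spacing": 1.0}),
    Trial("MG", 0.3, 1e-4, {"cutoff": 8.0, "levels": 2}),
]
```

With these placeholder values, `recommend(EXAMPLE_TRIALS, 1e-4)` picks the MG trial, while tightening the target to `1e-5` excludes MG and picks PME. In practice the trial list would be filled from timed runs on the user's own molecular system and hardware.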