Optimization of Docking Conformations Using Grid Datafarm

METI Project Report
March 2003

Junwei Cao and Takumi Washio
C&C Research Laboratories, NEC Europe Ltd., Germany
{cao, washio}@ccrl-nece.de

1. Introduction

Grid computing originated as a new computing infrastructure for scientific research and cooperation and is becoming a mainstream technology for large-scale resource sharing and distributed application integration. Data access and management is one of the key services that a grid infrastructure must provide; related topics include data transfer, replication, and visualization.

Grid Datafarm (GFarm) is a Japanese national project that aims to design an infrastructure for global petascale data-intensive computing. GFarm tools and APIs handle large data files through both a single filesystem image and local file views. Although Grid Datafarm was originally motivated by high energy physics applications, it is a generic distributed I/O management and scheduling infrastructure that can support other data grid applications.

This report investigates how a biomedical simulation application developed at NEC for docking conformation optimization can use the GFarm infrastructure to handle large-scale simulation results. The result files of the application contain detailed conformation information and the corresponding energy values, and they can become very large as the problem size and the number of simulation iterations increase. Two cases are considered, the second of which is implemented more realistically.

This report documents some initially implemented tools, which are based on existing GFarm tools and APIs and can be provided to users with application-specific interfaces. The tools handle docking conformation result files and validate some key features of the GFarm environment. In the second case, the tools are evaluated on a small scale and some performance results are also included.
Some suggestions are also provided for future improvement of the GFarm toolkit.

2. Grid Datafarm Architecture

The Grid Datafarm architecture couples disk storage with high-performance computing processors to exploit local disk I/O and to avoid the bandwidth limitations of the global network in a grid environment. A GFarm metaserver with an LDAP-based metadata database provides management and information services so that the system presents a single file system.

One of the key features of the GFarm infrastructure is its support for both global and local file views. The global view eases file handling, and the local view makes file manipulation scalable. Both are used in the application-specific GFarm tools introduced later.

GFarm Tools and APIs

The Grid Datafarm implementation provides a series of tools and programming APIs. Apart from general file operation commands (e.g. gfls, gfwhere, gfreg), several example tools included in the GFarm release were found to be especially useful for the docking optimization application. These include:

gfimport_fixed
The command gfimport imports a file into the GFarm infrastructure by dividing it into same-sized fragments. If users want to define fixed-size blocks in the file and avoid divisions in the middle of a block, gfimport_fixed can be used instead; the file is then divided into fragments of almost the same size.

gfimport_text
If the file is a text file and the user wants to avoid divisions within lines when it is imported into the GFarm infrastructure, gfimport_text can be used; the file is then divided into fragments of almost the same size.

gfmpirun
The command gfmpirun runs MPI programs that handle GFarm files when the Datafarm infrastructure is installed in a small-scale cluster computing environment.
gfmpirun first looks up the locations of the corresponding file fragments and then transfers copies of the MPI program onto these file nodes. The current gfmpirun invokes mpirun through a system call and therefore relies on the specific MPI implementation. It has to be updated according to the configuration of the NEC MPI implementation described below.

GFarm and NEC MPI

The NEC MPI implementation is coupled with a cluster scheduling system, COSY. When the user invokes mpirun, the COSY scheduler automatically allocates resources to the user. However, the best way to guarantee the required resources is to reserve them via COSY in advance and to pass the resulting reservation ID (jid) to mpirun. gfmpirun has been updated to include these options specific to the NEC MPI and COSY implementation.

3. Optimization of Docking Conformations

Overview

This optimization code ultimately aims at finding the minimal-energy conformation for given pairs of proteins and ligands (molecules that become medicines). Such a tool can be used to screen candidate medicines by computer simulation before actual experiments are done. If this screening is fast and accurate, the development cycle of medicines can be shortened dramatically.

This type of minimization problem has a very large number of local minima, so many optimizations with different initial conformations must be performed. The optimization processes for different initial conformations are independent of one another. One can therefore run many optimization processes in parallel on distributed computers and save the resulting local-minimum conformations on distributed storage devices. After all processes are completed, one would want to rank all the produced data and then retrieve the lowest-energy conformations on one's own computer. GFarm can be used for such file manipulations.
The development of the optimization code is almost finished. The optimization code is independent of the energy and force calculation routines: users can couple their own molecular mechanics codes with it by supplying energies and forces for the conformations that the optimization code provides. This mechanism is shown in Fig. 1.

[Flowchart: Call opt(init(=0),iproc,x,ene,f,…,res); test iproc: if 0, compute energy (ene) and force (f); if 1, test the convergence criterion based on res (yes: stop; no: continue); Call opt(init(=1),iproc,x,ene,f,…,res); end]

Fig. 1. Flowchart of the optimization code

In most cases, bond lengths and bond angles of molecules are stiff. To accelerate convergence of the optimization, one therefore wants to allow movement only in torsion angles. This code supports such optimization; users can specify exactly where stretching, bending, or twisting is allowed.

In the test case, eight identical organic compounds (each consisting of 37 atoms) are docked from randomly generated initial conformations (see Fig. 2). Fig. 3 plots the energies of 1000 optimized conformations. As the figure shows, there are many local minima, so many local optimizations must be performed to obtain a nearly best conformation in terms of energy. Such optimizations will be done for many ligands, so the amount of data will grow in realistic situations.

The application result files are currently handled manually. After the optimization is finished, the user has to look through the energy values and find the index numbers of the items with minimum energy. The conformation information is then extracted according to the index number and saved into a separate file that can be used as input to visualization tools. This work aims to provide tools on top of the GFarm infrastructure that facilitate these steps.

Fig. 2. Initial conformation and conformation after the optimization
Fig. 3.
Energy plots for 1000 optimization processes

Data Formats

When the optimization is finished, two text files are produced: one records the detailed information on the docking conformations and the other records the corresponding energy values. The formats of the file data are illustrated in Fig. 4. How GFarm should be used to handle these files depends on these formats.

REMARK Conf. indx:
ATOM 1 cme DOC
ATOM 2 cr DOC
……
ATOM 37 h DOC
ATOM 38 cme DOC
ATOM 39 cr DOC
……
ATOM 295 h DOC
ATOM 296 h DOC
1 1 1 3.806 3.041 -0.121 0.857 0.166 1.077 1 2 2 3.321 12.736 12.919 -1.097 2.642 1.920 0.139 3.256 1.909 8 8 17.502 12.409 14.129 17.110 14.077 15.873
REMARK Conf. indx:
ATOM 1 cme DOC
……
REMARK Conf. indx:
ATOM 1 cme DOC
……
ATOM 296 h DOC
2 1 3.325 -0.862 -0.727 100 1 -0.143 -2.234 -0.464 8 14.197 13.748 11.206

1 270.474215588489 8.23462884877584 32.8818063762824 118.212317614894 -2.33626143043164 -6.730923029215533E-002
2 271.679717003881 8.23462884877540 32.8970978819788 117.755440564651 -0.927292643390787 1.910435142524919E-002
3 268.071669028748 8.23462884877595 32.8414811503397 119.196523738672 -5.16278691917328 -0.326044511801722
……
99 268.558249731899 8.23462884877596 32.8601637981316 118.801768872776 -4.57512654646738 -0.101779209448476
100 267.857273622203 8.23462884877486 32.8490472810645 118.440695791104 -4.79704329772594 -0.351354469096432

Fig. 4. Example result files

In Fig. 4, both files contain data for 100 items generated by optimization processes started from random initial conformations. Each line of an item in the first file describes the docking conformation information of one atom of a certain dock. The size of each line is fixed, so the total size of one item is determined by the total number of atoms involved in the application. In Fig. 4, the docking of 8 chemically identical molecules is considered, each composed of 37 atoms, so the total number of atoms is 8 × 37 = 296.
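Because every line has a fixed size, the byte offset of any item follows directly from its index. The following is a minimal sketch of that arithmetic, not the report's implementation. The line width of 80 bytes and the single header line per item are assumptions chosen only for illustration; neither value is stated in this report, and the function names are hypothetical.

```python
# Hypothetical layout parameters: the report states only that lines are fixed-size.
LINE_BYTES = 80          # assumed width of one line, newline included
HEADER_LINES = 1         # assumed: one "REMARK Conf. indx:" line per item


def item_size(n_atoms: int) -> int:
    """Bytes occupied by one conformation item (header plus one line per atom)."""
    return (HEADER_LINES + n_atoms) * LINE_BYTES


def item_offset(index: int, n_atoms: int) -> int:
    """Byte offset of the item with 1-based index `index` in the result file."""
    if index < 1:
        raise ValueError("item indices start at 1")
    return (index - 1) * item_size(n_atoms)


n_atoms = 8 * 37                      # 8 identical molecules of 37 atoms each
print(n_atoms)                        # 296
print(item_size(n_atoms))             # 23760 under the assumed line width
print(item_offset(100, n_atoms))      # start of the last of 100 items
```

With fixed-size items, a fragment boundary placed at any multiple of the item size never cuts an item in two, which is the property the import step has to preserve.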
The second file contains the energy values of each docking. The first column is the index number and the second is the total energy, which is the most important value and is used to rank the docking conformations. Each line corresponds to one item in the first file, and the size of each line is not fixed.

Both files (especially the first) can become quite large if more initial conformations are optimized (instead of the 100 in Fig. 4) and if the molecules are composed of more atoms (say, thousands). When GFarm is used to handle these files, fragment divisions within one item of data must be avoided.

GFarm Tools

gfopt_import
This tool imports the docking conformation result file into a GFarm file. The number of atoms in the application is given as input so that the tool can calculate the fixed size of each block automatically and use gfimport_fixed to avoid breaks in the middle of blocks.

gfopt_import_energies
This tool imports the energy result file into a GFarm file using gfimport_text. However, instead of dividing the file into fragments of almost the same size, it divides the file into fragments of almost the same number of lines, so that the division matches the one made by gfopt_import.

gfopt_export_energies
This is an MPI program that takes a GFarm energy file as input. It reads through the energy results in parallel and finds the index number with the minimum total energy value. It is executed using the updated gfmpirun, and the output index numbers can be used as inputs to gfopt_export.

gfopt_export
gfopt_export retrieves a specific item of data from a GFarm docking conformation result file, given its index number.
The tool calculates which file fragment stores the corresponding block of data, switches to the GFarm local file view, seeks to the beginning of the block (using the gfs_pio_seek API), and reads out a fixed-size block of data. The output can be saved in an individual file that visualization tools can use directly. The process benefits from GFarm's support for local file views: since only one of the file fragments is opened, the performance of the operation does not decrease much as the file size increases.

4. Screening with a Docking Optimization Code

Overview

The docking simulation code ultimately aims at finding ligands that dock tightly with a targeted protein. Such ligands are selected from libraries of chemical compounds that include a huge number of candidates. The main engine of our docking code is composed of two parts:

1. Energy and force calculations for given conformations based on a classical molecular dynamics model
2. Optimization of conformations to find local minima of energy

To take the solvent effect (water molecules surrounding the protein-ligand complex) into account, our molecular dynamics model applies an implicit solvent model. The electrostatic interaction between the solvent and the solute (the complex of protein and ligand) is computed by solving potential equations with an inhomogeneous dielectric, and the non-polar interaction between the solute and the solvent is assumed to be proportional to the molecular surface area.

In our optimization, we allow mobility only for the ligand, some important molecules, and the side chains of the protein in the active site (the region where the docking takes place). We exploit these conditions (the localized mobility and the implicit solvent model) to reduce the computational load, and as a result have achieved a feasible computational time for the local optimization.
Here, local optimization means finding a local minimum close to a given initial conformation. With the current implementation, one optimization run takes a few minutes on a Linux PC (Pentium 4), even when the solvent effect is computed with the implicit solvent model. To achieve such a feasible execution time, we have developed the following novel methods:

1. Constrained optimization techniques that reduce the number of optimization steps.
2. A multilevel optimization technique in which coarser grids are used to reduce the number of energy and force computations for the solvation effect on the fine grid.
3. Acceleration of the convergence of the above multilevel optimization iterations using a Krylov subspace technique.

Fig. 5 depicts an initial conformation and the resulting conformation obtained from it by our optimization code. In this example, the initial conformation is taken from experimental results, and the atoms of the ligand move 0.84 Angstrom on average during the optimization. Although the deformation is not large, the optimization reduces the energy dramatically (by about 100 kcal/mol from the first iteration to the last). Such an optimization is therefore needed to obtain a reasonable energy for the screening. Fig. 7 shows the convergence history of the energy for this example. The horizontal axis indicates the number of force evaluations. About one thousand evaluations were performed in this case. Of these, however, only about one hundred required evaluation of the solvation effects on the finest grid, thanks to the coarse-grid computations. This computation took 242 seconds on our Linux PC with a Pentium 4 processor.

Fig. 5. Conformations before and after the energy optimization
Fig. 6.
The conformation change from the initial one (the wires) to the optimized one (the sticks), including the mobile side chains and the mobile water molecules.
Fig. 7. Convergence history of the energy

Despite the above efforts to speed up the energy optimization, it is still not feasible or efficient to apply our code directly to all the possible candidates found in libraries, since such optimization problems have too many local minima. At the moment, our code alone offers no useful solution to this difficulty. However, we can use existing docking prediction codes, which produce a set of possible conformations that may be close to low-energy conformations. In general, these codes use simpler energy models and search conformations in coarsely discretized spaces to achieve high throughput. It therefore makes sense to use the conformations produced by such tools as the initial conformations for our optimization code and to evaluate the energies after optimization with our more accurate energy model. Furthermore, most existing docking prediction tools fix the protein structure during optimization, whereas our code allows mobility for any part of the protein, as can be seen in Fig. 6. We think this is also an important improvement for the energy evaluation, since the protein is flexible as well.

After all computations are completed for all the given conformations of all the candidates, we can pick candidate medicines by selecting those with the smallest binding energies at their best docking conformations. We call this the two-stage screening strategy. By also saving the atomic coordinates of all the optimized conformations, we can use them later to understand the detailed mechanisms of the protein-ligand interactions, and such knowledge might be exploited to design new compounds.
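The selection step of this strategy can be sketched as follows, assuming the optimized energies of each ligand are already available in memory. The function name screen, the ligand names, and all energy values are made up for illustration and do not come from the report.

```python
from typing import Dict, List, Tuple


def screen(energies: Dict[str, List[float]], keep: int) -> List[Tuple[str, float]]:
    """Rank ligands by the lowest energy among their optimized conformations."""
    # Best (lowest) energy per ligand; ligands without any result are skipped.
    best = {lig: min(vals) for lig, vals in energies.items() if vals}
    # Lower energy means tighter binding, so sort in ascending order.
    ranked = sorted(best.items(), key=lambda item: item[1])
    return ranked[:keep]


# Made-up energies (kcal/mol) for three hypothetical ligands.
optimized = {
    "lig1": [-250.1, -258.2, -255.0],
    "lig2": [-240.3, -243.9],
    "lig3": [-259.0, -251.7],
}
print(screen(optimized, keep=2))
```

Only the per-ligand minimum enters the ranking, so the full set of conformations needs to be read once and can otherwise stay in distributed storage.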
[Diagram: for each ligand (ligand 1, …, ligand k), a conventional docking prediction code produces initial conformations conf_init(lig1,1), conf_init(lig1,2), …, conf_init(lig1,n1) through conf_init(ligk,1), conf_init(ligk,2), …, conf_init(ligk,nk); the optimization code with the precise MD potential model turns each into conf_opt(lig1,1), …, conf_opt(ligk,nk) and calculates binding energies; selection by binding energy yields the screened candidates ligand i1, ligand i2, …, ligand im.]

Fig. 8. Work flow of the two-stage screening strategy

Fig. 8 shows an overview of the two-stage screening. The first docking prediction tool provides some conformations (perhaps up to 100) for each ligand. Then all conformations of all ligands are optimized by our code. At the end of each optimization, the computed conformation is saved with its energy in an output file. Only the coordinates of the mobile atoms (in the ligand, some molecules around the docking site, and some side chains of the protein) are saved. In the previous example, 5 KB is required to store one conformation of the mobile atoms, so 0.5 MB is needed to store one hundred conformations. If we need to compute about ten thousand ligands, 5 GB is needed to store all the conformations. As for computational time, one hundred optimizations might be done in 10 hours. Thus, with a thousand PCs, all optimizations for ten thousand ligands might be finished within 5 days.

Even though the data capacity required for one case (one targeted protein) is not very large, we may need to repeat the calculations with different parameter sets of the molecular dynamics model to confirm the results. It may then not be possible to save all the output files on one local storage device. It is therefore convenient to have tools that allow us to access specified data in output files distributed over a network. We discuss these issues in the following sections.
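The storage and time estimates above can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the text and assumes decimal units (1 KB = 1000 bytes) and perfect load balancing across the PCs with no scheduling or transfer overhead.

```python
BYTES_PER_CONFORMATION = 5_000      # about 5 KB per conformation (from the text)
CONFORMATIONS_PER_LIGAND = 100
LIGANDS = 10_000
HOURS_PER_LIGAND = 10               # one hundred optimizations on one PC
PCS = 1_000

per_ligand_bytes = BYTES_PER_CONFORMATION * CONFORMATIONS_PER_LIGAND
total_bytes = per_ligand_bytes * LIGANDS
# Assumes the ligands are spread evenly over the PCs.
wall_clock_hours = LIGANDS * HOURS_PER_LIGAND / PCS

print(per_ligand_bytes / 1e6)   # MB per ligand
print(total_bytes / 1e9)        # GB in total
print(wall_clock_hours / 24)    # days of wall-clock time
```

Under these assumptions the run needs 100 hours of wall-clock time, a little over four days, which is consistent with the estimate of "within 5 days".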
Data Formats

When the optimization is finished, one text file is produced for the example in the last section; it records the detailed docking conformation information for the mobile atoms together with the corresponding energy values. The format of the file data is illustrated in Fig. 9. How GFarm should be used to handle these files depends on this format.

526 56.350 1.623 37.399
537 57.303 -0.588 38.898
533 57.314 0.393 40.342
……
3070 60.164 -5.990 28.251
3071 60.946 -7.590 28.186
@EOM 1 -258.17478612
……
526 56.815 0.254 39.382
527 57.033 1.500 38.522
……
3071 61.060 -8.674 28.622
@EOM 12 -259.04448459
……

Fig. 9. An example result file

As Fig. 9 shows, each result file contains many items separated by marker lines beginning with “@EOM”. Each marker line also carries a counter and the corresponding energy. Each item consists of many lines, each giving the conformation information of one mobile atom.

The data processing problem is that drug design requires thousands of ligands to be tried, each resulting in such a file. The total amount of data can become very large as the number of ligands increases. GFarm is required to act as the data management infrastructure, and importing and exporting data are two basic tools that must be provided.

GFarm Tools

gfmed_import
This tool is built on top of gfimport_text and is designed to import multiple files of the kind shown in Fig. 9 into the GFarm infrastructure. The user specifies a list of file names; gfmed_import first combines all these separate files into one big file and then imports it as a GFarm file. To distinguish the data from different files, the name of the original file (usually the corresponding ligand name) is inserted after the “@EOM” mark of each item to indicate the source of the data item. For the convenience of gfmed_export, gfmed_import is designed not to split the data of one item across different file fragments.
Existing GFarm APIs cannot handle this requirement, since the size and the number of lines of each item vary, especially for data from different ligands. The solution is that, although gfmed_import still follows the almost-the-same-size principle, the import into one fragment does not stop until an “@EOM” mark has been passed.

gfmed_export
gfmed_export retrieves the data that belong to one ligand from a GFarm file containing multiple docking conformation results, given the name of the ligand. The tool parses the data and, whenever an “@EOM” mark is found, checks the ligand name that gfmed_import inserted after the mark. If the name matches, the whole data item is exported.

gfmed_import and gfmed_export can be used together to handle the large amounts of data produced by the drug design application described in the last section, especially when local data storage is insufficient. The output file of gfmed_export is exactly the same as the original source file before it was processed by gfmed_import.

Performance Evaluation

The gfmed_import and gfmed_export tools described above were tested on a small scale using a test bed built from a 6-node PC cluster at CCRL, NEC Germany, and a 4-node cluster at CRL, NEC Japan. Different numbers of GFarm file nodes and ligand files were selected, and the corresponding processing times were recorded, as shown in Figs. 10 and 11. A total of 320 ligand files were used in these experiments. Since each ligand conformation file is 0.5 MB in size, as mentioned before, the largest GFarm file involved is 160 MB. In total, 10 GFarm file nodes were deployed in turn: the 6 nodes at CCRL, NEC Germany, were used first, and when more than 6 nodes were needed, additional nodes from Japan were added. The GFarm metaserver ran at CCRL, NEC Germany, and the source ligand files were also stored on the CCRL cluster.
For both tools, the number of nodes has almost no effect on the execution time when the experiment is carried out within one cluster; the execution time increases only with the file size. The situation changes when nodes on the Japan cluster are involved. As more Japan nodes are involved, more data have to be transferred from Germany to Japan, which takes more time. These tools can therefore also serve as a benchmark for data transfer across the NEC Germany-Japan testbed.

Fig. 10. Experimental results: gfmed_import (time in seconds versus number of nodes, 1 to 10, for 20, 40, 80, 160, 240, and 320 ligands)

Fig. 11. Experimental results: gfmed_export (time in seconds versus number of nodes, 1 to 10, for 20, 40, 80, 160, 240, and 320 ligands)

5. Conclusions and Suggestions

GFarm is designed to provide data management, scheduling, and storage for data-intensive computing applications. From the initial experience described in this work, we found that some medical simulation applications can indeed benefit from the GFarm infrastructure. Since the GFarm tools are still under development, we expect the programming APIs to be extended further. For example, in the second case described in this work, a gfs_pio_grep function that returns the position of a particular string in a GFarm file fragment would be helpful, since the data are processed according to predefined marks.
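To make the suggestion concrete, the sketch below shows the behaviour such a function might have, written against an ordinary Python byte stream rather than a GFarm file fragment. It is only an illustration: gfs_pio_grep is a proposed function, not an existing one, and the name grep_offsets, its parameters, and the sample data layout are hypothetical.

```python
import io
from typing import BinaryIO, Iterator


def grep_offsets(stream: BinaryIO, mark: bytes, chunk_size: int = 65536) -> Iterator[int]:
    """Yield the byte offset of every occurrence of `mark` in `stream`."""
    if not mark:
        raise ValueError("mark must not be empty")
    base = 0          # file offset of the first byte held in `buffer`
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        start = 0
        while True:
            pos = buffer.find(mark, start)
            if pos < 0:
                break
            yield base + pos
            start = pos + 1
        # Keep a tail shorter than the mark so a match spanning two chunks is found.
        keep = len(mark) - 1
        drop = max(len(buffer) - keep, 0)
        base += drop
        buffer = buffer[drop:]


# Two items in the style of Fig. 9, each closed by an "@EOM" marker line.
data = (
    b"526 56.350 1.623 37.399\n@EOM 1 -258.17478612\n"
    b"526 56.815 0.254 39.382\n@EOM 12 -259.04448459\n"
)
print(list(grep_offsets(io.BytesIO(data), b"@EOM", chunk_size=8)))
```

Reading in chunks keeps memory use bounded for large fragments, and the offsets returned could be passed to a seek call to jump directly to the end of each item.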