Optimization of Docking Conformations
Using Grid Datafarm
METI Project Report
March 2003
Junwei Cao and Takumi Washio
C&C Research Laboratories, NEC Europe Ltd., Germany
{cao, washio}@ccrl-nece.de
1. Introduction
Grid computing originated as a new computing infrastructure for scientific research and cooperation and is becoming a mainstream technology for large-scale resource sharing and distributed application integration. Data access and management is one of the key services that must be provided by a grid infrastructure; related topics include data transfer, replication, visualization, and so on.
Grid Datafarm (GFarm) is a Japanese national project that aims to design an infrastructure for global petascale data-intensive computing. GFarm tools and APIs are provided to handle large data files in both a single-filesystem-image (global) view and local file views. While Grid Datafarm was originally motivated by high-energy physics applications, it is a generic distributed I/O management and scheduling infrastructure that can also support other data grid applications.
A biomedical simulation application for docking conformation optimization, developed at NEC, is investigated as a case study for using the GFarm infrastructure to handle large-scale simulation results. The result files of the application contain detailed conformation information and corresponding energy values, and can become very large as the problem size and the number of simulation iterations grow. In this report, two cases are considered, of which the second is implemented more realistically.
This report documents some initially implemented tools, which are built on existing GFarm tools and APIs and provide users with application-specific interfaces. Docking conformation result files can be handled with them, and several key features of the GFarm environment are validated. In the second case, the tools are evaluated at a small scale and some performance results are included. Suggestions for future improvement of the GFarm toolkit are also provided.
2. Grid Datafarm
Architecture
The Grid Datafarm architecture couples disk storage with high-performance computing processors to maximize local disk I/O bandwidth and avoid the bandwidth limitation of the global network in a grid environment. A GFarm metaserver with an LDAP-based metadata database provides management and information services to achieve a single-filesystem image.
One of the key features of the GFarm infrastructure is the support of both global and local file views. The global view eases file handling, while the local view makes file manipulation scalable. Both are used in the application-specific GFarm tools introduced later.
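As a concrete illustration, the sketch below opens a GFarm file and switches it from the default global view to the local view. This is a minimal sketch assuming the GFarm v1 C API (gfarm_initialize, gfs_pio_open, gfs_pio_set_view_local, gfs_pio_close); the file path is illustrative and the exact signatures should be checked against the GFarm release.

#include <stdio.h>
#include <gfarm/gfarm.h>   /* GFarm v1 C API header (assumed) */

int main(int argc, char **argv)
{
    char *e;      /* GFarm calls return an error string, NULL on success */
    GFS_File gf;

    e = gfarm_initialize(&argc, &argv);
    if (e != NULL) { fprintf(stderr, "initialize: %s\n", e); return 1; }

    /* Global view (the default): the fragments appear as one logical file. */
    e = gfs_pio_open("gfarm:docking/conf.dat", GFARM_FILE_RDONLY, &gf);
    if (e != NULL) { fprintf(stderr, "open: %s\n", e); return 1; }

    /* Local view: this process now sees only the fragment stored on its
     * own node, so I/O stays local and scales with the number of nodes. */
    e = gfs_pio_set_view_local(gf, 0);
    if (e != NULL) { fprintf(stderr, "set_view_local: %s\n", e); return 1; }

    /* ... process the local fragment with gfs_pio_read()/gfs_pio_seek() ... */

    gfs_pio_close(gf);
    gfarm_terminate();
    return 0;
}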
GFarm Tools and APIs
The Grid Datafarm implementation provides a series of tools and programming APIs. Apart from general file operation commands (e.g. gfls, gfwhere, gfreg), several example tools included in the GFarm release are found to be useful, especially when applied to the docking optimization application. These include:
gfimport_fixed
The command gfimport can be used to import a file into the GFarm infrastructure by dividing it into same-sized fragments. If users want to define fixed-size blocks in the file and avoid divisions in the middle of blocks, the gfimport_fixed command can be used instead; the file is then divided into almost-same-size fragments that respect block boundaries.
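The division rule can be illustrated with a small sketch: distribute the fixed-size blocks over the fragments as evenly as possible and cut only at block boundaries. This is a hypothetical helper, not the actual gfimport_fixed source; the block size in main() is an arbitrary example.

#include <stdio.h>

/* Print an almost-same-size division of `nblocks` fixed-size blocks
 * over `nfrags` fragments that never cuts inside a block. */
void fragment_sizes(long nblocks, long block_size, int nfrags)
{
    long base = nblocks / nfrags, extra = nblocks % nfrags;
    for (int i = 0; i < nfrags; i++) {
        long blocks_here = base + (i < extra ? 1 : 0);
        printf("fragment %d: %ld blocks = %ld bytes\n",
               i, blocks_here, blocks_here * block_size);
    }
}

int main(void)
{
    fragment_sizes(100, 4096, 8);  /* 100 blocks of 4 KB over 8 fragments */
    return 0;
}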
gfimport_text
If the file in question is a text file and the user wants to avoid divisions within lines when it is imported into the GFarm infrastructure, the tool gfimport_text can be used; the file is then divided into almost-same-size fragments at line boundaries.
gfmpirun
The command gfmpirun is used to run MPI programs that handle GFarm files when the Datafarm infrastructure is installed in a small-scale cluster computing environment. gfmpirun first looks up the locations of the corresponding file fragments and then dispatches copies of the MPI program onto these file nodes. The current gfmpirun makes a system call to mpirun and therefore relies on the specific MPI implementation; it has to be adapted to the configuration of the NEC MPI implementation described below.
GFarm and NEC MPI
The NEC MPI implementation is tightly coupled with a cluster scheduling system, COSY. When the user invokes mpirun, the COSY scheduler automatically issues a resource allocation. However, the best way to guarantee the required resources is to reserve them via COSY in advance and pass the reservation ID jid as an additional option to mpirun. gfmpirun has been updated to support these NEC MPI and COSY specific options.
3. Optimization of Docking Conformations
Overview
This optimization code ultimately aims at finding the minimal-energy conformation for given pairs of proteins and ligands (molecules which may become medicines). Such a tool can be used for screening medicine candidates by computer simulation before doing actual experiments. By making this screening process fast and accurate, the development cycle of medicines can be dramatically shortened. For this type of minimization problem, there is a quite large number of local minima; thus, many optimizations with different initial conformations must take place. Optimization processes for different initial conformations can be performed independently of each other. One can therefore run many optimization processes in parallel on distributed computers and save the resulting local-minimum conformations on distributed storage devices. After all processes are completed, one wants to rank all produced data and pick up the top conformations (counted from the minimum energy) on one's own computer. For such file manipulations, GFarm can be used.
The development of the optimization code is almost finished. The optimization code is independent of the energy and force calculation routines: users can easily couple their molecular mechanics codes with the optimization code by supplying energies and forces for the conformations provided by the optimization code. This mechanism is depicted in Fig. 1.
[Fig. 1 flowchart: the driver calls opt(init(=0),iproc,x,ene,f,…,res); if iproc = 0 it computes the energy ene and the force f and calls opt(init(=1),iproc,x,ene,f,…,res); if iproc = 1 it checks the convergence criterion based on res and either stops (converged) or calls opt(init(=1),…) again.]
Fig. 1. Flowchart of the optimization code
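For readers who want to couple their own molecular mechanics code, the driver loop of Fig. 1 can be sketched as follows. The routine opt() and its argument list are taken from the figure; the concrete C types, the user routine energy_and_force(), and the tolerance are assumptions for illustration.

#include <stdio.h>

#define N 888  /* 296 atoms x 3 coordinates, as in the test case below */

/* provided by the optimization library (interface per Fig. 1) */
void opt(int init, int *iproc, double x[], double *ene,
         double f[], double *res);

/* user-supplied molecular mechanics routine (assumed name) */
void energy_and_force(const double x[], double *ene, double f[]);

int main(void)
{
    double x[N], ene, f[N], res;
    int iproc;
    const double tol = 1e-6;          /* assumed convergence criterion */

    /* ... fill x[] with the initial conformation ... */
    opt(0, &iproc, x, &ene, f, &res); /* init = 0: first call */
    for (;;) {
        if (iproc == 0) {             /* optimizer asks for ene and f */
            energy_and_force(x, &ene, f);
        } else {                      /* iproc == 1: test convergence */
            if (res < tol)
                break;                /* converged: stop */
        }
        opt(1, &iproc, x, &ene, f, &res); /* init = 1: continue */
    }
    printf("minimum energy = %f\n", ene);
    return 0;
}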
In most cases, bond lengths and bond angles of molecules are stiff. Thus, in order to accelerate the convergence of the optimization process, one wants to allow movement only in the torsion angles. Such optimization is possible with this code: users can specify exactly where they want to allow stretching, bending, or twisting.
In the test case, eight identical organic compounds (each consisting of 37 atoms) are docked from randomly generated initial conformations (see Fig. 2). In Fig. 3, the energies are plotted for 1000 optimized conformations. As the figure shows, there are many local minima, so many local optimizations must take place to obtain a nearly optimal conformation in terms of energy. Such optimizations will be performed for many ligands, so the amount of data grows large in realistic situations.
The application result files are currently handled manually. After the optimization is finished, the user needs to look through the energy values and find the item index numbers with minimum energy. The conformation information is then picked up according to the index number and saved into a separate file that can serve as an input file for visualization tools. This work aims to provide tools on top of the GFarm infrastructure to facilitate these processes.
Fig. 2. Initial conformation and after the optimization
Fig. 3. Energy plots for 1000 optimization processes
Data Formats
When the optimization is finished, two text files are produced, recording all the detailed information on the docking conformations and the corresponding energy values. The formats of the file data are illustrated in Fig. 4. The way GFarm should be used to handle these files is dictated by these formats.
First file (docking conformations):
REMARK Conf. indx:     1
ATOM      1 cme DOC    1     3.806    3.041   -0.121
ATOM      2 cr  DOC    1     0.857    0.166    1.077
......
ATOM    295 h   DOC    8    17.502   12.409   14.129
ATOM    296 h   DOC    8    17.110   14.077   15.873
REMARK Conf. indx:     2
ATOM      1 cme DOC    1     3.321   12.736   12.919
......
REMARK Conf. indx:   100
ATOM      1 cme DOC    1     3.325   -0.862   -0.727
......
ATOM    296 h   DOC    8    14.197   13.748   11.206
Second file (energy values):
1 270.474215588489 8.23462884877584 32.8818063762824 118.212317614894 -2.33626143043164 -6.730923029215533E-002
2 271.679717003881 8.23462884877540 32.8970978819788 117.755440564651 -0.927292643390787 1.910435142524919E-002
3 268.071669028748 8.23462884877595 32.8414811503397 119.196523738672 -5.16278691917328 -0.326044511801722
……
99 268.558249731899 8.23462884877596 32.8601637981316 118.801768872776 -4.57512654646738 -0.101779209448476
100 267.857273622203 8.23462884877486 32.8490472810645 118.440695791104 -4.79704329772594 -0.351354469096432
Fig. 4. Example result files
In Fig. 4, both files contain data for 100 items generated by optimization processes with randomly chosen initial conformations. Each line of an item in the first file describes the docking conformation information of one atom of a certain dock. The size of each line is fixed; therefore the total size of one item is determined by the total number of atoms involved in the application. In Fig. 4, a docking of 8 chemically identical molecules is considered, each composed of 37 atoms, so the total number of atoms is 296.
The second file includes the energy values of each docking. The first column is the index number and the second is the total energy value, which is the most important and is used for ranking the docking conformations. Each line corresponds to one item in the first file, and the size of each line is not fixed.
The two files can become quite large (especially the first one) if more initial conformations are optimized (instead of the 100 in Fig. 4) and if the molecules are composed of more atoms (say, thousands). When GFarm is used to handle these files, fragment divisions within one item of data must be avoided.
GFarm Tools
gfopt_import
This tool imports the docking conformation result file into a GFarm file. The number of atoms in the application is given as input so that the tool can compute the fixed size of each block automatically and use gfimport_fixed to avoid breaks in the middle of blocks.
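A sketch of that computation: one block is a REMARK line plus one fixed-width ATOM line per atom. The line widths below are assumed placeholders; the real tool derives them from the actual file format.

#include <stdio.h>

#define REMARK_LEN 20L  /* assumed width of a "REMARK Conf. indx:" line */
#define ATOM_LEN   64L  /* assumed fixed width of one ATOM line         */

/* Size in bytes of one item (one docking conformation). */
long item_size(long natoms)
{
    return REMARK_LEN + natoms * ATOM_LEN;
}

int main(void)
{
    long natoms = 8 * 37;  /* 8 identical molecules of 37 atoms = 296 */
    printf("block size: %ld bytes per item\n", item_size(natoms));
    /* this value is what gfimport_fixed needs to avoid splitting items */
    return 0;
}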
gfopt_import_energies
This tool imports the energy result file into a GFarm file. gfimport_text is utilized; however, instead of the almost-same-size division, an almost-same-number-of-lines fragment division strategy is used, so that the division matches the one performed by gfopt_import.
gfopt_export_energies
This is an MPI program that takes a GFarm energy file as input. The program reads through the energy results in parallel and finds the index number with the minimum total energy value. It is executed using the updated gfmpirun, and the output index numbers can be used as inputs to gfopt_export.
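The parallel minimum search maps naturally onto MPI's MINLOC reduction, as in the sketch below. The reading of the local fragment (in GFarm local file view) is elided; only the reduction step is shown, and the parsing details are assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* per-rank minimum; layout matches MPI_DOUBLE_INT */
    struct { double val; int idx; } local = { 1e30, -1 }, global;

    /* ... open the energy file in GFarm local view and, for every
     *     line "index energy ...", keep the smallest energy:
     *     if (energy < local.val) { local.val = energy; local.idx = index; }
     */

    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MINLOC,
                  MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("minimum energy %g at item %d\n", global.val, global.idx);

    MPI_Finalize();
    return 0;
}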
gfopt_export
gfopt_export retrieves a specific item of data from a GFarm docking conformation result file, given its index number. The tool computes the location of the file fragment where the corresponding block of data is stored, switches to the GFarm local file view, seeks to the starting position (using the gfs_pio_seek API), and reads out a fixed-size block of data. The output can be saved in an individual file that can be used directly by visualization tools.
The process benefits from GFarm's support for local file views: since only one of the file fragments is opened, the performance of the operation degrades little as the file size increases.
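The fragment lookup can be sketched as follows. gfs_pio_seek is named above; the fragment layout table and the remaining steps are assumptions for illustration.

/* items_per_frag[i] = number of items stored in fragment i, as
 * produced by the gfopt_import division (assumed to be known). */
void locate(long index, const long items_per_frag[], int nfrags,
            long item_size, int *frag, long *offset)
{
    long first = 0;            /* index of the first item in fragment i */
    for (int i = 0; i < nfrags; i++) {
        if (index < first + items_per_frag[i]) {
            *frag = i;
            *offset = (index - first) * item_size;
            return;            /* found: fragment and byte offset */
        }
        first += items_per_frag[i];
    }
    *frag = -1;                /* index out of range */
}
/* After locating: switch to the view of fragment *frag, use
 * gfs_pio_seek() to move to *offset from the start of the fragment,
 * and read item_size bytes -- that single fragment is the only one
 * touched. */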
4. Screening with a Docking Optimization Code
Overview
The docking simulation code ultimately aims at finding ligands which dock tightly with a targeted protein. Such ligands are selected from libraries of chemical compounds which contain a huge number of candidates. The main engine of our docking code is composed of two parts:
1. Energy and force calculations for given conformations, based on a classical molecular dynamics model.
2. Optimization of conformations to find local minima of the energy.
In our molecular dynamics model, to take the solvent effect (water molecules surrounding the protein-ligand complex) into account, we apply an implicit solvent model: the electrostatic interaction between the solvent and the solute (the complex of protein and ligand) is computed by solving potential equations with an inhomogeneous dielectric, and the non-polar interaction between the solute and the solvent is taken into account by assuming that it is proportional to the molecular surface area. In our optimization, we allow mobility only for the ligand, some important molecules, and the side chains of the protein in the active site (the region where the docking takes place). We exploited these conditions (the localized mobility and the implicit solvent model) to reduce the computational load, and we have been able to achieve a feasible computational time for the local optimization. Here, local optimization means finding a local minimum close to a given initial conformation. One shot of the optimization takes around a few minutes on a Linux PC (Pentium 4) in the current implementation, even when the solvent effect computation with the implicit solvent model is included. To achieve such a feasible execution time, we have developed the following novel methods:
1. Constrained optimization techniques to reduce the number of optimization steps.
2. A multilevel optimization technique in which the number of energy and force computations due to the solvation effect on the fine grid is reduced by the use of coarser grids.
3. Acceleration of the convergence of the above multilevel optimization iterations using a Krylov subspace technique.
An initial conformation and the resultant conformation obtained from it by our optimization code are depicted in Fig. 5. In this example, the initial conformation is taken from experimental results; the atoms of the ligand move 0.84 Angstrom on average during optimization. Though the deformation is not large, the optimization reduces the energy dramatically (the reduction from the first iteration to the last is about 100 kcal/mol). Thus, to obtain a reasonable energy for the screening, such an optimization is needed. In Fig. 7, the convergence history of the energy for this example is given; the horizontal axis indicates the number of force evaluations. About one thousand evaluations were performed in this case, but among them only one hundred evaluations of the solvation effects on the finest grid were needed, thanks to the coarse-grid computations. This computation consumed 242 seconds on our Linux PC with a Pentium 4 processor.
Fig. 5. Conformations before and after the energy optimization
Fig. 6. The conformation change between the initial conformation (wires) and the
optimized one (sticks), including the mobile side chains and the mobile water molecules.
Fig. 7. Convergence history of the energy
Though we have made the above-mentioned efforts to achieve fast computation of the energy optimization, it is still not feasible or efficient enough to apply our code directly to all possible candidates found in libraries, since such optimization problems have too many local minima. At the moment, we do not have a useful way to resolve this difficulty within our code alone. However, we can use existing docking prediction codes, which produce sets of possible conformations that may be close to conformations of small energy. In general, these codes use simpler energy models and search conformations in roughly discretized spaces to achieve high throughput. Thus, it makes sense to use the conformations given by such tools as the initial conformations for our optimization code and to evaluate the energies after the optimizations with our more accurate energy model. Furthermore, most existing docking prediction tools fix the protein structure during optimization, whereas our code can allow mobility for any part of the protein, as can be seen in Fig. 6. We consider this an important improvement for the energy evaluation, since the protein is also flexible. After all computations are completed for all the given conformations of all the candidates, we can pick up medicine candidates by selecting the ones with the smallest binding energies at their best docking conformations. We call this the two-stage screening strategy.
By also saving the atomic coordinates of all the optimized conformations, we can use them later to understand detailed mechanisms of the protein-ligand interactions, and such knowledge might be exploited to design new compounds.
[Fig. 8 workflow: a conventional docking prediction code generates, for each ligand k (ligand 1, …, ligand i1, ligand i2, …, ligand im), the initial conformations conf_init(ligk,1) … conf_init(ligk,nk); the optimization code with the precise MD potential model computes conf_opt(ligk,1) … conf_opt(ligk,nk) together with the binding energies; selection with the binding energies then yields the screened candidates.]
Fig. 8. Workflow of the two-stage screening strategy
Fig. 8 illustrates the two-stage screening. With the first docking prediction tool, some conformations (perhaps up to 100) are given for each ligand. Then all conformations of all ligands are optimized by our code. At the end of each optimization, the computed conformation is saved with its energy in an output file. Only the coordinates of the mobile atoms (in the ligand, some molecules around the docking site, and some side chains of the protein) are saved. In the previous example, 5 KB of capacity is required to store one conformation of the mobile atoms, so 0.5 MB is needed to store one hundred conformations. If we need to compute about ten thousand ligands, 5 GB of capacity is needed to store all the conformations. As for the computational time, one hundred optimizations might be done in 10 hours; thus, with a thousand PCs, all optimizations for ten thousand ligands might be finished within 5 days. Even though the required data capacity is not so large for one case (one targeted protein), we may need to perform calculations over different parameter sets of the molecular dynamics model to confirm the obtained results. Thus, it may not be possible to save all the output files on one local storage device, and it is convenient to have tools that allow us to access specified data in output files distributed over a network. The following sections discuss these issues.
Data Formats
When the optimization is finished, a text file is produced, recording all the detailed information on the docking conformations of the mobile atoms together with the corresponding energy values for the example in the last section. The format of the file is illustrated in Fig. 9. The way GFarm should be used to handle these files is dictated by this format.
526 56.350 1.623 37.399
537 57.303 -0.588 38.898
533 57.314 0.393 40.342
……
3070 60.164 -5.990 28.251
3071 60.946 -7.590 28.186
@EOM
1 -258.17478612
……
526 56.815 0.254 39.382
527 57.033 1.500 38.522
……
3071 61.060 -8.674 28.622
@EOM
12 -259.04448459
……
Fig. 9. An example result file
In Fig. 9, each result file contains many items separated by the marker line "@EOM"; each separator is accompanied by a counter and the corresponding energy value. Each item consists of many lines, each holding the conformation information of one mobile atom. The data processing challenge is that drug design requires thousands of ligands to be tried, each resulting in such a file, so the total amount of data can become very large as the number of ligands increases. GFarm is required as the data management infrastructure, and importing and exporting data are the two basic operations that must be provided.
GFarm Tools
gfmed_import
This tool, built on top of gfimport_text, is designed to import multiple files of the form shown in Fig. 9 into the GFarm infrastructure. The user can specify a list of file names; gfmed_import first combines all of these separate files into one big file and then imports it as a GFarm file. In order to distinguish the data coming from different files, the file name of each original file (usually the corresponding ligand name) is inserted after the "@EOM" mark of each item to indicate the source of the data item.
For the convenience of gfmed_export, gfmed_import is designed not to split data of the same item across different file fragments. The existing GFarm APIs cannot handle this, since the size and the number of lines of each item vary, especially for data from different ligands. The solution is that, although gfmed_import still follows the almost-same-size principle, the import into one fragment does not stop until an "@EOM" mark has been passed.
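The rule can be sketched as follows: copy lines into the current fragment and, once the size target is reached, keep copying until the next "@EOM" separator (and its counter/energy line) has been written. This is a schematic helper over plain stdio, not the actual gfmed_import source; the handling of the inserted ligand-name line is elided.

#include <stdio.h>
#include <string.h>

/* Copy lines from `in` to `out` until at least `target` bytes are
 * written and the last thing written closes an item ("@EOM" plus the
 * counter/energy line that follows it). Returns bytes written. */
long copy_fragment(FILE *in, FILE *out, long target)
{
    char line[1024];
    long written = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        fputs(line, out);
        written += (long)strlen(line);
        if (strncmp(line, "@EOM", 4) == 0) {
            if (fgets(line, sizeof line, in) != NULL) { /* counter/energy */
                fputs(line, out);
                written += (long)strlen(line);
            }
            if (written >= target)
                break;   /* a fragment may end only on an item boundary */
        }
    }
    return written;
}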
gfmed_export
gfmed_export retrieves the data belonging to one ligand from a GFarm file holding multiple docking conformation results, given the name of the ligand. The tool parses the data and, once an "@EOM" mark is found, checks the ligand name that gfmed_import inserted after the mark. If it matches, the whole data item is exported.
The gfmed_import and gfmed_export tools can be used together to handle the large amounts of data produced by the drug design application described in the last section, especially when local data storage is insufficient. The output file of gfmed_export is exactly the same as the original source file before it was processed by gfmed_import.
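The scan can be sketched as follows, assuming for simplicity that gfmed_import puts the ligand name on the "@EOM" line itself. Again a schematic helper over plain stdio, not the actual tool.

#include <stdio.h>
#include <string.h>

/* Write to `out` every item of `in` whose "@EOM" separator carries
 * the given ligand name. Items larger than the buffer are truncated
 * in this sketch. */
void export_ligand(FILE *in, FILE *out, const char *ligand)
{
    static char item[1 << 20];  /* buffer for one item (1 MB) */
    char line[1024];
    size_t len = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (strncmp(line, "@EOM", 4) == 0) {
            if (strstr(line, ligand) != NULL)  /* inserted name matches? */
                fwrite(item, 1, len, out);     /* export the whole item */
            len = 0;                           /* start buffering the next */
        } else if (len + strlen(line) < sizeof item) {
            memcpy(item + len, line, strlen(line));
            len += strlen(line);
        }
    }
}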
Performance Evaluation
The gfmed_import and gfmed_export tools described above were tested at a small scale on a testbed built from a 6-node PC cluster at CCRL NEC Germany and a 4-node cluster at CRL NEC Japan. Different numbers of GFarm file nodes and ligand files were selected, and the corresponding processing times were recorded, as illustrated in Figs. 10 and 11. In these experiments, a total of 320 ligand files is used; since each ligand conformation file is 0.5 MB in size as mentioned before, the largest GFarm file involved is 160 MB. In total, 10 GFarm file nodes were deployed in turn: first the 6 nodes at CCRL NEC Germany were used, and when the number of nodes exceeded 6, additional nodes from Japan were added. The GFarm metaserver ran at CCRL NEC Germany, and the source ligand files were also stored on the CCRL cluster.
For both tools, the number of nodes has almost no effect on the execution time when the experiment is carried out within one cluster; the execution time increases only with the file size. The situation changes when nodes of the Japan cluster become involved: the more Japan nodes are used, the more data must be transferred from Germany to Japan, which consumes more time. These tools can therefore also be used to benchmark the data transfer across the NEC Germany-Japan testbed.
[Plot: gfmed_import execution time in seconds (0-1200) versus number of nodes (1-10), with one curve each for 20, 40, 80, 160, 240, and 320 ligands.]
Fig. 10. Experimental results: gfmed_import
[Plot: gfmed_export execution time in seconds (0-1000) versus number of nodes (1-10), with one curve each for 20, 40, 80, 160, 240, and 320 ligands.]
Fig. 11. Experimental results: gfmed_export
5. Conclusions and Suggestions
GFarm is designed for data-intensive computing applications, providing data management, scheduling, and storage. From the initial experiences described in this work, we found that medical simulation applications can indeed benefit from the GFarm infrastructure.
Since the GFarm tools are still under development, the programming APIs are expected to be extended further. For example, in the second case described in this work, a gfs_pio_grep function that returns the position of a particular string in a GFarm file fragment would be helpful, since the data are processed according to predefined marks.
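To make the suggestion concrete, such a function might take the following shape; both the name and the signature are purely hypothetical and do not exist in the current GFarm release.

/* Hypothetical API sketch: search a file fragment (opened in local
 * view) for `pattern`, starting at `from`, and return the offset of
 * the first match in *pos. Returns an error string, NULL on success,
 * following the GFarm convention. */
char *gfs_pio_grep(GFS_File gf, const char *pattern,
                   file_offset_t from, file_offset_t *pos);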