Improvement of the Precorrected-FFT implementation of Biomolecule
Electrostatics Simulation
by
Meng-Jiao Wu
Submitted to the Department of Electrical Engineering and Computer Science in Partial
Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical
Engineering and Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
May 19, 2003
Copyright 2003 M.I.T. All rights reserved.
Author
Department of Electrical Engineering and Computer Science
May 13, 2003
Certified by
Jacob K. White
Professor
Thesis Supervisor
Accepted by
Arthur C. Smith
Chairman, Department Committee on Graduate Theses
Improvement of the Precorrected-FFT implementation of Biomolecule
Electrostatics Simulation
by
Meng-Jiao Wu
Submitted to the Department of Electrical Engineering and Computer Science in Partial
Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical
Engineering and Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
May 19, 2003
ABSTRACT
The performance of the original implementation of biomolecule electrostatic simulation
using the precorrected-FFT method can be improved via both pre-processing input data
and modifications to the implementation. Techniques used for these improvements are
described and the resulting effects are analyzed. Potential areas for further speedup are
discussed.
Thesis Supervisor: Jacob K. White
Title: Professor
Acknowledgements
First of all, I would like to thank Prof. Jacob White for giving me the opportunity to
work on this project and for his advice and support.
I must thank Michael Altman and Shihhsien Kuo for numerous discussions, ideas, and
guidance.
I also want to thank Ben Song and Zhenhai Zhu for answering my endless questions
on pfft++ and helping me with many technical difficulties.
I would like to acknowledge Michal Rewienski for helping me with setting up the
Intel Compiler and related issues.
Also, I would like to thank Chad Collins for providing support on software and
facilities.
Thanks should go to many others in the group who constantly give me encouragement
and entertainment, including but not limited to Anne Vithayathil, Carlos Pinto Coelho,
David Willis, Dimitry Vasilyev, Jaydeep Bardhan, Jung Hoon Lee, and Xin Hu.
Lastly, I want to thank my girlfriend, Annie Wang, for her care and patience.
Contents

1 Introduction ....................................................................... 6
  1.1 Mixed Discrete-Continuum Formulation ........................................ 7
  1.2 Integral Equation Formulation ................................................ 8
  1.3 Numerical Solution ........................................................... 10
  1.4 Pfft++ ....................................................................... 11
2 Input and output ................................................................. 13
  2.1 Molecular surface meshes ..................................................... 13
  2.2 Atom charges ................................................................. 13
  2.3 Dielectric constants and salt concentration .................................. 13
3 Defective Meshes ................................................................. 15
4 Algorithm Inefficiencies ......................................................... 17
  4.1 MaxBoundingSphereRadius ...................................................... 17
  4.2 pow and __ieee754_pow ........................................................ 19
5 Rotating input meshes ............................................................ 21
6 Intel Compiler ................................................................... 23
  6.1 Interprocedural Optimizations ................................................ 24
  6.2 Vectorization ................................................................ 25
  6.3 Profile-guided Optimizations ................................................. 26
7 Future Work ...................................................................... 28
8 Conclusion ....................................................................... 30
List of Figures

1 The continuum model of a solvated protein ........................................ 8
2 A pictorial representation of the precorrected FFT algorithm .................... 12
3 ECM molecule meshes ............................................................. 14
4 Defective meshes - small triangles .............................................. 15
5 Collapsing vertices ............................................................. 16
6 Diagonally oriented bar ......................................................... 21
7 bar example after re-orientation ................................................ 23
8 A small example of meshes with 300 panels ....................................... 28
1 Introduction
Electrostatic interactions play an important role in macromolecular systems, and
therefore much effort has been devoted to accurate modeling and simulation of
biomolecule electrostatics [1]. The computation of the strength of electrostatic
interactions for a biomolecule in an electrolyte solution, as well as the potential that the
molecule generates in space, is an essential element of this effort [1]. There are two
purposes for this type of simulation. First, it improves the understanding of the
relationships between electrostatics and stability, function, and molecular interactions.
Second, it serves as a tool for molecular design, because electrostatic complementarity is
an important feature of interacting molecules [1].
One way to simulate a protein macromolecule in an aqueous solution with nonzero
ionic strength is to adopt a mixed discrete-continuum approach based on combining a
continuum description of the macromolecule and solvent with a discrete description of
the atomic charges [1]. An efficient procedure to solve the mixed discrete-continuum
model numerically is to combine a carefully chosen integral formulation of the mixed
discrete-continuum model with a fast integral equation solver, namely the precorrected-FFT (pFFT) accelerated method [1].
1.1 Mixed Discrete-Continuum Formulation
One popular simplified model for biomolecule electrostatics is to approximate the
interior of a protein molecule as a collection of point charges in a uniform dielectric
material, where the dielectric constant is typically two to four times the permittivity of
free space [1]. Any surrounding solvent is modeled as a much higher permittivity
electrolyte whose behavior is described by the Debye-Hückel theory [1].
The interface between the protein and the solvent is defined by determining how close the
solvent molecules can approach the biomolecule [1].
As depicted in Figure 1, Region I corresponds to the interior of the protein and Region
II corresponds to the surrounding solvent. The electrostatic behavior in Region I, the
protein interior, is governed by a Poisson equation

\nabla^2 \varphi_1(\vec{r}) = -\sum_{i=1}^{n_c} \frac{q_i}{\epsilon_1}\,\delta(\vec{r} - \vec{r}_i)    (Region I)    (1)

where \varphi_1 is the electrostatic potential, \vec{r} is an evaluation position, \vec{r}_i is the location of
the i-th protein point charge, q_i is the point charge strength, n_c is the number of point
charges and \epsilon_1 is the dielectric constant in the protein interior. The electric potential in
the solvent, Region II of Figure 1, is presumed to satisfy the Helmholtz equation [1]

\nabla^2 \varphi_2(\vec{r}) - \kappa^2 \varphi_2(\vec{r}) = 0    (Region II)    (2)
Figure 1: The continuum model of a solvated protein. Region I (dielectric constant \epsilon_1) is the protein interior containing the point charges; Region II (dielectric constant \epsilon_2) is the surrounding solvent; \Omega denotes the molecular surface.
1.2 Integral Equation Formulation
We can start with the fundamental solutions to (1) and (2), respectively:
G_1(\vec{r};\vec{r}') = \frac{1}{4\pi|\vec{r}-\vec{r}'|}    (3)

G_2(\vec{r};\vec{r}') = \frac{e^{-\kappa|\vec{r}-\vec{r}'|}}{4\pi|\vec{r}-\vec{r}'|}    (4)
These two solutions along with Green's second theorem can be used to produce integral
equations for the potential and its normal derivative. The integral equations for Region I
and II are the following, respectively:
\varphi_1(\vec{r}) = \int_\Omega\left[G_1(\vec{r};\vec{r}')\frac{\partial\varphi_1(\vec{r}')}{\partial n'} - \varphi_1(\vec{r}')\frac{\partial G_1(\vec{r};\vec{r}')}{\partial n'}\right]d\vec{r}' + \sum_{i=1}^{n_c}\frac{q_i}{\epsilon_1}G_1(\vec{r};\vec{r}_i)    (5)

\varphi_2(\vec{r}) = \int_\Omega\left[-G_2(\vec{r};\vec{r}')\frac{\partial\varphi_2(\vec{r}')}{\partial n'} + \varphi_2(\vec{r}')\frac{\partial G_2(\vec{r};\vec{r}')}{\partial n'}\right]d\vec{r}'    (6)
The potentials satisfy boundary conditions at the molecular surface Q :
\varphi_1(\vec{r}) = \varphi_2(\vec{r})    (7)

\frac{\partial\varphi_1(\vec{r})}{\partial n} = \epsilon\,\frac{\partial\varphi_2(\vec{r})}{\partial n}    (8)

where \vec{r} \in \Omega and \epsilon = \epsilon_2/\epsilon_1 is the relative dielectric constant of the two regions. For the
boundary conditions to be met, we take the limits of equations (5) and (6) as \vec{r} \to \Omega
from the inside and outside of the molecular surface, respectively, and then we substitute
the limits into equations (7) and (8). The limits are
\varphi_1(\vec{r}_s) = \lim_{\vec{r}\to\Omega^-}\varphi_1(\vec{r})    (9)

 = \int_\Omega\left[G_1(\vec{r}_s;\vec{r}')\frac{\partial\varphi_1(\vec{r}')}{\partial n'} - \varphi_1(\vec{r}')\frac{\partial G_1(\vec{r}_s;\vec{r}')}{\partial n'}\right]d\vec{r}' + \frac{\varphi_1(\vec{r}_s)}{2} + \sum_{i=1}^{n_c}\frac{q_i}{\epsilon_1}G_1(\vec{r}_s;\vec{r}_i)    (10)

and

\varphi_2(\vec{r}_s) = \lim_{\vec{r}\to\Omega^+}\varphi_2(\vec{r})    (11)

 = \int_\Omega\left[-G_2(\vec{r}_s;\vec{r}')\frac{\partial\varphi_2(\vec{r}')}{\partial n'} + \varphi_2(\vec{r}')\frac{\partial G_2(\vec{r}_s;\vec{r}')}{\partial n'}\right]d\vec{r}' + \frac{\varphi_2(\vec{r}_s)}{2}    (12)
A pair of integral equations can be obtained after substituting (10) and (12) into (7) and
(8):

\frac{1}{2}\varphi_1(\vec{r}_s) + \int_\Omega\left[\varphi_1(\vec{r}')\frac{\partial G_1(\vec{r}_s;\vec{r}')}{\partial n'} - G_1(\vec{r}_s;\vec{r}')\frac{\partial\varphi_1(\vec{r}')}{\partial n'}\right]d\vec{r}' = \sum_{i=1}^{n_c}\frac{q_i}{\epsilon_1}G_1(\vec{r}_s;\vec{r}_i)    (13)

\frac{1}{2}\varphi_1(\vec{r}_s) - \int_\Omega\left[\varphi_1(\vec{r}')\frac{\partial G_2(\vec{r}_s;\vec{r}')}{\partial n'} - \frac{1}{\epsilon}G_2(\vec{r}_s;\vec{r}')\frac{\partial\varphi_1(\vec{r}')}{\partial n'}\right]d\vec{r}' = 0    (14)
Equations (13) and (14) are used to calculate \varphi_1 and \partial\varphi_1/\partial n on \Omega, the molecular surface.
These surface potentials and their normal derivatives can then be used in (5), (6), (7), and
(8) to calculate potentials at any location. In our biomolecule electrostatic simulation,
potentials at the locations of the point charges are computed according to the following
formula,

\varphi_1(\vec{r}_i) = \int_\Omega\left[G_1(\vec{r}_i;\vec{r}')\frac{\partial\varphi_1(\vec{r}')}{\partial n'} - \varphi_1(\vec{r}')\frac{\partial G_1(\vec{r}_i;\vec{r}')}{\partial n'}\right]d\vec{r}'    (15)

and an approximation of the solvation energy of the molecule is obtained by summing over
the products of the point charges and their potentials.
1.3 Numerical Solution
The surface is first discretized into a set of panels, and a piecewise constant basis
function, B_k, is associated with each panel [1]. The potentials are then represented as a
weighted combination of the panel basis functions [1]. That is,

\varphi_1(\vec{r}) = \sum_k a_k B_k(\vec{r})    (16)

\frac{\partial\varphi_1}{\partial n}(\vec{r}) = \sum_k b_k B_k(\vec{r})    (17)
The basis function weights are determined by insisting that when (16) and (17) are
substituted for the potential and its normal derivative in (13) and (14), the resulting
equations are exactly satisfied at those values of \vec{r} that correspond to panel centroids
[1]. The resulting system of equations is as follows:
\frac{1}{2}a_j + \sum_k a_k\int_{\mathrm{panel}_k}\frac{\partial G_1(\vec{r}_j;\vec{r}')}{\partial n'}\,d\vec{r}' - \sum_k b_k\int_{\mathrm{panel}_k} G_1(\vec{r}_j;\vec{r}')\,d\vec{r}' = \sum_{i=1}^{n_c}\frac{q_i}{\epsilon_1}G_1(\vec{r}_j;\vec{r}_i)

\frac{1}{2}a_j - \sum_k a_k\int_{\mathrm{panel}_k}\frac{\partial G_2(\vec{r}_j;\vec{r}')}{\partial n'}\,d\vec{r}' + \frac{1}{\epsilon}\sum_k b_k\int_{\mathrm{panel}_k} G_2(\vec{r}_j;\vec{r}')\,d\vec{r}' = 0    (18)

where \vec{r}_j is the centroid of panel j.
Equation (18) can be solved by an iterative approach with a precorrected-FFT algorithm
[1].
1.4 Pfft++
In our implementation, we use the iterative solver GMRES [2] to solve the matrix
equation (18). During each iteration of GMRES, four matrix-vector multiplications need
to be carried out:

\left[\int_{\mathrm{panel}_k}\frac{\partial G_1}{\partial n'}\,d\vec{r}'\right][a_k]    (19a),    \left[\int_{\mathrm{panel}_k} G_1\,d\vec{r}'\right][b_k]    (19b),

\left[\int_{\mathrm{panel}_k}\frac{\partial G_2}{\partial n'}\,d\vec{r}'\right][a_k]    (19c),    \left[\int_{\mathrm{panel}_k} G_2\,d\vec{r}'\right][b_k]    (19d)
These matrix-vector multiplications can be computed efficiently using pfft++, a general
and extensible fast integral equation solver based on a pre-corrected FFT algorithm [3].
The algorithm basically consists of four steps, as can be seen from Figure 2, where a
given set of panels from a discretized surface are superimposed on a uniform grid [1].
Figure 2: A pictorial representation of the precorrected FFT algorithm, with its four steps: projection, convolution, interpolation, and nearby interaction.
First, panel charges are projected onto their associated grid points, which are essentially
the grid points closest to the center of each panel [1]. This is called the projection step [1].
Second, given the distribution of grid charges, the potential at each grid point can be computed
using a convolution of the Green's function (the kernel) and the grid charges [1]. Since
the Green's function is position invariant, the matrix corresponding to this convolution is
a multilevel Toeplitz matrix [3]. The number of levels is 2 and 3 for 2D cases and 3D
cases, respectively [3]. This convolution is basically a Toeplitz matrix vector product
which can be computed efficiently using the Fast Fourier transform (FFT). Third, grid
potentials are interpolated back onto the panels, a step called interpolation [1]. In the last
step, known as precorrection, nearby interactions are computed directly from the integral,
and the contributions from the convolution step are removed for these nearby interactions
[1].
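The Toeplitz-to-circulant embedding behind the convolution step can be illustrated in one dimension. The sketch below multiplies an n-by-n Toeplitz matrix by a vector through a size-2n circulant; here the circulant product is written out directly, whereas pfft++ performs exactly that step with an FFT (names and layout are illustrative, not pfft++ code):

```cpp
#include <cassert>
#include <vector>

// Multiply a Toeplitz matrix T (first column c, first row r, with c[0] == r[0])
// by x, by embedding T in a circulant matrix of size 2n.  A circulant matrix
// C has entries C[i][j] = g[(i - j) mod 2n] for a single generator vector g,
// and a circulant matrix-vector product is a cyclic convolution, which is
// where the FFT acceleration enters in the real algorithm.
std::vector<double> toeplitzMatvec(const std::vector<double>& c,
                                   const std::vector<double>& r,
                                   const std::vector<double>& x) {
  int n = (int)x.size();
  int m = 2 * n;                               // circulant size
  std::vector<double> g(m, 0.0);               // circulant generator
  for (int i = 0; i < n; ++i) g[i] = c[i];     // first column of T
  for (int i = 1; i < n; ++i) g[m - i] = r[i]; // first row wraps around
  std::vector<double> xp(m, 0.0);              // x padded with n zeros
  for (int i = 0; i < n; ++i) xp[i] = x[i];
  std::vector<double> y(m, 0.0);               // direct circulant multiply
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < m; ++j)
      y[i] += g[(i - j + m) % m] * xp[j];
  return std::vector<double>(y.begin(), y.begin() + n);  // first n entries
}
```

Because the padded entries of x are zero, the first n entries of the circulant product equal the Toeplitz product exactly; in 3D the same embedding is applied level by level, giving the doubled grid dimensions mentioned in Section 7.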
2 Input and output
The input to our simulation consists of the following:
2.1 Molecular surface meshes
These are the panels that make up a discretized molecular surface, as needed for our
numerical solution. For a given molecule, we generate the molecular meshes using
the program MSMS [4], which takes as input the position and radius of each atom in a
molecule and produces a discretized surface consisting of triangles.
2.2 Atom charges
Each atom has a point charge situated at its center; these charges are the point charges
q_i that appear in the derivation. They are not necessarily the same even for atoms of the
same element, as they are derived from quantum mechanical calculations [1].
2.3 Dielectric constants and salt concentration
The inner dielectric constant \epsilon_1, the outer dielectric constant \epsilon_2, and the salt
concentration in Region II, which determines \kappa, need to be specified.
Our simulation takes in these parameters and solves equation (18) for a_k and b_k, thus
obtaining the potentials and their normal derivatives at the molecular surface from (16)
and (17). Then it computes the potential at each point charge and sums over the products
of point charges and their potentials to obtain the solvation energy of the molecule as the
output.
The main molecule example used is an E. coli chorismate mutase (ECM) protein
macromolecule with 3210 atoms [1]. MSMS is used to triangulate the molecular surface
to produce the meshes, which look like the following:

Figure 3: ECM molecule meshes
In our test runs, the inner dielectric constant is always 4, and the outer dielectric constant
is always 80. There are two cases for the salt concentration. One is called the no salt
case, which corresponds to the salt concentration being zero, equivalently \kappa = 0 and

G_2(\vec{r};\vec{r}') = G_1(\vec{r};\vec{r}') = \frac{1}{4\pi|\vec{r}-\vec{r}'|}

The other one is called the salt case, which corresponds to the salt concentration being
0.145 M, equivalently \kappa = 0.124 Angstrom^{-1} at 25 degrees C. The two cases differ
mainly in that in the salt case, \kappa \neq 0 and there is an exponential term that needs to be
evaluated in

G_2(\vec{r};\vec{r}') = \frac{e^{-\kappa|\vec{r}-\vec{r}'|}}{4\pi|\vec{r}-\vec{r}'|}

This difference turns out to impact the performance and improvement of the simulation
significantly, as we will see later.
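As a small sketch of the difference between the two kernels (hypothetical helper functions, not the simulation's actual kernel code; d is the distance |r - r'| in Angstroms):

```cpp
#include <cassert>
#include <cmath>

const double PI = 3.14159265358979323846;

// Free-space kernel (no salt case) as a function of distance d = |r - r'|.
double G1(double d) { return 1.0 / (4.0 * PI * d); }

// Screened Debye-Huckel kernel (salt case); kappa is the inverse screening
// length in Angstrom^-1.  The only difference is the exponential factor,
// which is exactly the extra cost incurred in the salt case.
double G2(double d, double kappa) {
  return std::exp(-kappa * d) / (4.0 * PI * d);
}
```

Setting kappa = 0 recovers the no salt kernel, matching the statement above that the two cases coincide at zero salt concentration.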
3 Defective Meshes
The surface meshes generated by MSMS actually have some triangles with areas close
to zero, and our simulation fails on these defective triangles. If we zoom in on the figure
shown above, we can see the existence of very small triangles:

Figure 4: Defective meshes - small triangles
One way to remove the small triangles is to, for each pair of vertices that are too close to
each other, collapse one vertex onto the other and reshape the triangles associated with
the moved vertex. For example, in the following case vertices A and B are too close and
we move A onto the location of B, and then get rid of A:

Figure 5: Collapsing vertices
In our approach, we first calculate the mean distance between adjacent vertices, and then
we obtain a threshold distance by multiplying the mean distance with a factor. We then
collapse pairs of vertices whose distances are smaller than the threshold value. We then
start the procedure again by calculating the new mean distance. This process
continues until there are no more vertex pairs to be collapsed.
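In outline, the cleanup loop looks like the following (a minimal sketch with a flat edge list; the actual mesh data structures and the reshaping of the affected triangles are more involved):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Vertex { double x, y, z; };

static double dist(const Vertex& a, const Vertex& b) {
  double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Repeatedly collapse vertex pairs closer than factor * (mean edge length),
// recomputing the mean after every pass, until no pair qualifies.  Edges are
// vertex-index pairs; collapsing redirects every reference to the dropped
// vertex onto the kept one, leaving degenerate (i,i) edges to be filtered
// out later along with their zero-area triangles.
void collapseShortEdges(std::vector<Vertex>& verts,
                        std::vector<std::pair<int,int> >& edges,
                        double factor) {
  bool collapsed = true;
  while (collapsed) {
    collapsed = false;
    double mean = 0.0;                      // recompute mean edge length
    int count = 0;
    for (std::size_t i = 0; i < edges.size(); ++i) {
      if (edges[i].first == edges[i].second) continue;  // already degenerate
      mean += dist(verts[edges[i].first], verts[edges[i].second]);
      ++count;
    }
    if (count == 0) break;
    double threshold = factor * (mean / count);
    for (std::size_t i = 0; i < edges.size(); ++i) {
      int keep = edges[i].first, drop = edges[i].second;
      if (keep == drop) continue;
      if (dist(verts[keep], verts[drop]) < threshold) {
        verts[drop] = verts[keep];          // move one vertex onto the other
        for (std::size_t j = 0; j < edges.size(); ++j) {  // redirect references
          if (edges[j].first == drop) edges[j].first = keep;
          if (edges[j].second == drop) edges[j].second = keep;
        }
        collapsed = true;
      }
    }
  }
}
```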
Obviously this process distorts the original shape of the surface meshes so ideally the
factor we use for the collapsing process should be as small as possible. For the ECM
molecule and no salt case, if we use 0.01 as the factor, the resulting meshes have 161210
panels and the solvation energy is -642.013 kcal/mol. The program takes 36 minutes and
21.275 seconds to run for this case. If we use 0.1 as the factor, the resulting meshes have
38802 panels and the solvation energy is -664.275 kcal/mol. The program takes 3
minutes and 23.521 seconds to run. This processing of the surface meshes reduces the
running time to about one tenth of the original amount at the cost of roughly 4 percent error.
Because our effort is mainly in speeding up the performance of our simulation, we use
the molecular surface meshes after 0.1-factor processing as our test example for the sake
of faster execution time.
The following is a series of improvements made to speed up the simulation. The
discussion and data given in each section are based on the assumption that the previous
improvements have been made.
4 Algorithm Inefficiencies
4.1 MaxBoundingSphereRadius
We used the gprof profiling tool to profile our program for the ECM
molecule no salt case, and we discovered that the largest share of the time is spent in one
particular function, as can be seen in the report from gprof (the total running time is 7
minutes and 41.092 seconds, or written more succinctly, 7m41.092s):
Flat profile:

Each sample counts as 0.00195312 seconds.
  %        cumulative   self                self      total
 time      seconds      seconds    calls    ms/call   ms/call   name
 40.04     187.96       187.96     98318    1.91      1.91      pfft::GridElement::findMaxBoundingSphereRadius(int) const
The above shows the first entry in the flat profile report. The flat profile report from
gprof consists of a list of entries with one entry for each function, and the entries are
listed in decreasing percentage time order. The first column of the entry indicates the
percentage time spent in the function over the total execution time. The second column
shows the total number of seconds spent in this function and all functions appearing
before it in the flat profile. The third column is the total time spent in the function,
excluding the time spent in those functions called by this function. The fourth column
shows the total number of calls to this function. The fifth column indicates the average
time spent executing this function each time it is called, and this figure counts only the
time spent executing the function itself and does not include any time spent executing any
of the functions called by this function. The sixth column is the average time spent in
this function each time it is called, and time spent executing functions called by this
function is included. The last column is the name of the function.
In the precorrection step of the pre-corrected FFT algorithm, we need to determine
which panels are considered "nearby", so we can compute the nearby interactions.
Because we associate each panel with a grid, we define two panels as nearby if their
associated grids are nearby, and two grids are considered nearby if their distance is
smaller than a certain threshold value. For each grid, we define a
MaxBoundingSphereRadius, which is the radius of the smallest sphere centered at the
grid point that encompasses all the panels associated with that grid point. In the original
algorithm, the greatest MaxBoundingSphereRadius is found among all the grids, and the
threshold value for determining nearby grids is set to be twice this greatest
MaxBoundingSphereRadius; the threshold value is therefore constant for every grid.
However, each grid should instead have its own threshold value, set to twice that grid's
MaxBoundingSphereRadius. Intuitively, the larger the panels associated with a grid are,
the more grids should be considered nearby to that grid. On the other hand, if the panels
associated with one grid are all small and clustered tightly around the grid, then only
grids that are very close to this grid should be considered nearby. The original algorithm
is thus inefficient in that it may overestimate the number of grids nearby a particular grid
by using the greatest MaxBoundingSphereRadius among all grids. A bug was also found
in the implementation: the greatest MaxBoundingSphereRadius is recomputed for each
grid, so the process of going through all the panels is repeated multiple times (as many
times as there are grids). After changing the algorithm to find the
MaxBoundingSphereRadius of each grid and use it to determine that grid's threshold
value for finding nearby grids, the function "pfft::GridElement::
findMaxBoundingSphereRadius(int) const" disappears from the top of the list of
time-consuming functions, and the total running time drops to 4m39.996s:
Flat profile:

Each sample counts as 0.00195312 seconds.
  %        cumulative   self                   self       total
 time      seconds      seconds    calls       ms/call    ms/call    name
 17.00     48.03        48.03      80          600.42     600.42     void pfft::Fast3DConv<...>::operator()<...>(...)
  7.84     70.18        22.15                                        mcount_internal
  7.22     90.58        20.40      1           20402.34   160654.80  pFFTbemNoSalt(...)
  6.23     108.18       17.60                                        __ieee754_pow
  5.38     123.38       15.20      21594       0.70       0.87       pfft::DirectMat<...>::precorrectRegularNeighbor(...)
  5.30     138.37       14.99      136486073   0.00       0.00       pfft::vector3D<double> pfft::operator-<double>(...)
  5.01     152.53       14.16                                        fftw_no_twiddle_32
  4.41     164.99       12.46                                        fftwi_no_twiddle_32
  3.73     175.53       10.54                                        pow
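The fix can be sketched as follows (a minimal illustration with a hypothetical data layout, not pfft++'s actual classes; the text above leaves open which of the two grids' radii sets the pairwise threshold, so this sketch conservatively uses the larger one):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y, z; };

static double dist(const Point& a, const Point& b) {
  double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// One pass over the panel vertices fills in every grid point's bounding
// radius, instead of rescanning all panels once per grid point (the bug
// described above).
std::vector<double> boundingRadii(const std::vector<Point>& gridCenters,
                                  const std::vector<Point>& panelVerts,
                                  const std::vector<int>& assignedGrid) {
  std::vector<double> radius(gridCenters.size(), 0.0);
  for (std::size_t v = 0; v < panelVerts.size(); ++v) {
    int g = assignedGrid[v];
    radius[g] = std::max(radius[g], dist(gridCenters[g], panelVerts[v]));
  }
  return radius;
}

// Per-grid nearby test: the threshold is twice the grid's own bounding
// radius rather than twice the global maximum radius over all grids.
bool isNearby(const std::vector<Point>& gridCenters,
              const std::vector<double>& radius, int g1, int g2) {
  return dist(gridCenters[g1], gridCenters[g2]) <
         2.0 * std::max(radius[g1], radius[g2]);
}
```

A grid whose panels are small and tightly clustered thus gets a small radius and few nearby grids, while only grids with large attached panels pay for a wide neighborhood.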
4.2 pow and __ieee754_pow
It is noticed that the functions __ieee754_pow and pow also take up a lot of the execution
time. These are calls to the pow function in the standard math.h library, used for
evaluating the polynomials needed in the projection and interpolation steps of pfft++ [3].
Because only floating point numbers raised to positive integral powers are needed for
computing the polynomials, calling the pow function is inefficient: pow uses
logarithms to compute floating point numbers raised to floating point powers. An efficient way
to calculate a floating point number d raised to the positive integral power e is as follows (written
in C code):
/* intpow returns d raised to the power e, for non-negative integer e */
double intpow(double d, int e){
    double product = 1.;
    double increment = d;
    while (e > 0){
        if ((e % 2) != 0){
            product = product * increment;
            e = e - 1;
        } else {
            increment = increment * increment;
            e = e / 2;
        }
    }
    return product;
}
This is an O(log(e)) algorithm. The basic idea is that if we write e in binary
representation, e = (e_n ... e_2 e_1)_2, where each e_i is either 0 or 1, then d^e is the
product of the factors d^(2^(i-1)) taken over all i for which e_i = 1. We go from e_1 to
e_n, computing d^(2^(i-1)) consecutively by repeated squaring and storing the value in
the variable "increment". The variable "product" is our container for the final answer,
and it starts at 1. Whenever we hit some e_i that is equal to 1, we multiply the current
value of "increment" into the product, and thus by the time we finish e_n we have
carried out the whole computation.
Replacing the original pow(...) function calls with calls to this function intpow(...)
reduces the amount of time spent executing the ECM no salt case from 4m39.996s
to 4m7.082s.
5 Rotating input meshes
The orientation of the input meshes sometimes affects the speed of the simulation
greatly. In the following example, an artificial rectangular bar with 20101 panels is
oriented diagonally. Our grids line up with the axes and need to cover the whole
structure, and therefore the grids will occupy the whole bounding box. As we can see,
the majority of the bounding box is empty and putting grids there is wasteful.
Figure 6: Diagonally oriented bar
Because the orientation of the input meshes should not affect the simulation result, we
can rotate the meshes to a certain orientation such that the volume of the bounding box of
the bar is minimized, in hope of reducing the number of grids used (though in reality
rotating the input meshes does affect the simulation result slightly due to numerical
discrepancies). We use a program whose algorithm is described in [5], with code
available at http://alis.cs.uiuc.edu/~sariel/research/papers/00/diameter/diam_prog.html,
to rotate the input meshes to an orientation at which the bounding box is approximately
minimized. This is a heuristic program that takes in a set of points and outputs an
approximately minimal rectangular box bounding the point set. The program needs
two heuristic input parameters, "grid_size" and "sample_size", in the function:

gdiam_bbox gdiam_approx_mvbb_grid_sample( gdiam_point * start, int size,
int grid_size, int sample_size )
where "*start" is a pointer to the point set and "size" is the number of points in the set.
Roughly speaking, the higher these two heuristic parameters "grid_size" and
"sample_size" are, the better the resulting approximate minimal bounding box is, but also
the longer it takes to compute this approximate minimal bounding box. A smaller
bounding box in general results in fewer grids and less execution time in the convolution
step of pfft++, but it also takes longer to find an orientation that leads to a smaller
bounding box. Therefore, there is a trade-off between time spent in the pfft++ convolution
and the re-orientation step. In principle, one needs to optimize the two heuristic
parameters "grid_size" and "sample_size" to achieve minimal overall execution time.
To demonstrate that re-orienting input meshes does give significant improvement
in execution time, we compare the performance of the original program with that of the
one which incorporates the re-orientation step. With the diagonally oriented bar of 20101
panels as input, we find the optimized "grid_size" and "sample_size" to be 5 and 270. When
we run the simulation for the no salt case, the original program produces an array of grids with
128 grids in the x, y, and z directions, and it spends 10m36.473s in total, of which 7m16.64s is
spent in the convolution step. For the program with re-orientation of the input meshes, the
simulation takes only 3m38.902s, of which 1m1.57s is spent in the convolution step. The
re-oriented input meshes look like the following; apparently the bounding box is much smaller
because the bar is very thin in one direction (note that the bounding box is not minimal,
because finding the minimal bounding box would be too time-consuming compared to the
amount of time that can be saved in the convolution step). The array of grids produced
for the re-oriented meshes has (256, 8, 256) grids in the x, y, and z directions. That is,
only one fourth of the original total number of grids.
Figure 7: bar example after re-orientation
6 Intel Compiler
As pointed out earlier, when simulating for the salt case, the exponential term in (4)
would need to be computed, and this actually takes up a significant amount of execution
time. For the ECM salt case, the most time-consuming function is the
__ieee754_exp function, which consumes 48.63 seconds out of the total running time of
8m53.934s. Because the Intel® C++ Compiler 7.1 is known to produce faster programs
than gcc, especially for mathematical functions such as the exponential, we compile our
code with the Intel Compiler in the hope of speeding up our simulation. The resulting
program spends 8m36.186s on the simulation, a slight improvement over the original
one. Although the __ieee754_exp function is no longer the dominant time-consuming
function in the new program, other functions show up on the top of the list of
time-consuming functions, most noticeably "std::vector<...>::size() const" (37.52 seconds)
and "std::vector<...>::begin() const" (17.15 seconds). This indicates that the "inline
function expansion" optimization may not be turned on, and as a result function calls
into the standard library take up much time.
6.1 Interprocedural Optimizations
According to the Intel® C++ Compiler User's Guide, one can use the -ip and -ipo flags
to enable interprocedural optimizations (IPO), which include optimizations such as inline
function expansion, interprocedural constant propagation, monitoring of module-level
static variables, dead code elimination, propagation of function characteristics, and
multifile optimization (which basically performs all the above optimizations across modules
and needs another flag, -ipo_obj, to enable in our case) [6]. After enabling this
optimization, the total running time drops from 8m36.186s to 8m23.436s, the time
spent in "std::vector<...>::size() const" drops from 37.52s to 35.96s, and the time spent
in "std::vector<...>::begin() const" becomes negligible.
6.2 Vectorization
The Intel Compiler also provides an optimization called vectorization, which is carried
out by a component of the compiler called the vectorizer, and which uses SIMD instructions
from the MMX™, SSE, and SSE2 instruction sets [6]. According to the user's guide, the
vectorizer "detects operations in the program that can be done in parallel, and then
converts the sequential program to process 2, 4, 8, or 16 elements in one operation,
depending on the data type" [6]. As an example, the following is a piece of code that can
be vectorized [6]:

i = 0;
while (i < n)
{
    /* original loop code */
    a[i] = b[i] + c[i];
    i++;
}

After being vectorized, the code is changed to the following, reducing the number of
assignments that need to be done by a factor of 4:

/* the vectorizer generates the following two loops */
i = 0;
while (i < (n - n%4))
{
    /* vector strip-mined loop */
    /* subscript [i:i+3] denotes SIMD execution */
    a[i:i+3] = b[i:i+3] + c[i:i+3];
    i = i + 4;
}
while (i < n)
{
    /* scalar clean-up loop */
    a[i] = b[i] + c[i];
    i++;
}
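For reference, the strip-mined pattern can be written out as ordinary, compilable code (a sketch; the compiler of course emits SIMD instructions rather than four scalar statements):

```cpp
#include <cassert>
#include <vector>

// Compilable version of the strip-mined loop above: a four-wide main loop
// plus a scalar clean-up loop.  (The [i:i+3] subscripts in the guide's
// listing are illustrative pseudo-code, not legal C.)
void addStripMined(const std::vector<double>& b,
                   const std::vector<double>& c,
                   std::vector<double>& a) {
  int n = (int)a.size();
  int i = 0;
  for (; i < n - n % 4; i += 4) {   // vector strip-mined loop
    a[i]     = b[i]     + c[i];
    a[i + 1] = b[i + 1] + c[i + 1];
    a[i + 2] = b[i + 2] + c[i + 2];
    a[i + 3] = b[i + 3] + c[i + 3];
  }
  for (; i < n; ++i)                // scalar clean-up loop
    a[i] = b[i] + c[i];
}
```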
Double precision types are not accepted for vectorization unless a Pentium® 4
processor system is used, which is what our simulation runs on; in this case the -xW
compiler option needs to be used [6]. After enabling this optimization, the execution time
of our simulation with the ECM salt case drops from 8m23.436s to 7m45.025s. The
performance of most functions is improved, though no single function enjoys a
particularly great amount of improvement. Noticeably, "std::vector<...>::size() const" is
still the second most time-consuming function at this stage, spending 38.24s out of
the total 7m45.025s.
6.3 Profile-guided Optimizations
This Intel Compiler optimization option is a two-stage process. At the first stage, the
compiler produces a program that, when executed, generates a file containing
information about which areas of the application are most frequently executed [6]. This
test run usually takes longer to execute because it needs to collect data during execution.
At the second stage, the compiler uses that file to generate a program that is optimized
using the information contained in the data [6]. The improvement from this optimization is
affected by the type of input that is used for the test run at the first stage. If at the first
stage we run the program with the ECM no salt case, the test run takes 26m21.076s and the
resulting program at the second stage does the simulation for the ECM salt case in
8m14.928s (which is worse than not doing this optimization at all). However, if we do
the test run with the ECM salt case, the test run takes 45m27.773s and the resulting
program finishes the execution of the ECM salt case in 7m5.971s. Peculiarly, if we put
both data files generated during the two separate test runs into the directory where the
compiler retrieves profile data, it will actually use both data files to produce a faster
program, which in our case runs the simulation of the ECM salt case in 6m55.332s. Because
the test runs take so long to finish for these cases, one may question the feasibility of this
optimization. We can take a smaller example for the test runs. If we use the meshes
(with 300 panels) shown in the figure below, the test run finishes in 0m8.059s for
the no salt case and 0m13.379s for the salt case. The resulting program runs the ECM
salt case in 7m10.961s, which, though not as good as the 6m55.332s obtained from
using ECM as the test run input, is still better than the 7m45.025s obtained without
profile-guided optimization, and it saves a lot of test-run execution time.
Figure 8: A small example of meshes with 300 panels
The performance of most functions is improved by the profile-guided optimization. The
most dramatic change is that after this optimization the execution time of the function
"std::vector<...>::size() const" becomes negligible, and this is where the majority of the
improvement comes from.
7 Future Work
The most time-consuming step of the simulation is the convolution step, taking up
43.58% of the total execution time for the no salt case and 41.7% for the salt case. As
described earlier, the convolution step is basically a multi-level Toeplitz matrix
vector product. For a grid array with m, n, and k grids in the x, y, and z directions
respectively, the matrix vector product can be carried out by expanding the three-level
Toeplitz matrix into a three-level circulant matrix with the size of each level being 2m,
2n, and 2k, and half of the entries in the circulant matrix being zero [7]. The matrix
vector product can thus be computed using a size (2m)*(2n)*(2k) 3D FFT+IFFT pair. In
theory, each size-N 3D FFT+IFFT pair takes 5*N*log(N) floating point operations
(flops), where the logarithm is base 2. However, because half of the entries in the
three-level circulant matrix are zero, the actual number of flops needed in our case is
5*N*log(N)/2. Therefore, for a grid array with m, n, and k grids in the x, y, and z
directions, the convolution step would spend 5*(2m)*(2n)*(2k)/2*log[(2m)*(2n)*(2k)]
flops in computing FFT+IFFT pairs. For the ECM salt case, 66.80% of the time spent
in the convolution step is spent computing FFT+IFFT pairs, taking up 59.14 seconds in
total. One may therefore wonder whether there is potential for improvement in computing
FFT+IFFT pairs. We use the fftw library [8] to compute the FFT+IFFT pairs in our
program, so it is worthwhile to compare the efficiency at which we compute the flops in
the FFT+IFFT pairs with the announced efficiency of the fftw library. For the ECM no
salt case, the numbers of grids in the x, y, and z directions are (m, n, k) = (64, 64, 128),
which corresponds to a 3D
transform size of 128*128*256. The convolution step is computed 78 times (the
multiplicity is due to GMRES iterations and the multiple matrix vector products that
must be computed in equation (19)), so the total number of flops computed is

5*78*128*128*256*log(128*128*256)/2 ≈ 17994 Mflops.

Dividing that by the execution time of 59.14 seconds, we obtain an efficiency of 304.25
Mflops/second. We are using an Intel® Pentium® 4 processor with a 1994 MHz clock rate that is capable
of doing 2 flops per cycle, so the maximum efficiency is 3988 Mflops/second. It seems
like we are only achieving 304.25/3988 = 7.63% of the efficiency, but it is actually not
too far off from the announced efficiency of fftw. For a Pentium II at 300 MHz, fftw
achieves about 60 Mflops/second for a 3D FFT+IFFT pair of size 128*128*128 [9],
which is only 10% of the maximum efficiency = 300 MHz * 2 flops/cycle = 600
Mflops/second. The efficiency goes down in general as the size goes up because memory
access dominates over floating point operations for large transforms [10]. Therefore, we
might expect the efficiency for the size of our problem (128*128*256) to be lower than
10%, and the 7.63% achieved in our program seems reasonable.
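The arithmetic above can be checked directly. This is a plain recomputation of the numbers quoted in the text (the no-salt grid dimensions, 78 convolution steps, the 59.14-second FFT+IFFT time, and the 1994 MHz, 2-flops-per-cycle processor):

```python
import math

m, n, k = 64, 64, 128            # grid dimensions for the ECM no salt case
N = (2 * m) * (2 * n) * (2 * k)  # circulant transform size, 128*128*256
pairs = 78                       # convolution steps over all GMRES iterations

# 5*N*log2(N) flops per size-N FFT+IFFT pair, halved because half of the
# circulant-matrix entries are zero.
total_flops = pairs * 5 * N * math.log2(N) / 2
mflops = total_flops / 1e6

rate = mflops / 59.14            # measured time spent in FFT+IFFT pairs (s)
peak = 1994 * 2                  # MHz clock * 2 flops/cycle -> Mflops/s

print(round(mflops))                 # 17994 Mflops
print(round(rate, 2))                # 304.25 Mflops/second
print(round(100 * rate / peak, 2))   # 7.63 percent of peak
```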
Since fftw is one of the fastest FFT libraries available, there does not seem to be much
potential in improving the computation of FFT+IFFT pairs. However, there is still a
significant amount of time (14.47% of the total execution time for the ECM no salt case)
spent in other parts of the matrix vector product algorithm such as expanding the Toeplitz
matrix into a circulant matrix, padding zeros to the input vector to make it of the
appropriate size, and retrieving the Toeplitz matrix vector product from the circulant
matrix vector product. Because only programs written in certain ways can be properly
vectorized by the Intel Compiler [6], and the current implementation may not have been
written in such ways, there is potential for speeding up this part of the simulation by
rewriting the code to take advantage of the vectorization optimization provided by the
Intel Compiler.
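The circulant-embedding machinery discussed above (expansion, zero padding, and retrieval) can be illustrated in one dimension; the 3D case applies the same embedding along each axis. This is an independent sketch, not the pfft++ code: a size-n Toeplitz matrix-vector product is computed through a size-2n circulant embedding and one FFT+IFFT pair.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply the Toeplitz matrix T (first column c, first row r, with
    r[0] == c[0]) by x, via a size-2n circulant embedding."""
    n = len(x)
    # First column of the 2n x 2n circulant: c, one padding entry, then
    # the tail of the first row in reverse order.
    col = np.concatenate([c, [0.0], r[:0:-1]])
    x_pad = np.concatenate([x, np.zeros(n)])  # zero-pad the input vector
    # Circulant matvec = IFFT( FFT(col) * FFT(x_pad) ).
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x_pad)).real
    return y[:n]  # retrieve the Toeplitz product from the circulant product

# Check against a directly assembled Toeplitz matrix.
rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)
r = rng.standard_normal(n); r[0] = c[0]
x = rng.standard_normal(n)
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)]
              for i in range(n)])
print(np.allclose(toeplitz_matvec_fft(c, r, x), T @ x))  # True
```

The padding and retrieval steps here are exactly the non-FFT overhead identified above, and are the kind of simple loops that compiler vectorization could accelerate.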
8 Conclusion
Methods for improving the performance of a biomolecule electrostatic simulator have
been presented and the speed has been more than doubled (for the ECM no salt case, the
execution time drops from 7m41.092s for the original implementation to 3m23.521s for
the profile-guided optimized implementation with both salt and no salt cases as test run
inputs). In extreme cases such as the diagonally oriented bar example described in
section 5, another factor of 3 improvement is possible. Speed plays an important role in
the simulation because as the size of the molecule grows larger and as the number of
panels increases for higher precision, the execution time will rise inevitably. In order for
our simulation to have greater applicability to areas such as drug design, faster
implementations and faster algorithms will always be pursued.
Bibliography
[1] Shihhsien Kuo, et al. "Fast Methods for Simulation of Biomolecule Electrostatics."
ICCAD 2002.
[2] Y. Saad and M. Schultz. GMRES: A generalized minimal residual algorithm for
solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical
Computing, 7:856-869, 1986.
[3] Zhenhai Zhu. "Efficient Techniques for Wideband Impedance Extraction of Complex
3-D Structures," Chapter 4. Master's thesis, EECS Department, MIT, August 2002.
[4] M. Sanner, A. J. Olson, and J. C. Spehner. Reduced surface: An efficient way to
compute molecular surfaces. Biopolymers, 38:305-320, 1996.
[5] Gill Barequet and Sariel Har-Peled. Efficiently approximating the minimum-volume
bounding box of a point set in three dimensions. Journal of Algorithms, 38(1):91-109,
Jan. 2001.
[6] Intel® C++ Compiler User's Guide.
[7] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins
University Press, second edition, 1989.
[8] http://www.fftw.org/
[9] http://www.fftw.org/benchfft/results/pii-300.html
[10] http://www.fftw.org/benchfft/doc/methodology.htm