Improvement of the Precorrected-FFT Implementation of Biomolecule Electrostatics Simulation

by Meng-Jiao Wu

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical Engineering and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

May 19, 2003

Copyright 2003 M.I.T. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 13, 2003
Certified by: Jacob K. White, Professor, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

ABSTRACT

The performance of the original implementation of biomolecule electrostatics simulation using the precorrected-FFT method can be improved both by pre-processing the input data and by modifying the implementation. The techniques used for these improvements are described, and the resulting effects are analyzed. Potential areas for further speedup are discussed.

Thesis Supervisor: Jacob K. White
Title: Professor

Acknowledgements

First of all, I would like to thank Prof. Jacob White for giving me the opportunity to work on this project and for his advice and support. I must thank Michael Altman and Shihhsien Kuo for numerous discussions, ideas, and guidance. I also want to thank Ben Song and Zhenhai Zhu for answering my endless questions on pfft++ and helping me with many technical difficulties. I would like to acknowledge Michal Rewienski for helping me set up the Intel Compiler and with related issues. Also, I would like to thank Chad Collins for providing support on software and facilities. Thanks should go to many others in the group who constantly give me encouragement and entertainment, including but not limited to Anne Vithayathil, Carlos Pinto Coelho, David Willis, Dimitry Vasilyev, Jaydeep Bardhan, Jung Hoon Lee, and Xin Hu. Lastly, I want to thank my girlfriend, Annie Wang, for her care and patience.

Contents

1 Introduction
  1.1 Mixed Discrete-Continuum Formulation
  1.2 Integral Equation Formulation
  1.3 Numerical Solution
  1.4 Pfft++
2 Input and output
  2.1 Molecular surface meshes
  2.2 Atom charges
  2.3 Dielectric constants and salt concentration
3 Defective Meshes
4 Algorithm Inefficiencies
  4.1 MaxBoundingSphereRadius
  4.2 pow and __ieee754_pow
5 Rotating input meshes
6 Intel Compiler
  6.1 Interprocedural Optimizations
  6.2 Vectorization
  6.3 Profile-guided Optimizations
7 Future Work
8 Conclusion

List of Figures

1 The continuum model of a solvated protein
2 A pictorial representation of the precorrected-FFT algorithm
3 ECM molecule meshes
4 Defective meshes: small triangles
5 Collapsing vertices
6 Diagonally oriented bar
7 Bar example after re-orientation
8 A small example of meshes with 300 panels

1 Introduction

Electrostatic interactions play an important role in macromolecular systems, and therefore much effort has been devoted to accurate modeling and simulation of biomolecule electrostatics [1]. The computation of the strength of the electrostatic interactions for a biomolecule in an electrolyte solution, as well as of the potential that the molecule generates in space, is an essential element of this effort [1]. There are two purposes for this type of simulation. First, it improves the understanding of the relationships between electrostatics and stability, function, and molecular interactions. Second, it serves as a tool for molecular design, because electrostatic complementarity is an important feature of interacting molecules [1].

One way to simulate a protein macromolecule in an aqueous solution with nonzero ionic strength is to adopt a mixed discrete-continuum approach that combines a continuum description of the macromolecule and solvent with a discrete description of the atomic charges [1]. An efficient procedure for solving the mixed discrete-continuum model numerically is to combine a carefully chosen integral formulation of the model with a fast integral equation solver, namely the precorrected-FFT (pFFT) accelerated method [1].

1.1 Mixed Discrete-Continuum Formulation

One popular simplified model for biomolecule electrostatics approximates the interior of a protein molecule as a collection of point charges in a uniform dielectric material, where the dielectric constant is typically two to four times larger than the permittivity of free space [1]. The surrounding solvent is modeled as a much higher permittivity electrolyte whose behavior is described by Debye-Hückel theory [1]. The interface between the protein and the solvent is defined by determining how closely the solvent molecules can approach the biomolecule [1]. As depicted in Figure 1, Region I corresponds to the interior of the protein and Region II corresponds to the surrounding solvent.
The electrostatic behavior in Region I, the protein interior, is governed by a Poisson equation

\[ \nabla^2 \varphi_1(\vec r) \;=\; -\sum_{i=1}^{n_c} \frac{q_i}{\epsilon_1}\, \delta(\vec r - \vec r_i) \qquad \text{(Region I)} \qquad (1) \]

where \(\varphi_1\) is the electrostatic potential, \(\vec r\) is an evaluation position, \(\vec r_i\) is the location of the \(i\)-th protein point charge, \(q_i\) is the point charge strength, \(n_c\) is the number of point charges, and \(\epsilon_1\) is the dielectric constant in the protein interior. The electric potential in the solvent, Region II of Figure 1, is presumed to satisfy the Helmholtz equation [1]

\[ \nabla^2 \varphi_2(\vec r) - \kappa^2 \varphi_2(\vec r) \;=\; 0 \qquad \text{(Region II)} \qquad (2) \]

Figure 1: The continuum model of a solvated protein (Region I: the protein interior with dielectric constant \(\epsilon_1\) and embedded point charges; Region II: the surrounding solvent with dielectric constant \(\epsilon_2\); \(\Omega\): the molecular surface separating the two).

1.2 Integral Equation Formulation

We can start with the fundamental solutions to (1) and (2), respectively:

\[ G_1(\vec r;\vec r') \;=\; \frac{1}{4\pi\,|\vec r - \vec r'|} \qquad (3) \]

\[ G_2(\vec r;\vec r') \;=\; \frac{e^{-\kappa |\vec r - \vec r'|}}{4\pi\,|\vec r - \vec r'|} \qquad (4) \]

These two solutions, together with Green's second theorem, can be used to produce integral equations for the potential and its normal derivative. The integral equations for Regions I and II are, respectively,

\[ \varphi_1(\vec r) \;=\; \int_\Omega \Big[ G_1(\vec r;\vec r')\,\frac{\partial \varphi_1(\vec r')}{\partial n'} - \varphi_1(\vec r')\,\frac{\partial G_1(\vec r;\vec r')}{\partial n'} \Big]\, dA' \;+\; \sum_{i=1}^{n_c} \frac{q_i}{\epsilon_1}\, G_1(\vec r;\vec r_i) \qquad (5) \]

\[ \varphi_2(\vec r) \;=\; \int_\Omega \Big[ -G_2(\vec r;\vec r')\,\frac{\partial \varphi_2(\vec r')}{\partial n'} + \varphi_2(\vec r')\,\frac{\partial G_2(\vec r;\vec r')}{\partial n'} \Big]\, dA' \qquad (6) \]

The potentials satisfy boundary conditions at the molecular surface \(\Omega\):

\[ \varphi_1(\vec r_s) \;=\; \varphi_2(\vec r_s) \qquad (7) \]

\[ \frac{\partial \varphi_1(\vec r_s)}{\partial n} \;=\; \hat\epsilon\, \frac{\partial \varphi_2(\vec r_s)}{\partial n} \qquad (8) \]

where \(\vec r_s \in \Omega\) and \(\hat\epsilon = \epsilon_2/\epsilon_1\) is the relative dielectric constant of the two regions. For the boundary conditions to be met, we take the limits of equations (5) and (6) as \(\vec r \to \Omega\) from the inside and the outside of the molecular surface, respectively, and then substitute the limits into equations (7) and (8). The limits are

\[ \varphi_1(\vec r_s) \;=\; \lim_{\vec r \to \vec r_s^{\,-}} \varphi_1(\vec r) \qquad (9) \]

\[ \;=\; \int_\Omega \Big[ G_1(\vec r_s;\vec r')\,\frac{\partial \varphi_1(\vec r')}{\partial n'} - \varphi_1(\vec r')\,\frac{\partial G_1(\vec r_s;\vec r')}{\partial n'} \Big]\, dA' \;+\; \frac{\varphi_1(\vec r_s)}{2} \;+\; \sum_{i=1}^{n_c} \frac{q_i}{\epsilon_1}\, G_1(\vec r_s;\vec r_i) \qquad (10) \]

and

\[ \varphi_2(\vec r_s) \;=\; \lim_{\vec r \to \vec r_s^{\,+}} \varphi_2(\vec r) \qquad (11) \]

\[ \;=\; \int_\Omega \Big[ -G_2(\vec r_s;\vec r')\,\frac{\partial \varphi_2(\vec r')}{\partial n'} + \varphi_2(\vec r')\,\frac{\partial G_2(\vec r_s;\vec r')}{\partial n'} \Big]\, dA' \;+\; \frac{\varphi_2(\vec r_s)}{2} \qquad (12) \]

A pair of integral equations is obtained after substituting (10) and (12) into (7) and (8):

\[ \frac{1}{2}\varphi_1(\vec r_s) + \int_\Omega \varphi_1(\vec r')\,\frac{\partial G_1(\vec r_s;\vec r')}{\partial n'}\, dA' - \int_\Omega G_1(\vec r_s;\vec r')\,\frac{\partial \varphi_1(\vec r')}{\partial n'}\, dA' \;=\; \sum_{i=1}^{n_c} \frac{q_i}{\epsilon_1}\, G_1(\vec r_s;\vec r_i) \qquad (13) \]

\[ \frac{1}{2}\varphi_1(\vec r_s) - \int_\Omega \varphi_1(\vec r')\,\frac{\partial G_2(\vec r_s;\vec r')}{\partial n'}\, dA' + \frac{1}{\hat\epsilon}\int_\Omega G_2(\vec r_s;\vec r')\,\frac{\partial \varphi_1(\vec r')}{\partial n'}\, dA' \;=\; 0 \qquad (14) \]

Equations (13) and (14) are used to calculate \(\varphi_1\) and \(\partial\varphi_1/\partial n\) on \(\Omega\), the molecular surface. These surface potentials and their normal derivatives can then be used in (5), (6), (7), and (8) to calculate the potential at any location. In our biomolecule electrostatics simulation, the potentials at the locations of the point charges are computed according to the following formula,

\[ \varphi(\vec r_i) \;=\; \int_\Omega \Big[ G_1(\vec r_i;\vec r')\,\frac{\partial \varphi_1(\vec r')}{\partial n'} - \varphi_1(\vec r')\,\frac{\partial G_1(\vec r_i;\vec r')}{\partial n'} \Big]\, dA' \qquad (15) \]

and an approximation of the solvation energy of the molecule is obtained by summing the products of the point charges and their potentials.

1.3 Numerical Solution

The surface is first discretized into a set of panels, and a piecewise constant basis function \(B_k\) is associated with each panel [1]. The potentials are then represented as a weighted combination of the panel basis functions [1]. That is,

\[ \varphi(\vec r) \;=\; \sum_k a_k B_k(\vec r) \qquad (16) \]

\[ \frac{\partial \varphi}{\partial n}(\vec r) \;=\; \sum_k b_k B_k(\vec r) \qquad (17) \]

The basis function weights are determined by insisting that, when (16) and (17) are substituted for the potential and its normal derivative in (13) and (14), the resulting equations are exactly satisfied at those values of \(\vec r\) which correspond to panel centroids [1]. The resulting system of equations is as follows:

\[
\begin{aligned}
\frac{1}{2} a_j + \sum_k a_k \int_{\text{panel}_k} \frac{\partial G_1(\vec r_j;\vec r')}{\partial n'}\, dA' - \sum_k b_k \int_{\text{panel}_k} G_1(\vec r_j;\vec r')\, dA' &\;=\; \sum_{i=1}^{n_c} \frac{q_i}{\epsilon_1}\, G_1(\vec r_j;\vec r_i) \\
\frac{1}{2} a_j - \sum_k a_k \int_{\text{panel}_k} \frac{\partial G_2(\vec r_j;\vec r')}{\partial n'}\, dA' + \frac{1}{\hat\epsilon} \sum_k b_k \int_{\text{panel}_k} G_2(\vec r_j;\vec r')\, dA' &\;=\; 0
\end{aligned}
\qquad (18)
\]

where \(\vec r_j\) is the centroid of panel \(j\). Equation (18) can be solved by an iterative approach with a precorrected-FFT algorithm [1].
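Once (18) has been solved for the coefficients a_k and b_k, equation (15) gives the potential at each point-charge location. The following is a minimal, self-contained C++ sketch of that final step with hypothetical names (Panel, reactionPotential); the panel integrals are crudely approximated by the kernel evaluated at the panel centroid times the panel area, which is only meant to make the formula concrete and is not the quadrature the actual code uses.

    #include <cmath>
    #include <vector>

    const double PI = 3.141592653589793;

    struct Panel {
        double cx, cy, cz;   // centroid
        double nx, ny, nz;   // unit outward normal
        double area;
    };

    static double G1(double d) { return 1.0 / (4.0 * PI * d); }

    // Reaction potential at the point-charge location (x, y, z), following (15).
    double reactionPotential(const std::vector<Panel>& panels,
                             const std::vector<double>& a,   // phi on each panel
                             const std::vector<double>& b,   // dphi/dn on each panel
                             double x, double y, double z)
    {
        double phi = 0.0;
        for (std::size_t k = 0; k < panels.size(); ++k) {
            const Panel& p = panels[k];
            double rx = p.cx - x, ry = p.cy - y, rz = p.cz - z;
            double d  = std::sqrt(rx*rx + ry*ry + rz*rz);
            // dG1/dn' at the centroid: gradient of G1 with respect to r', dotted with n'
            double dG1dn = -(rx*p.nx + ry*p.ny + rz*p.nz) / (4.0 * PI * d*d*d);
            phi += (G1(d) * b[k] - dG1dn * a[k]) * p.area;
        }
        return phi;
    }

Summing q_i times this potential over all point charges then gives the solvation-energy estimate described above.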
1.4 Pfft++

In our implementation, we use the iterative solver GMRES [2] to solve the matrix equation (18). During each iteration of GMRES, four matrix-vector multiplications need to be carried out:

\[ \Big[ \int_{\text{panel}_k} \frac{\partial G_1}{\partial n'}\, dA' \Big]\,[a_k], \qquad \Big[ \int_{\text{panel}_k} G_1\, dA' \Big]\,[b_k], \qquad (19a,b) \]

\[ \Big[ \int_{\text{panel}_k} \frac{\partial G_2}{\partial n'}\, dA' \Big]\,[a_k], \qquad \Big[ \int_{\text{panel}_k} G_2\, dA' \Big]\,[b_k]. \qquad (19c,d) \]

These matrix-vector multiplications can be computed efficiently using pfft++, a general and extensible fast integral equation solver based on a precorrected-FFT algorithm [3].

The algorithm consists of four steps, as can be seen in Figure 2, where a given set of panels from a discretized surface is superimposed on a uniform grid [1].

Figure 2: A pictorial representation of the precorrected-FFT algorithm (projection, convolution, interpolation, and nearby-interaction precorrection).

First, panel charges are projected onto their associated grid points, which are essentially the grid points closest to the center of each panel [1]. This is called the projection step [1]. Second, given the distribution of grid charges, the potential at each grid point can be computed by a convolution of the Green's function (the kernel) with the grid charges [1]. Since the Green's function is position invariant, the matrix corresponding to this convolution is a multilevel Toeplitz matrix [3]; the number of levels is 2 for 2D cases and 3 for 3D cases [3]. This convolution is essentially a Toeplitz matrix-vector product, which can be computed efficiently using the fast Fourier transform (FFT). Third, grid potentials are interpolated back onto the panels, a step called interpolation [1]. In the last step, known as precorrection, nearby interactions are computed directly from the integrals, and the contributions of the convolution step to these nearby interactions are removed [1].

2 Input and output

The input to our simulation consists of the following:

2.1 Molecular surface meshes

These are the panels that make up a discretized molecular surface, as needed for our numerical solution. For a given molecule, we generate the molecular meshes using the program MSMS [4], which takes as input the position and radius of each atom in a molecule and produces a discretized surface consisting of triangles.

2.2 Atom charges

Each atom has a point charge situated at its center; these charges are the point charges q_i that appear in the derivation. They are not necessarily the same even for identical atoms, as they are derived from quantum mechanical calculations [1].

2.3 Dielectric constants and salt concentration

The inner dielectric constant \(\epsilon_1\), the outer dielectric constant \(\epsilon_2\), and the salt concentration in Region II, which determines \(\kappa\), need to be specified.

Our simulation takes in these parameters and solves equation (18) for the a_k and b_k, thus obtaining the potentials and their normal derivatives at the molecular surface through (16) and (17). It then computes the potential at each point charge and sums the products of the point charges and their potentials to obtain the solvation energy of the molecule as the output.

The main molecule example used is an E. coli chorismate mutase (ECM) protein macromolecule with 3210 atoms [1]. MSMS is used to triangulate the molecular surface to produce the meshes, shown in Figure 3.

Figure 3: ECM molecule meshes.

In our test runs, the inner dielectric constant is always 4, and the outer dielectric constant is always 80.
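For reference, the two kernels (3) and (4) that these parameters feed into can be evaluated as below; this is a minimal sketch, assuming \(\kappa\) has already been derived from the salt concentration, and the function names are illustrative rather than the pfft++ interface.

    #include <cmath>

    const double PI = 3.141592653589793;

    // Laplace kernel, equation (3); also the G2 of the no-salt case (kappa = 0).
    double kernelG1(double dist) {
        return 1.0 / (4.0 * PI * dist);
    }

    // Screened kernel, equation (4).  kappa comes from the salt concentration
    // (0.124 1/Angstrom for the 0.145 M case used here); the extra exp() call
    // is what makes the salt case noticeably more expensive to evaluate.
    double kernelG2(double dist, double kappa) {
        return std::exp(-kappa * dist) / (4.0 * PI * dist);
    }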
There are two cases for the salt concentration. One is called the no-salt case, which corresponds to the salt concentration being zero, equivalently \(\kappa = 0\) and

\[ G_2(\vec r;\vec r') \;=\; G_1(\vec r;\vec r') \;=\; \frac{1}{4\pi\,|\vec r - \vec r'|}. \]

The other is called the salt case, which corresponds to a salt concentration of 0.145 M, equivalently \(\kappa = 0.124\) Å\(^{-1}\) at 25 °C. The two cases differ mainly in that in the salt case \(\kappa \neq 0\), so there is an exponential term that needs to be evaluated in

\[ G_2(\vec r;\vec r') \;=\; \frac{e^{-\kappa |\vec r - \vec r'|}}{4\pi\,|\vec r - \vec r'|}. \]

This difference turns out to affect the performance of the simulation, and how much it can be improved, quite significantly, as we will see later.

3 Defective Meshes

The surface meshes generated by MSMS actually contain some triangles with areas close to zero, and our simulation fails on these defective triangles. If we zoom in on the figure shown above, we can see the existence of very small triangles.

Figure 4: Defective meshes: small triangles.

One way to remove the small triangles is, for each pair of vertices that are too close to each other, to collapse one vertex onto the other and reshape the triangles associated with the moved vertex. For example, in the following case vertices A and B are too close, so we move A onto the location of B and then get rid of A.

Figure 5: Collapsing vertices (vertex A is moved onto vertex B and removed).

In our approach, we first calculate the mean distance between adjacent vertices, and then obtain a threshold distance by multiplying the mean distance by a factor. We then collapse every pair of vertices whose distance is smaller than the threshold value. We then start the procedure again by recalculating the mean distance. This process continues until there are no more vertex pairs to collapse.
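A minimal sketch of one such collapse pass follows. The mesh data structures and the actual cleanup code differ in detail, and instead of literally moving vertex A onto B the sketch simply remaps A's index to B, which has the same effect on the triangles; names such as collapsePass are illustrative.

    #include <cmath>
    #include <vector>

    struct Vertex   { double x, y, z; };
    struct Triangle { int v[3]; };

    static double dist(const Vertex& a, const Vertex& b) {
        double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return std::sqrt(dx*dx + dy*dy + dz*dz);
    }

    /* follow the chain of collapses so every vertex maps to its final survivor */
    static int resolve(const std::vector<int>& remap, int v) {
        while (remap[v] != v) v = remap[v];
        return v;
    }

    /* One pass: collapse every edge shorter than factor * (mean edge length).
       Returns true if anything was collapsed, so the caller repeats until false. */
    bool collapsePass(const std::vector<Vertex>& verts,
                      std::vector<Triangle>& tris, double factor)
    {
        double sum = 0.0;
        long   edges = 0;
        for (std::size_t t = 0; t < tris.size(); ++t)
            for (int e = 0; e < 3; ++e) {
                sum += dist(verts[tris[t].v[e]], verts[tris[t].v[(e+1)%3]]);
                ++edges;
            }
        const double threshold = factor * (sum / edges);

        std::vector<int> remap(verts.size());
        for (std::size_t i = 0; i < verts.size(); ++i) remap[i] = (int)i;

        bool changed = false;
        for (std::size_t t = 0; t < tris.size(); ++t)
            for (int e = 0; e < 3; ++e) {
                int a = resolve(remap, tris[t].v[e]);
                int b = resolve(remap, tris[t].v[(e+1)%3]);
                if (a != b && dist(verts[a], verts[b]) < threshold) {
                    remap[a] = b;               /* collapse vertex A onto vertex B */
                    changed = true;
                }
            }

        /* apply the collapses and drop triangles that became degenerate */
        std::vector<Triangle> kept;
        for (std::size_t t = 0; t < tris.size(); ++t) {
            Triangle tr = tris[t];
            for (int e = 0; e < 3; ++e) tr.v[e] = resolve(remap, tr.v[e]);
            if (tr.v[0] != tr.v[1] && tr.v[1] != tr.v[2] && tr.v[0] != tr.v[2])
                kept.push_back(tr);
        }
        tris.swap(kept);
        return changed;
    }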
Obviously this process distorts the original shape of the surface meshes, so ideally the factor used for the collapsing process should be as small as possible. For the ECM molecule and the no-salt case, if we use 0.01 as the factor, the resulting meshes have 161210 panels, the solvation energy is -642.013 kcal/mol, and the program takes 36 minutes and 21.275 seconds to run. If we use 0.1 as the factor, the resulting meshes have 38802 panels, the solvation energy is -664.275 kcal/mol, and the program takes 3 minutes and 23.521 seconds to run. Using the coarser 0.1 factor thus reduces the running time to roughly a tenth of that of the 0.01-factor meshes, at the cost of about a 4 percent difference in the computed solvation energy. Because our effort is aimed mainly at speeding up the simulation, we use the molecular surface meshes produced with the 0.1 factor as our test example, for the sake of faster execution time.

The following is a series of improvements made to speed up the simulation. The discussion and data given in each section assume that the previous improvements have already been made.

4 Algorithm Inefficiencies

4.1 MaxBoundingSphereRadius

We used the g++ profiler (the gprof command) to profile our program for the ECM molecule, no-salt case, and we discovered that the majority of the time is spent in one particular function, as can be seen in the report from gprof (the total running time is 7 minutes and 41.092 seconds, or written more succinctly, 7m41.092s):

    Flat profile:
    Each sample counts as 0.00195312 seconds.
      %   cumulative    self              self     total
     time    seconds  seconds    calls  ms/call  ms/call  name
    40.04     187.96   187.96    98318     1.91     1.91  pfft::GridElement::findMaxBoundingSphereRadius(int) const

The above shows the first entry in the flat profile report. The flat profile report from gprof consists of a list of entries, one entry per function, listed in order of decreasing percentage of time. The first column of an entry indicates the percentage of the total execution time spent in the function. The second column shows the total number of seconds spent in this function and in all functions appearing before it in the flat profile. The third column is the total time spent in the function itself, excluding time spent in functions called by it. The fourth column shows the total number of calls to the function. The fifth column indicates the average time spent executing the function each time it is called; this figure counts only time spent in the function itself and excludes time spent in functions it calls. The sixth column is the average time spent in the function per call, including time spent in the functions it calls. The last column is the name of the function.

In the precorrection step of the precorrected-FFT algorithm, we need to determine which panels are considered "nearby", so that we can compute the nearby interactions directly. Because each panel is associated with a grid point, we define two panels as nearby if their associated grid points are nearby, and two grid points are considered nearby if their distance is smaller than a certain threshold value. For each grid point, we define a MaxBoundingSphereRadius, which is the radius of the smallest sphere centered at the grid point that encompasses all the panels associated with that grid point. In the original algorithm, the greatest MaxBoundingSphereRadius over all grid points is found, and the threshold value for determining nearby grid points is set to twice this greatest MaxBoundingSphereRadius; the threshold value is therefore the same for every grid point. However, each grid point should really have its own threshold, set to twice that grid point's MaxBoundingSphereRadius. Intuitively, the larger the panels associated with a grid point are, the more grid points should be considered nearby to it. Conversely, if the panels associated with one grid point are all small and clustered tightly around it, then only grid points that are very close to it should be considered nearby. Using the greatest MaxBoundingSphereRadius over all grid points is an algorithmic inefficiency because it can overestimate the number of grid points that are nearby to any particular grid point. In addition, a bug was found in the implementation: the greatest MaxBoundingSphereRadius was recomputed once for each grid point, so the process of going through all the panels was repeated as many times as there are grid points.
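A sketch of the revised criterion is shown below, with illustrative names rather than the actual pfft++ classes: each grid point's radius is computed once and cached, and the nearby test then uses that grid point's own threshold instead of a single global worst-case value.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Point { double x, y, z; };

    static double distance(const Point& a, const Point& b) {
        double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return std::sqrt(dx*dx + dy*dy + dz*dz);
    }

    /* radius of the smallest sphere centered at the grid point that contains
       every vertex of every panel associated with that grid point            */
    double maxBoundingSphereRadius(const Point& gridPoint,
                                   const std::vector<Point>& panelVertices)
    {
        double r = 0.0;
        for (std::size_t i = 0; i < panelVertices.size(); ++i)
            r = std::max(r, distance(gridPoint, panelVertices[i]));
        return r;
    }

    /* Original scheme: one global threshold = 2 * (largest radius over all grid
       points).  Revised scheme: each grid point's radius is computed exactly
       once, cached in `radius`, and used as that grid point's own threshold.   */
    bool isNearby(const std::vector<Point>& grids,
                  const std::vector<double>& radius,   /* one cached entry per grid point */
                  std::size_t g1, std::size_t g2)
    {
        return distance(grids[g1], grids[g2]) < 2.0 * radius[g1];
    }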
After changing the algorithm to compute the MaxBoundingSphereRadius of each grid point and use it to determine that grid point's threshold for finding nearby grid points, the function pfft::GridElement::findMaxBoundingSphereRadius(int) const disappears from the top of the list of time-consuming functions, and the total running time drops to 4m39.996s:

    Flat profile:
    Each sample counts as 0.00195312 seconds.
      %   cumulative    self                  self      total
     time    seconds  seconds      calls   ms/call    ms/call  name
    17.00      48.03    48.03         80    600.42     600.42  void pfft::Fast3DConv<...>::operator()<...>(...)
     7.84      70.18    22.15                                  mcount_internal
     7.22      90.58    20.40          1  20402.34  160654.80  pFFTbemNoSalt(...)
     6.23     108.18    17.60                                  __ieee754_pow
     5.38     123.38    15.20      21594      0.70       0.87  pfft::DirectMat<...>::precorrectRegularNeighbor(...)
     5.30     138.37    14.99  136486073      0.00       0.00  pfft::vector3D<double> pfft::operator-<double>(...)
     5.01     152.53    14.16                                  fftw_no_twiddle_32
     4.41     164.99    12.46                                  fftwi_no_twiddle_32
     3.73     175.53    10.54                                  pow

4.2 pow and __ieee754_pow

It is noticeable that the functions __ieee754_pow and pow also take up a lot of the execution time. These calls to the pow function of the standard math.h library are used to evaluate the polynomials needed in the projection and interpolation steps of pfft++ [3]. Because only floating point numbers raised to positive integer powers are needed for computing the polynomials, calling pow is inefficient: pow uses logarithms in order to handle floating point numbers raised to floating point powers. An efficient way to calculate a floating point number d raised to a positive integer power e is as follows (written in C); the function intpow returns d to the power e:

    double intpow(double d, int e) {
        double product = 1.0;      /* accumulates the final answer */
        double increment = d;      /* holds d raised to the current power of two */
        while (e > 0) {
            if ((e % 2) != 0) {
                /* the current bit of e is 1: multiply the current power of d in */
                product = product * increment;
                e = e - 1;
            } else {
                /* the current bit is 0: square and move on to the next bit */
                increment = increment * increment;
                e = e / 2;
            }
        }
        return product;
    }

This is an O(log e) function. The basic idea is that if we write e in binary representation, e = (e_n ... e_2 e_1)_2 where each e_j is either 0 or 1, then d^e is the product of d^(2^(j-1)) over the bit positions j for which e_j = 1. We go from e_1 to e_n, computing the successive powers d^(2^(j-1)) by repeated squaring and storing the current one in the variable "increment". The variable "product" is the container for the final answer and starts at 1. Whenever we hit a bit e_j that is equal to 1, we multiply the corresponding power of d, which is saved in "increment", into "product"; by the time we have processed e_n, the whole computation has been carried out. Replacing the original pow(...) calls with calls to intpow(...) reduces the execution time for the ECM no-salt case from 4m39.996s to 4m7.082s.

5 Rotating input meshes

The orientation of the input meshes sometimes affects the speed of the simulation greatly. In the following example, an artificial rectangular bar with 20101 panels is oriented diagonally. Our grid is aligned with the coordinate axes and needs to cover the whole structure, so the grid points occupy the whole axis-aligned bounding box. As we can see, the majority of the bounding box is empty, and putting grid points there is wasteful.

Figure 6: Diagonally oriented bar.

Because the orientation of the input meshes should not affect the simulation result, we can rotate the meshes to an orientation in which the volume of the bounding box is minimized, in the hope of reducing the number of grid points used (though in reality rotating the input meshes does affect the simulation result slightly, owing to numerical discrepancies). We use a program whose algorithm is described in [5], with code available at http://valis.cs.uiuc.edu/~sariel/research/papers/00/diameter/diam_prog.html, to rotate the input meshes to an orientation at which the bounding box is approximately minimized. This is a heuristic program that takes in a set of points and outputs an approximately minimal rectangular box bounding the point set. The program needs two heuristic input parameters, "grid_size" and "sample_size", in the function

    gdiam_bbox gdiam_approx_mvbb_grid_sample(gdiam_point* start, int size,
                                             int grid_size, int sample_size)

where *start is a pointer to the point set and size is the number of points in the set. Roughly speaking, the higher the two heuristic parameters grid_size and sample_size are, the better the resulting approximate minimal bounding box is, but also the longer it takes to compute.
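A hedged sketch of how the re-orientation step might call this routine is given below. Only gdiam_approx_mvbb_grid_sample and its arguments are taken from the text above; the header name, the gdiam_real type, the gdiam_convert packing helper, and the way the mesh vertices are flattened into a point array are assumptions about that package's interface.

    #include "gdiam.h"          /* assumed header name from the downloaded package */
    #include <vector>

    struct Vertex { double x, y, z; };

    gdiam_bbox approxMinimalBBox(const std::vector<Vertex>& verts,
                                 int gridSize, int sampleSize)
    {
        /* pack the mesh vertices into the flat coordinate array the library expects */
        std::vector<gdiam_real> coords(3 * verts.size());
        for (std::size_t i = 0; i < verts.size(); ++i) {
            coords[3*i + 0] = verts[i].x;
            coords[3*i + 1] = verts[i].y;
            coords[3*i + 2] = verts[i].z;
        }
        gdiam_point* pts = gdiam_convert(&coords[0], (int)verts.size());
        return gdiam_approx_mvbb_grid_sample(pts, (int)verts.size(),
                                             gridSize, sampleSize);
    }

The returned box then defines the rotation applied to the meshes before the grid is laid down.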
A smaller bounding box generally results in fewer grid points and less execution time in the convolution step of pfft++, but it also takes longer to find an orientation that leads to a smaller bounding box. There is therefore a trade-off between the time spent in the pfft++ convolution and in the re-orientation step. In principle, one needs to tune the two heuristic parameters grid_size and sample_size to achieve the minimal overall execution time.

To demonstrate that re-orienting the input meshes does give a significant improvement in execution time, we compare the performance of the original program with that of the version which incorporates the re-orientation step. With the diagonally oriented bar of 20101 panels as input, we found the optimized grid_size and sample_size to be 5 and 270. Running the no-salt case, the original program produces an array with 128 grid points in each of the x, y, and z directions; it spends 10m36.473s in total, of which 7m16.64s is spent in the convolution step. For the program that re-orients the input meshes, the simulation takes only 3m38.902s, of which 1m1.57s is spent in the convolution step. The re-oriented input meshes are shown in Figure 7; the bounding box is clearly much smaller because the bar is very thin in one direction (note that the box is not minimal, because finding the truly minimal bounding box would be too time-consuming compared to the amount of time that can be saved in the convolution step). The array of grid points produced for the re-oriented meshes has (256, 8, 256) grid points in the x, y, and z directions, that is, only one fourth of the original total number of grid points.

Figure 7: Bar example after re-orientation.

6 Intel Compiler

As pointed out earlier, when simulating the salt case the exponential term in (4) needs to be computed, and this actually takes up a significant amount of execution time. For the ECM salt case, the most time-consuming function is __ieee754_exp, which consumes 48.63 seconds out of the total running time of 8m53.934s. Because the Intel® C++ Compiler 7.1 is known to produce faster programs than gcc, especially for mathematical functions such as the exponential, we compile our code with the Intel Compiler in the hope of speeding up the simulation. The resulting program spends 8m36.186s on the simulation, a slight improvement over the original one. Although __ieee754_exp is no longer the dominant time-consuming function in the new program, other functions show up at the top of the list of time-consuming functions, most noticeably std::vector<...>::size() const (37.52 seconds) and std::vector<...>::begin() const (17.15 seconds). This indicates that the "inline function expansion" optimization may not be turned on, with the result that calls into the standard library take up much time.
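To illustrate why this hurts (this is not code taken from pfft++, only an example of the pattern), a loop that calls std::vector::size() on every iteration pays for a real function call each time around when inlining is off, whereas hoisting the call out of the loop removes that cost regardless of compiler settings:

    #include <vector>

    /* size() is called on every iteration; without inlining each call is a
       genuine function call                                                   */
    double accumulateSlow(const std::vector<double>& v) {
        double s = 0.0;
        for (std::size_t i = 0; i < v.size(); ++i)
            s += v[i];
        return s;
    }

    /* the call is hoisted out of the loop, so the loop body touches only a
       plain index and the data                                               */
    double accumulateFast(const std::vector<double>& v) {
        double s = 0.0;
        const std::size_t n = v.size();
        for (std::size_t i = 0; i < n; ++i)
            s += v[i];
        return s;
    }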
6.1 Interprocedural Optimizations

According to the Intel® C++ Compiler User's Guide, one can use the -ip and -ipo flags to enable interprocedural optimizations (IPO), which include optimizations such as inline function expansion, interprocedural constant propagation, monitoring of module-level static variables, dead code elimination, propagation of function characteristics, and multifile optimization (which applies all of the above optimizations across modules and, in our case, needs another flag, -ipo-obj, to enable) [6]. After enabling this optimization, the total running time drops from 8m36.186s to 8m23.436s, the time spent in std::vector<...>::size() const drops from 37.52s to 35.96s, and the time spent in std::vector<...>::begin() const becomes negligible.

6.2 Vectorization

The Intel Compiler also provides an optimization called vectorization, carried out by a component of the compiler called the vectorizer, which uses the SIMD instructions in the MMX™, SSE, and SSE2 instruction sets [6]. According to the user's guide, the vectorizer "detects operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8, or 16 elements in one operation, depending on the data type" [6]. As an example, the following is a piece of code that can be vectorized [6]:

    i = 0;
    while (i < n) {
        /* original loop code */
        a[i] = b[i] + c[i];
        i++;
    }

After being vectorized, the code is changed to the following, reducing the number of assignment operations by roughly a factor of four:

    /* the vectorizer generates the following two loops */
    i = 0;
    while (i < (n - n % 4)) {
        /* vector strip-mined loop */
        /* the subscript [i:i+3] denotes SIMD execution */
        a[i:i+3] = b[i:i+3] + c[i:i+3];
        i = i + 4;
    }
    while (i < n) {
        /* scalar clean-up loop */
        a[i] = b[i] + c[i];
        i++;
    }

Double precision types are not accepted for vectorization unless a Pentium® 4 processor system is used, which is what our simulation runs on; in that case the -xW compiler option needs to be used [6]. After enabling this optimization, the execution time of our simulation for the ECM salt case drops from 8m23.436s to 7m45.025s. The performance of most functions is improved, though no single function enjoys a particularly large improvement. Noticeably, std::vector<...>::size() const is still the second most time-consuming function at this stage, spending 38.24s out of the total 7m45.025s.

6.3 Profile-guided Optimizations

This Intel compiler optimization option is a two-stage process. At the first stage, the compiler produces a program that, when executed, generates a file containing information about which areas of the application are most frequently executed [6]. This test run usually takes longer to execute because it needs to collect data during execution. At the second stage, the compiler uses that file to generate a program optimized with the information contained in the data [6]. The improvement from this optimization depends on the type of input used for the test run at the first stage. If, at the first stage, we run the program with the ECM no-salt case, the test run takes 26m21.076s and the program produced at the second stage runs the simulation for the ECM salt case in 8m14.928s (which is worse than not doing this optimization at all). However, if we do the test run with the ECM salt case, the test run takes 45m27.773s and the resulting program finishes the ECM salt case in 7m5.971s.
Peculiarly, if we put both of the data files generated during the two separate test runs into the directory from which the compiler retrieves profile data, it will actually use both files and produce a faster program that, in our case, runs the simulation of the ECM salt case in 6m55.332s.

Because the test runs take so long to finish for these cases, one may question the feasibility of this optimization. We can instead use a smaller example for the test runs. If we use the meshes with 300 panels shown in Figure 8, the test run finishes in 0m8.059s for the no-salt case and 0m13.379s for the salt case. The resulting program runs the ECM salt case in 7m10.961s, which, though not as good as the 6m55.332s obtained by using ECM itself as the test-run input, is still better than the 7m45.025s obtained without profile-guided optimization, and it saves a great deal of test-run execution time.

Figure 8: A small example of meshes with 300 panels.

The performance of most functions is improved by the profile-guided optimization. The most dramatic change is that after this optimization the execution time of the function std::vector<...>::size() const becomes negligible, and this is where the majority of the improvement comes from.

7 Future Work

The most time-consuming step of the simulation is the convolution step, taking up 43.58% of the total execution time for the no-salt case and 41.7% for the salt case. As described earlier, the convolution step is essentially a multilevel Toeplitz matrix-vector product. For a grid array with m, n, and k grid points in the x, y, and z directions respectively, the matrix-vector product can be carried out by expanding the three-level Toeplitz matrix into a three-level circulant matrix, with the size of each level being 2m, 2n, and 2k, and with half of the entries in the circulant matrix being zero [7]. The matrix-vector product can thus be computed using a 3D FFT+IFFT pair of size (2m)*(2n)*(2k). In theory, each 3D FFT+IFFT pair of size N takes 5*N*log(N) floating point operations (flops), where the logarithm is base 2. However, because half of the entries in the three-level circulant matrix are zero, the actual number of flops needed is 5*N*log(N)/2 in our case. Therefore, for a grid array with m, n, and k grid points in the x, y, and z directions, the convolution step spends 5*(2m)*(2n)*(2k)/2*log[(2m)*(2n)*(2k)] flops on FFT+IFFT pairs.

For the ECM salt case, 66.80% of the time spent in the convolution step goes into computing FFT+IFFT pairs, taking up 59.14 seconds in total. One may therefore wonder whether there is potential for improvement in computing the FFT+IFFT pairs. We use the fftw library [8] to do the FFT+IFFT pairs in our program, so it is worthwhile to compare the efficiency at which we compute the flops in the FFT+IFFT pairs with the announced efficiency of the fftw library. For the ECM no-salt case, the number of grid points in the x, y, and z directions is (m, n, k) = (64, 64, 128), which corresponds to a 3D transform of size 128*128*256. The convolution step is computed 78 times (the multiplicity is due to the GMRES iterations and the multiple matrix-vector products that must be computed in equation (19)), so the total number of flops computed is

    5 * 78 * 128 * 128 * 256 * log(128*128*256) / 2 = 17994 Mflops.

Dividing by the execution time of 59.14 seconds, we obtain an efficiency of 304.25 Mflops/second. We are using an Intel® Pentium® 4 processor with a 1994 MHz clock rate that is capable of doing 2 flops per cycle, so the maximum efficiency is 3988 Mflops/second.
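For reference, the arithmetic above can be reproduced with the short program below; the grid dimensions, the 78 convolution steps, the 59.14 s FFT time, and the 1994 MHz / 2 flops-per-cycle peak are the figures quoted in the text.

    #include <cmath>
    #include <cstdio>

    int main() {
        const double m = 64, n = 64, k = 128;          /* grid points per direction */
        const double N = (2*m) * (2*n) * (2*k);        /* circulant/FFT size: 128*128*256 */
        /* 5*N*log2(N) flops per FFT+IFFT pair, halved because half the entries are zero */
        const double perPair = 5.0 * N * (std::log(N) / std::log(2.0)) / 2.0;
        const double total   = 78.0 * perPair;         /* 78 convolution steps in the whole run */
        const double rate    = total / 59.14;          /* measured FFT+IFFT time in seconds */
        const double peak    = 1994e6 * 2.0;           /* 1994 MHz, 2 flops per cycle */
        std::printf("total work       : %.0f Mflops\n", total / 1e6);        /* about 17994 */
        std::printf("achieved rate    : %.2f Mflops/s\n", rate / 1e6);       /* about 304.25 */
        std::printf("fraction of peak : %.2f %%\n", 100.0 * rate / peak);    /* about 7.6 %  */
        return 0;
    }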
It might seem that we are achieving only 304.25/3988 = 7.63% of the peak efficiency, but this is actually not far off from the announced efficiency of fftw. For a Pentium II at 300 MHz, fftw achieves about 60 Mflops/second for a 3D FFT+IFFT pair of size 128*128*128 [9], which is only 10% of the maximum efficiency of 300 MHz * 2 flops/cycle = 600 Mflops/second. The efficiency generally goes down as the transform size goes up, because memory access dominates over floating point operations for large transforms [10]. We might therefore expect the efficiency for the size of our problem (128*128*256) to be lower than 10%, and the 7.63% achieved by our program seems reasonable. Since fftw is one of the fastest libraries available, there does not seem to be much potential for improving the computation of the FFT+IFFT pairs. However, a significant amount of time (14.47% of the total execution time for the ECM no-salt case) is still spent in the other parts of the matrix-vector product algorithm, such as expanding the Toeplitz matrix into the circulant matrix, padding the input vector with zeros to make it the appropriate size, and retrieving the Toeplitz matrix-vector product from the circulant matrix-vector product. Because only programs written in certain ways can be properly vectorized by the Intel Compiler [6], and the current implementation may not have been written in such ways, there is potential for speeding up this part of the simulation by rewriting the code to take advantage of the vectorization optimization provided by the Intel Compiler.

8 Conclusion

Methods for improving the performance of a biomolecule electrostatics simulator have been presented, and its speed has been more than doubled: for the ECM no-salt case, the execution time drops from 7m41.092s for the original implementation to 3m23.521s for the profile-guided optimized implementation with both the salt and no-salt cases as test-run inputs. In extreme cases, such as the diagonally oriented bar example described in Section 5, another factor-of-3 improvement is possible. Speed plays an important role in the simulation because, as the size of the molecule grows and as the number of panels increases for higher precision, the execution time inevitably rises. In order for our simulation to have greater applicability to areas such as drug design, faster implementations and faster algorithms will always be pursued.

Bibliography

[1] Shihhsien Kuo et al. "Fast Methods for Simulation of Biomolecule Electrostatics." ICCAD 2002.

[2] Y. Saad and M. Schultz. "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems." SIAM Journal on Scientific and Statistical Computing, 7:856-869, 1986.

[3] Zhenhai Zhu. "Efficient Techniques for Wideband Impedance Extraction of Complex 3-D Structures." Master's thesis, Chapter 4, EECS Department, MIT, August 2002.

[4] M. Sanner, A. J. Olson, and J. C. Spehner. "Reduced surface: An efficient way to compute molecular surfaces." Biopolymers, 38:305-320, 1996.

[5] Gill Barequet and Sariel Har-Peled. "Efficiently approximating the minimum-volume bounding box of a point set in three dimensions." Journal of Algorithms, 38(1):91-109, January 2001.

[6] Intel® C++ Compiler User's Guide.

[7] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, second edition, 1989.

[8] http://www.fftw.org/

[9] http://www.fftw.org/benchfft/results/pii-300.html

[10] http://www.fftw.org/benchfft/doc/methodology.htm