CUDA-based Triangulations of Convolution Molecular Surfaces

Sérgio Dias, Universidade da Beira Interior, Departamento de Informática, 6201-001 Covilhã, Portugal, sdias@ubi.pt
Kuldeep Bora, Indian Institute of Technology, Department of Computer Science and Engineering, 781039 Guwahati, Assam, India, kuldeep@iitg.ernet.in
Abel Gomes, Universidade da Beira Interior, Departamento de Informática, 6201-001 Covilhã, Portugal, agomes@di.ubi.pt

ABSTRACT

Computing molecular surfaces is important to measure areas and volumes of molecules, as well as to infer useful information about their interactions with other molecules. Over the years, many algorithms have been developed to triangulate and render molecular surfaces. However, triangulation algorithms are usually very expensive in terms of memory and running time, and thus far from real-time performance. Fortunately, the massive computational power of the new generation of low-cost GPUs opens a window of opportunity to tackle these problems with cheap computing commodities. This paper presents a GPU-based algorithm to speed up the triangulation and rendering of molecular surfaces using CUDA. Our triangulation algorithm for molecular surfaces is based on a multi-threaded, parallel version of the Marching Cubes (MC) algorithm. However, the input of our algorithm is not the volume dataset of a given molecule, as usual for Marching Cubes, but the atom centers provided by the PDB file of such a molecule. We also carry out a study that compares a serial version (CPU) and a parallel version (GPU) of the MC algorithm in triangulating molecular surfaces, as a way to understand how real-time rendering of molecular surfaces can be achieved in the future.

Categories and Subject Descriptors

I.3.5 [Computer Graphics]: Computational Geometry and Object Modeling—Curve, surface, solid, and object representations; Geometric algorithms, languages, and systems; J.3 [Life and Medical Sciences]: Biology and genetics; J.2 [Physical Sciences and Engineering]: Chemistry; I.3.1 [Computer Graphics]: Hardware Architecture—Graphics processors

General Terms

Algorithms, Experimentation, Measurement, Performance

1. INTRODUCTION

Roughly speaking, a molecule is a set of atoms. Computer-generated representations of molecules go back to Levinthal's work in 1966, who first used the stick model (i.e. a line segment for each bond) to display the 3D structure of molecules on screen [15]. But, as Levinthal noted, to understand the interactions between biomolecules, the location of the molecular surface plays a more important role than the location of the bonds. In order to better understand the properties of biomolecules and their interactions, we need more sophisticated 3D models of biomolecules capable of providing us with information concerning volumes, areas, shape matching data between biomolecules, and so forth. There are three popular mathematical surface models for biomolecules in computational biology and biochemistry, namely:
• The van der Waals surface. The van der Waals (VDW) surface is just the boundary of the union of (solid) balls, each representing an atom.

• The Lee-Richards surface. The Lee-Richards surface is also known as the solvent-accessible surface (SAS) [14]. This surface considers the effect of the solvent (e.g. water) on the molecule. For water, we frequently use a solvent sphere radius of 1.4 Å. The Lee-Richards surface is the VDW surface with the van der Waals radii of the atoms inflated by the solvent sphere radius.

• The Richards-Connolly surface. This surface is also known as the solvent-excluded surface (SES), or simply the Connolly surface [7], and also considers the effect of the solvent probe sphere. In 1977, Richards defined this molecular surface as being made up of two parts: the contact surface and the reentrant surface [20]. The contact surface consists of the spherical surface patches of the atoms that are accessible to or in contact with the probe sphere; consequently, these patches are convex. The reentrant surface comes from the inward-facing surface of the solvent probe sphere when it is simultaneously in contact with two or more atoms [6]. The reentrant surface includes saddle-shaped toroidal patches and concave spherical patches. Toroidal patches are fillets tangent to two atom spheres, formed when the probe sphere rolls in contact with the two atoms over their circular intersection. The concave spherical patches are the regions where the probe is in contact with three atoms simultaneously. Following this definition of Richards' molecular surface, Connolly presented a method for calculating it [6]; hence the name Richards-Connolly surface.

The precise definition of the outer surface of a macromolecule is important for studying the structure and function of proteins and nucleic acids, because it is the surface of the molecule that binds ligands and other macromolecules. For example, drug-nucleic acid interactions require a formal geometric representation of molecular surfaces. For small molecules, both the van der Waals surface and the Lee-Richards surface provide adequate representations of the outer surface. But, for large molecules, part of these surfaces is buried in the interior. This fact led Richards and Connolly to define and construct a new surface model for biomolecules, the so-called Connolly surface [6]. Another problem with the van der Waals surface and its inflated variant, i.e. the Lee-Richards surface, is that they both present sharp crevices where atoms intersect, which poses some difficulties in predicting the shape and possible functions of molecules.

The Connolly surface constitutes a first attempt to construct a smooth molecular surface. However, Connolly's method of computing the molecular surface suffers from two deficiencies. First, it produces a self-intersecting surface [6], from which we get the Connolly surface after removing the redundant self-intersecting patches. Second, the Connolly surface generated in this manner is not smooth, because it has abrupt changes between positive and negative curvature, and some points of the surface are singular [23].

The first smooth (i.e. infinitely differentiable) molecular surface was proposed by Blinn [3]. At that time, rendering the Blinn surface was carried out on a pixel-by-pixel basis, that is, for each pixel, the nearest point on the surface was calculated using a Newton iterator, with the normal vector determined from the gradient of the surface-defining function. That is, Blinn used a kind of ray caster to render molecular surfaces.
In essence, the Blinn surface is a convolution surface of the atom centers of the molecule.

The purpose of this paper is to describe a new GPU-specific algorithm to triangulate and render Blinn surfaces. This algorithm has led to a CUDA-based parallel implementation of Marching Cubes (MC) for molecules. The main novelty of this algorithm lies in the design of a parallelized MC algorithm specifically for molecules. Besides, we carry out a comparative study of the time performance of CPU- and GPU-based triangulation algorithms to render molecular surfaces.

This paper is organized as follows. Section 2 briefly describes the related work. Section 3 introduces the mathematical formulation of smooth molecular surfaces. Section 4 briefly presents the CUDA technology. Section 5 describes our CUDA-based triangulation algorithm for Blinn surfaces. Section 6 carries out a time performance analysis and comparison between CPU- and GPU-based Marching Cubes algorithms. Finally, Section 7 summarizes the main results obtained with our CPU- and GPU-based algorithms, and points out directions for future work.

2. RELATED WORK

Depending on the input data, there are two categories of MC-based triangulation algorithms for molecular surfaces:

• Volume datasets. This is the classical input for the MC algorithm, which consists of a sliced grid of 3D pixels or voxels. In this case, given a threshold, we simply apply the MC algorithm to every single cube defined by 8 voxels, i.e., the digital vertices of an imaginary cube of the volume dataset, as usual for biomedical datasets (e.g., a skull dataset), in order to triangulate the unknown surface within such a cube. Unexpectedly, we have found only two representative algorithms of this category for molecular surfaces in the literature; the first is due to Heiden et al. [12], whereas the second was written by Tang and Newman [21]. See Lawrence and Bourke [13] for a similar approach to generating electron density isosurfaces using MC.

• Atom centers. Here, we start with the set of atom centers of a molecule, not with a discretized volume dataset. This means that the regularly spaced 3D grid of voxels (now vertices) has to be calculated before entering the MC algorithm. Merelli et al. use this technique to extract and render van der Waals surfaces, as well as Lee-Richards surfaces [17], while D'Agostino et al. use a similar approach to extract Connolly surfaces [8]. Can et al. propose an MC-based method that provides a framework for computing van der Waals, Lee-Richards, and Connolly surfaces using a distance field based on level sets [5]. Yu et al. also use atom centers and radii as inputs for their algorithm to render Blinn's molecular surfaces with a Gaussian kernel [25].

Obviously, there are several parallel implementations of the MC algorithm. To the best of our knowledge, there are only a few implementations of MC using shaders that run completely on the GPU side; see Geiss [10], Uralsky [22], and Dyken et al. [9]. The first two use Shader Model 4 (SM4). This was made possible because, unlike the previous shader models, SM4 admits the generation of triangles on the GPU side. Interestingly, most parallelized isosurfacing algorithms use the SIMD (Single Instruction, Multiple Data) architecture, and also the MIMD (Multiple Instructions, Multiple Data) architecture. See, for example, Newman et al. [18] for a good set of references on this topic. Most of these parallelized MC algorithms fall into the volume dataset category.
The only parallelized algorithms whose input is a set of ball (or atom) centers are the ones due to Uralsky [22] and D'Agostino et al. [8]. The first uses SM4, while the second uses SIMD. The input of our SIMD-based MC algorithm is also the set of atoms of a given molecule but, unlike others found in the literature, it uses the CUDA programming model from Nvidia. It renders Blinn's molecular surfaces using different atomic kernels. Note that Nvidia already provides a CUDA SDK code sample for Marching Cubes that extracts an isosurface from a volume dataset; consequently, it does not work for extracting a molecular surface from its set of atom centers. However, these two CUDA-based MC algorithms share the second, third, and fourth kernels described in Section 5.

3. SMOOTH MOLECULAR SURFACES

A molecule is a set of atoms that can be mathematically described as a union of balls, each representing an atom. In this paper, we are interested in analytical formulations of molecular surfaces that result from summing up local functions that describe the electrical field of atoms. This is so because each atom has its own electrical field, which can be described by an analytical function that decays with the distance. It is clear that there are several functions that can describe the electrical field of an atom, namely:

• Gaussian function [3].

f_i(x, y, z) = a e^{-b r^2}    (1)

where b is the standard deviation of the Gaussian curve, a is the height of the Gaussian curve, and r is the distance to the atom center.

• Soft object function [24].

f_i(x, y, z) = a \left( 1 - \frac{4 r^6}{9 b^6} + \frac{17 r^4}{9 b^4} - \frac{22 r^2}{9 b^2} \right) if r \le b, and f_i(x, y, z) = 0 if r > b    (2)

where b is the standard deviation of the curve, a is the height of the curve, and r is the distance to the center of atom i.

• Inverse squared function.

f_i(x, y, z) = \frac{C}{r_i}    (3)

where C stands for the smoothness or blobbiness parameter and r_i = (x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2 is the squared distance from the center (x_i, y_i, z_i) of atom i to a generic point (x, y, z). Typically, the parameter C takes on a value in the interval ]0, 1].

Thus, the electrical field of each atom in the molecule is formulated as a distance function, i.e. an implicit scalar function. The molecular surface is the result of the summation of the distance functions associated with all atoms, i.e. the summation of the electric fields of all the atoms, as follows:

F(x, y, z) = \sum_{i=1}^{N} f_i(x, y, z)    (4)

where N is the total number of atoms of the molecule. The atomic local functions f_i work as blending functions of the resulting molecular surface, which will hopefully be smooth. The smoothness of the molecular surface depends on the smoothness of the blending functions. Thus, the blending functions must be differentiable. A surface with these characteristics is referred to as a convolution surface [4].

A molecular surface formulated in this manner is an implicit surface that is defined by a function F with domain R^3 and range R. This implicit molecular surface is given by all the points in R^3 that satisfy the equation F(x, y, z) = T, where T is the isovalue. That is, it corresponds to the level set defined by the constant T. Thus, F > T for points inside the surface, F < T for points outside the surface, and F = T for points on the surface.
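To make the formulation concrete, the following minimal sketch evaluates the convolution field F of Equation (4) at a point using the inverse squared kernel of Equation (3). It is not the paper's code: the type Atom and the function fieldAt are our own names, and the function is written so that it compiles both as plain C and as CUDA device code.

typedef struct { float x, y, z; } Atom;   /* atom center read from the PDB file */

#ifdef __CUDACC__
__host__ __device__
#endif
static float fieldAt(float x, float y, float z,
                     const Atom *atoms, int nAtoms, float C)
{
    float F = 0.0f;
    for (int i = 0; i < nAtoms; ++i) {
        float dx = x - atoms[i].x;
        float dy = y - atoms[i].y;
        float dz = z - atoms[i].z;
        float r2 = dx * dx + dy * dy + dz * dz;  /* squared distance r_i */
        if (r2 > 0.0f)
            F += C / r2;                         /* f_i = C / r_i, Equation (3) */
    }
    return F;  /* the molecular surface is the level set F(x, y, z) = T */
}

A point is then classified as being inside, on, or outside the surface by comparing the returned value with the isovalue T.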
4. GPU PROGRAMMING USING CUDA

We decided to run our program on the GPU because our algorithm fits the parallel computing model very well. In fact, the voxelization of the axis-aligned bounding box that encloses the molecule allows us to speed up the program by assigning a thread to each cube of the bounding box. So, before proceeding any further, let us first review the most important details of the CUDA technology and GPU programming.

4.1 CUDA Architecture

CUDA (Compute Unified Device Architecture) gives data-intensive applications access to the processing power of Nvidia graphics processing units (GPUs) through a general-purpose parallel computing architecture [19]. In our work, we have used an Nvidia GeForce GTX 280 graphics card. This graphics card includes 10 thread processing clusters (TPCs), where each TPC consists of 3 streaming multiprocessors (SMs), with each SM having 8 stream processors, which yields 240 cores in total. Each core processes threads; hence the parallel computing model on the GPU. To control and distribute the massive computing load over the 240 cores, a thread scheduler is used to guarantee a nearly full utilization of the GPU [2].

Figure 1: GPU architecture [2].

4.2 CUDA Programming Model

When programming in CUDA, we must take care of how a program is going to be executed. In this paper, a CUDA program is written in C, in which we have CPU-specific functions and GPU-specific functions. A CPU-specific function runs on the CPU (host), while a GPU-specific function is executed on the GPU (device), as shown in Figure 1.

There are three types of GPU-specific qualifiers. The first, named __global__, acts as a prefix of functions that are GPU kernels, that is, GPU functions that are invoked from CPU-specific code. The second, named __device__, qualifies a GPU function that can only be called from a GPU kernel. Finally, the third qualifier is the keyword __shared__, which acts as a prefix of a variable allocated in the shared memory of a streaming multiprocessor. Note that GPU-specific functions are not allowed to be recursive.

There are three other important CUDA functions to (dis)allocate memory on the GPU side and to exchange data between CPU and GPU. The cudaMalloc function is used to allocate memory on the GPU, while the cudaFree function releases memory space on the GPU. The cudaMemcpy function serves to transfer data from the CPU side to the GPU side, and vice versa. Note that a kernel is a GPU-specific function that launches threads, hierarchically organized into grids, blocks, and threads [1, 19], as illustrated in Figure 1.

The compilation of a CUDA program is done in three stages. The first extracts the CPU code from the CUDA file into an intermediate file that is then passed to the standard C compiler. Next, the remaining code, that is, the GPU code, is converted to a PTX file (i.e., a kind of assembly language). The third stage translates this PTX file into GPU-specific commands and encapsulates them in an executable file.
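As an illustration of this host/device pattern, the following self-contained toy example (our own, not one of the kernels of Section 5) allocates a voxel intensity array on the GPU, launches a __global__ kernel over 128-thread blocks to zero it, and copies the result back to the host:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void initField(float *field, int nVoxels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per voxel */
    if (i < nVoxels)
        field[i] = 0.0f;                             /* initial intensity F = 0 */
}

int main(void)
{
    const int nVoxels = 1 << 20;
    float *dField = NULL;
    cudaMalloc((void **)&dField, nVoxels * sizeof(float));       /* GPU allocation */

    int threadsPerBlock = 128;
    int blocksPerGrid = (nVoxels + threadsPerBlock - 1) / threadsPerBlock;
    initField<<<blocksPerGrid, threadsPerBlock>>>(dField, nVoxels);  /* kernel launch */
    cudaDeviceSynchronize();

    float *hField = (float *)malloc(nVoxels * sizeof(float));
    cudaMemcpy(hField, dField, nVoxels * sizeof(float),
               cudaMemcpyDeviceToHost);                           /* device-to-host copy */
    printf("field[0] = %f\n", hField[0]);

    free(hField);
    cudaFree(dField);                                             /* release GPU memory */
    return 0;
}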
5. TRIANGULATING BLINN SURFACES

Amongst all the available triangulation algorithms used to render molecular surfaces, the Marching Cubes algorithm is possibly the most suitable for GPU processing. This is justified by the fact that MC leads to a uniform partition of the bounding box into cubes (defined by 8 vertices or voxel centers), which is ideal for the design of a parallel implementation on the GPU.

5.1 Marching Cubes: Overview

This algorithm starts by dividing the space that contains the molecule into a uniform grid of cubes of appropriate size. The next step is the calculation of the electrical field intensity F at all cube vertices (or voxels) of the grid (Equation 4). Doing this involves very time-consuming computations, because molecules may have long chains of atoms, and the spatial distribution of these atoms often requires very large grids, even for molecules with a small number of atoms. To deal with this problem, our algorithm calculates the intensity on a per-atom basis, instead of calculating it at each grid vertex directly. Since the influence of the field can be neglected beyond some appropriate distance, this optimization can be used in general for intensity calculations. Next, we calculate the position of the cube for each atom, and then we define a sub-grid centered on such an atom/cube. The intensity contribution of this atom is added to the current intensity of each sub-grid vertex.

Once the computation of the intensity F for all grid vertices has finished, we have to determine the 8-bit flag for each cube. This flag characterises the surface inside each cube. The flag bits form an index to a lookup table to determine which edges of each cube intersect the isosurface. If F ≥ T at a grid vertex, the corresponding flag bit is set to 1; otherwise, it is set to 0. After setting the flag bits for each cube, the algorithm uses the lookup table to determine which of the 256 cases of surface fits inside the cube. When this is done, the algorithm goes a step further to interpolate the surface vertices, for each cube, along the appropriate edges, by using the lookup tables. After computing the surface vertices for all cubes, we are ready to generate the triangles inside each cube, which together form the final mesh that approximates the molecular surface. It is clear that rendering this surface mesh requires the calculation of the triangle normals. The reader is referred to Lorensen and Cline [16] for further details on MC.
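The interpolation step mentioned above places each surface vertex on a cube edge whose endpoint intensities straddle the isovalue. A minimal sketch of this standard MC interpolation follows; the helper name interpolateVertex and the type Point3 are ours, not the paper's.

typedef struct { float x, y, z; } Point3;

#ifdef __CUDACC__
__host__ __device__
#endif
static Point3 interpolateVertex(Point3 p1, Point3 p2, float F1, float F2, float T)
{
    /* Linear interpolation along the edge p1-p2, where F1 and F2 are the
       field intensities at the two endpoints and T is the isovalue; a crossed
       edge guarantees F1 != F2. */
    float t = (T - F1) / (F2 - F1);
    Point3 v;
    v.x = p1.x + t * (p2.x - p1.x);
    v.y = p1.y + t * (p2.y - p1.y);
    v.z = p1.z + t * (p2.z - p1.z);
    return v;
}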
5.2 GPU Implementation

We use CUDA to implement a variant of the Marching Cubes (MC) algorithm for molecular surfaces. All MC computations are carried out on the GPU. The algorithm can be described as follows:

• Bounding Box Computation (CPU side). We first determine the axis-aligned bounding box that encloses a given molecule while reading in the centers of its atoms from a PDB file into a 1-dimensional array of size M, where M is the number of atoms of the molecule. Computing the bounding box here means computing the locations of the two opposite vertices of its diagonal.

• Cube Decomposition (CPU side). Next, we determine the number N = I × J × K of cubes of a predefined size that fit the bounding box, where I, J, and K stand for the number of cubes along the x-, y-, and z-axes, respectively.

• Allocation and Copying of Data Structures to the GPU. This step involves the allocation of GPU memory for ten 1-dimensional arrays, among them the array of M atom centers and an array of N integers, one integer (identifier) per cube, later used to associate each cube with a single thread. These allocation and copy operations are carried out by calling the cudaMalloc and cudaMemcpy functions from the CPU side. These two functions, together with the cudaBindTexture function, are also used to allocate and copy the lookup tables of the MC algorithm to GPU texture memory, because these tables remain unchanged. The aforementioned array of N integers serves the purpose of storing the number of vertices of the triangulation of the molecular surface patch inside each cube, as needed in a later step.

• Computation of Electric Field Intensities at Voxels (1st GPU kernel). This kernel computes the intensity of the electric field at each voxel (i.e., cube vertex); this value is initialized to zero at each voxel. Because the number of cubes largely exceeds the number of atoms, the intensity computations are carried out per atom rather than per cube. This is reinforced by the fact that some biological molecules have long separate chains of carbon atoms (see Figure 6(n)), which implies that the spatial distribution of these atoms originates very large bounding boxes even for molecules with a small number of atoms. The per-atom intensity computation only occurs within a boxed neighborhood around each atom, i.e. at the voxels surrounding each atom center. The local box centered at each atom usually has 20 × 20 × 20 voxels. Thus, taking into consideration the Gaussian-like behavior of the local function f_i of atom i, we are assuming that f_i takes on the value 0 beyond the surrounding box of the atom. In other words, the intensity contribution f_i of atom i at each voxel of its surrounding box is added to the current intensity value F of that voxel. The parallelization of this intensity computation is accomplished by assigning a single thread to each atom, so we end up with as many threads as atoms running simultaneously.

• Computation of MC Configurations for Cubes (2nd GPU kernel). After computing the intensities F for all voxels, we determine the 8-bit flag (one bit per voxel) that corresponds to 1 out of the 256 MC surface patterns we may find inside each cube. A bit has the value 0 if the corresponding voxel has an intensity under the pre-defined isovalue, and 1 if the intensity is over the isovalue. This 8-bit flag works as an index to a lookup table for determining which edges of the cube intersect the molecular isosurface (a simplified sketch of this classification step is given after this list). The parallelization of this process is done by running a number of simultaneous blocks of 128 threads each, where each thread computes the 8-bit flag of a single cube. The number of blocks depends on the number of cubes of the bounding box enclosing the molecule. The flags associated with the cubes are stored into a specific 1-dimensional array. Each flag is used to fill in the aforementioned array of N integers, which stores the number of vertices of the triangulation inside the corresponding cube.

• Allocation of the VBO Array for Posterior Rendering of the Surface Mesh (3rd GPU kernel). By performing a Parallel Prefix Sum (scan) [11] of the number of mesh vertices in the array of N integers of the previous step, we determine the total number of vertices that sample the molecular surface, from which we triangulate the surface. Recall that this triangulation will be done locally inside each cube, as usual in the MC algorithm. This Parallel Prefix Sum operation is carried out by calling the function cudppScan, which is essentially another GPU kernel. The number of vertices output by cudppScan is then used to allocate a VBO (Vertex Buffer Object) array for the mesh vertices, from which we can later generate triangles to render the mesh that approximates the molecular surface.
• Setting up a Mapping Array Between the Array of N Integers and the VBO Array (4th GPU kernel). This kernel creates an intermediate array of N elements between the array of N integers (recall that each integer yields the number of vertices of the surface crossing each cube) and the corresponding VBO array, which allows us to map a given cube (and its surface vertices) to its first surface vertex in the VBO array. That is, for each cube, the mapping array stores the index of the VBO array element where we will later store the first surface vertex associated with that cube. This allows us to create an association between the occupied cubes (i.e., cubes traversed by the molecular surface) and the correct positions in the VBO array. Note that the intermediate mapping array is necessary because the threads run in parallel.

• Computation of Surface Vertices by Linear Interpolation (5th GPU kernel). This GPU kernel uses linear interpolation along the cube edges that intersect the molecular surface in order to sample the surface. The lookup tables are here useful to tell us which edges intersect the surface. The resulting sampling points of the surface will be the triangle vertices of the mesh that approximates the surface. These surface points are computed for each cube and are stored into the VBO array using the previous mapping array.

• Computation of Surface Normals at the Vertices in the VBO Array (6th GPU kernel). Taking into account that the vertices stored in the VBO array are points of the molecular surface, the normal at a given surface point is given by the gradient vector as follows:

\nabla F = \left( \frac{\partial F}{\partial x}, \frac{\partial F}{\partial y}, \frac{\partial F}{\partial z} \right)    (5)

where

\frac{\partial F}{\partial x} = \sum_{i=1}^{N} \frac{-2 C (x - x_i)}{[(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2]^2}

\frac{\partial F}{\partial y} = \sum_{i=1}^{N} \frac{-2 C (y - y_i)}{[(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2]^2}

and

\frac{\partial F}{\partial z} = \sum_{i=1}^{N} \frac{-2 C (z - z_i)}{[(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2]^2}

These normals are stored in a separate array.

• Rendering the Molecular Surface (CPU side). Taking into consideration that the VBO array in the GPU can be accessed from the CPU side, we only have to display the VBOs to render the surface. We use the Gouraud shading that comes with OpenGL.
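The following CUDA sketch illustrates the classification step of the 2nd kernel referred to above. It is a simplified version with our own identifier names: the 256-entry table of vertex counts is passed as a plain device pointer (the actual implementation binds the MC lookup tables to texture memory), and the corner-bit ordering is arbitrary and must match whatever MC tables are used.

#include <cuda_runtime.h>

__global__ void classifyCubes(const float *field,                /* intensity F per voxel */
                              unsigned char *cubeFlags,          /* 8-bit MC case per cube */
                              unsigned int *numVerts,            /* vertex count per cube */
                              const unsigned int *numVertsTable, /* 256-entry lookup table */
                              int I, int J, int K,               /* cubes along x, y, z */
                              float isoValue)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per cube */
    if (c >= I * J * K) return;

    /* Cube coordinates from the linear index; the voxel grid has
       (I+1) x (J+1) x (K+1) vertices. */
    int x = c / (J * K);
    int y = (c / K) % J;
    int z = c % K;
    int vJ = J + 1, vK = K + 1;

    unsigned char flag = 0;
    for (int b = 0; b < 8; ++b) {                    /* visit the 8 cube corners */
        int cx = x + ((b >> 2) & 1);
        int cy = y + ((b >> 1) & 1);
        int cz = z + (b & 1);
        if (field[(cx * vJ + cy) * vK + cz] >= isoValue)
            flag |= (unsigned char)(1 << b);         /* corner is inside the surface */
    }
    cubeFlags[c] = flag;                             /* index into the MC tables */
    numVerts[c]  = numVertsTable[flag];              /* vertices to be generated here */
}

The per-cube vertex counts written to numVerts are exactly what the subsequent parallel prefix sum consumes in order to size the VBO array.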
6. RESULTS

In our tests, we have used a Windows XP PC equipped with a Quad Core Q9550 running at 2.83 GHz and 4 GB of RAM, and an Nvidia GeForce GTX 280 CUDA-programmable graphics card. We have written two programs for the same algorithm:

• CPU-based sequential program. This program was written in C++, and makes use of the STL (Standard Template Library).

• CUDA-based parallel program. This program was written in C, and makes use of the CUDA API v2.0.

Both programs use single precision floating-point numbers. We have used 16 different molecules, read in from the corresponding .pdb files, to compare the time performance of both algorithms, as shown in Tables 1 and 2. These .pdb files are ASCII files and were taken from the Protein Data Bank (PDB) at www.rcsb.org, which is a repository for 3D structural data of molecules. Each molecule has a unique ID (see the first column of both Tables 1 and 2).

The time performance results output by both programs are presented in Tables 1 and 2. In both tables we have columns for the number of atoms (# Atoms) and the number of cubes (# Cubes). The remaining columns concern the time performance (in seconds) of the algorithm using 3 different local blending functions.

6.1 CPU Performance

As shown in Table 1, the inverse squared function is computationally less expensive than the other two blending functions, because it involves a smaller number of arithmetic operations. In addition, the soft object function performs better than the Gaussian function because it is a truncated Taylor expansion of the exponential used in the Gaussian function.

ID     # Atoms  Gaussian (sec)  Soft Object (sec)  Inverse squared (sec)  # Cubes
110D   120      0.68            0.45               0.43                   346580
200D   259      1.90            1.40               1.20                   603520
1QL1   322      2.79            2.14               1.90                   1581370
4PTI   381      4.89            3.70               3.00                   1043464
1BK2   468      4.89            3.70               3.00                   650160
2QZF   479      5.55            4.30               3.60                   3536664
2QZD   507      6.10            4.70               4.00                   3860800
2OT5   545      6.00            4.80               4.14                   1154032
1HH0   691      6.40            5.50               4.46                   1632000
1HGV   691      6.56            5.50               4.57                   1970752
1HGZ   691      6.30            5.45               4.36                   1605760
2INS   781      11.10           9.12               7.30                   1316250
1QL2   966      16.10           13.90              11.50                  4619920
1IZH   1521     37.40           31.20              25.10                  2578680
1GT0   2746     102.00          94.14              67.87                  5681590
1BIJ   4387     250.50          240.20             179.10                 7229120

Table 1: CPU times (in seconds).

6.2 GPU Performance

The time performance results obtained using the GPU-based algorithm for molecular surfaces are presented in Table 2. As shown in Table 2, there are no significant differences in rendering any molecular surface using distinct blending functions. This happens because of the parallelization taking place in the GPU kernels.

ID     # Atoms  Gaussian (sec)  Soft Object (sec)  Inverse squared (sec)  # Cubes
110D   120      0.13            0.13               0.13                   352000
200D   259      0.27            0.27               0.26                   610176
1QL1   322      0.38            0.37               0.37                   1600896
4PTI   381      0.41            0.40               0.40                   1053440
1BK2   468      0.36            0.35               0.35                   657920
2QZF   479      0.88            0.72               0.72                   3562624
2QZD   507      0.94            0.93               0.94                   3901824
2OT5   545      0.56            0.53               0.51                   1167360
1HH0   691      0.73            0.69               0.67                   1652608
1HGV   691      0.89            0.74               0.70                   1994752
1HGZ   691      0.68            0.65               0.64                   1626112
2INS   781      0.84            0.82               0.82                   1326976
1QL2   966      2.10            2.10               2.00                   4654208
1IZH   1521     2.40            2.22               2.15                   2601472
1GT0   2746     7.52            7.54               6.90                   5720832
1BIJ   4387     15.00           14.40              14.20                  7267456

Table 2: GPU times (in seconds).

Also, comparing the CPU (Table 1) and GPU (Table 2) results, we note slight differences in the number of cubes. The rationale behind these differences is the following:

• Distribution of the GPU processing load. The number of threads per block (in a grid of thread blocks) is a multiple of 2; we have used 128 threads per block. This means that the number of cubes of the bounding box that encloses the molecule must also be a multiple of two, because each cube is processed by a single thread.

• Cube indexing. We use the formula (Bx × By × Bz) for the CPU implementation, that is, 3D arrays. However, taking into account that GPU 3D arrays were not available at the starting time of this project, we had to use GPU 1D arrays to store cube data, with (Bx × By × Bz) + (By × Bz) + (Bz) as the corresponding indexing formula, where Bx, By, and Bz are the sizes of the bounding box along each axis (see the indexing sketch further below).

As expected, looking at Tables 1 and 2, we note that the GPU-based implementation is far faster than the CPU implementation because all cubes are processed at the same time, with each cube being processed by a single thread. Due to this, we obtained a maximum gain of 14x for the 1BIJ molecule (last entry in Tables 1 and 2), depicted in Figure 6(q).

Figure 2: CPU runtime performance.

Figure 3: GPU runtime performance.

Figures 2 and 3 show the experimental time complexity of the CPU- and GPU-based algorithms, respectively. The CPU-based algorithm seems to have quadratic running time complexity, because the graph approximates a parabola. On the other hand, the GPU-based algorithm seems to have linear running time complexity because, apart from the unstable behaviour for small molecules (a consequence of the non-uniform distribution of atoms), the graph approximates a line.
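Referring back to the cube indexing item above, the 1D storage of the cube grid can be realized with a standard row-major linearization. The helper below, with our own name cubeIndex, is only an illustration; the exact formula used in the original implementation may differ slightly.

#ifdef __CUDACC__
__host__ __device__
#endif
static inline int cubeIndex(int x, int y, int z, int By, int Bz)
{
    /* Maps the 3D cube coordinates (x, y, z) of a Bx x By x Bz grid
       to an index into a 1D array of Bx * By * Bz cubes. */
    return x * (By * Bz) + y * Bz + z;
}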
These results are also due to the kernel optimization allowed by the occupancy calculator provided by Nvidia. This calculator allows us to know in advance the multiprocessor occupancy of a GPU for a certain kernel. This way, we can maximize the number of threads that are supported by a multiprocessor, which has a set of W registers available for each kernel. This optimization ensures that our program does not fail to launch a kernel, simply because we have the guarantee that no more than W registers will be used, thus getting the maximum performance out of the GPU.

To use the occupancy calculator, we must fill in some parameters in a spreadsheet provided by Nvidia. These parameters are obtained from the code compilation and are stored in an output file. By analyzing the content of this file, we know how many registers, how much local memory, and how much shared memory a kernel will actually require at runtime. Putting these values into the occupancy calculator shows us the occupancy rate of each multiprocessor. For example, in Figure 4, we have the parameters for the 5th GPU kernel.

Figure 4: The parameters for the 5th GPU kernel.

The parameter configuration shown in Figure 4 involves a multi-step procedure. First, we must choose the correct compute capability version of our GPU (1.3 in our case) in the green box. Then, we introduce the three resource usage parameters in the orange box; the first field concerns the number of threads per block, the second indicates the number of registers per thread, while the third stands for the amount of shared memory (in bytes) per block. This concludes the configuration procedure, with the relevant data appearing automatically in the remaining four boxes, as illustrated in Figures 4 and 5.

Figure 5: Graphs of the occupancy calculation.

With respect to the 5th kernel, we got a maximum multiprocessor occupancy of 63% by varying the three parameters mentioned above in the orange box. The other kernels have about 100% multiprocessor occupancy, as shown in Table 3.

Kernel       Threads per Block  Registers per Thread  Shared Memory per Block (bytes)  Occupancy of each Multiprocessor
1st kernel   1                  18                    140                              100%
2nd kernel   128                8                     100                              100%
3rd kernel   128                8                     116                              100%
4th kernel   128                3                     60                               100%
5th kernel   128                25                    150                              63%
6th kernel   128                16                    148                              100%

Table 3: CUDA GPU occupancy calculator.

As Table 3 shows, the occupancy of each multiprocessor for our algorithm depends on the number of registers that each kernel uses. This is so because each kernel has its own processing load, which depends on the type of code instructions present in each kernel. This means that, when programming with the CUDA API, we must be careful when using loop instructions (i.e. for, do-while, and while), because these instructions occupy many registers, decreasing the efficiency of a kernel. Thus, it is good practice to reduce the number of variables to a minimum, and to avoid using loop instructions as much as possible.
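For reference only (the measurements above were obtained with Nvidia's spreadsheet, not with this code), more recent CUDA runtimes expose the same occupancy calculation programmatically. A minimal sketch, reusing the toy initField kernel from the earlier example so that it is self-contained:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void initField(float *field, int nVoxels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nVoxels) field[i] = 0.0f;
}

int main(void)
{
    int activeBlocksPerSM = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    /* Maximum number of resident blocks of initField per multiprocessor,
       for 128 threads per block and no dynamic shared memory. */
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocksPerSM,
                                                  initField, 128, 0);

    float occupancy = (float)(activeBlocksPerSM * 128) /
                      (float)prop.maxThreadsPerMultiProcessor;
    printf("initField occupancy: %.0f%%\n", occupancy * 100.0f);
    return 0;
}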
7. CONCLUSIONS

As is well known, biologists and biochemists have long used rigid three-dimensional models to make clear the structure of molecules, which are commonly modeled either as stick diagrams that emphasise the covalent bonds among atoms, or as space-filling diagrams that describe the space they occupy. In this respect, surfaces play a fundamental role in molecular functions, because chemical-physical actions are driven by their mechanical and electrostatic properties.

In this paper we have presented a CUDA-based method to render smooth molecular surfaces on the GPU. The main contribution of this work is a multi-threaded, GPU-based implementation of the Marching Cubes algorithm to triangulate and render molecular surfaces, with all the particularities of computing the surface locally in the neighborhood of each atom. Also, a comparison with a CPU-based implementation of Marching Cubes has been carried out for a number of different molecules. Besides, taking into consideration the experimental linear time complexity of the GPU-based algorithm, we hope in the near future to design and implement a scalable parallel algorithm using a cluster of GPUs in order to process and render large molecules (i.e. with hundreds of thousands of atoms) in real time.

The source code is available at: http://www.di.ubi.pt/~agomes/software/molcuda.zip

8. REFERENCES

[1] CUDA Programming Guide 2. Available at: http://developer.download.nvidia.com, 2008.
[2] Technical Brief: NVIDIA GeForce GTX 200 GPU Architectural Overview. Available at: http://developer.download.nvidia.com, 2008.
[3] J. F. Blinn. A generalization of algebraic surface drawing. ACM Trans. Graph., 1(3):235–256, July 1982.
[4] J. Bloomenthal and K. Shoemake. Convolution surfaces. Computer Graphics, 25(4):251–256, 1991.
[5] T. Can, C.-I. Chen, and Y.-F. Wang. Efficient molecular surface generation using level-set methods. Journal of Molecular Graphics and Modelling, 25:442–454, 2006.
[6] M. Connolly. Analytical molecular surface calculation. Journal of Applied Crystallography, 16(5):548–558, October 1983.
[7] M. Connolly. Solvent-accessible surfaces of proteins and nucleic acids. Science, 221(4612):709–713, August 1983.
[8] D. D'Agostino, I. Merelli, A. Clematis, L. Milanesi, and A. Orro. A parallel workflow for the reconstruction of molecular surfaces. In C. Bischof, M. Bücker, P. Gibbon, G. Joubert, T. Lippert, B. Mohr, and F. Peters, editors, Parallel Computing: Architectures, Algorithms and Applications, volume 15 of Advances in Parallel Computing, pages 147–154. IOS Press, 2008.
[9] C. Dyken, G. Ziegler, C. Theobalt, and H.-P. Seidel. High-speed marching cubes using histopyramids. Computer Graphics Forum, 27(8):2028–2039, 2008.
[10] R. Geiss. Generating complex procedural terrains using the GPU. In H. Nguyen, editor, GPU Gems 3. Addison-Wesley Professional, New Jersey, USA, 2007.
[11] M. Harris, S. Sengupta, and J. D. Owens. Parallel prefix sum (scan) with CUDA. In H. Nguyen, editor, GPU Gems 3. Addison-Wesley Professional, Upper Saddle River, New Jersey, Aug 2007.
[12] W. Heiden, T. Goetze, and J. Brickmann. Fast generation of molecular surfaces from 3D data fields with an enhanced "marching cube" algorithm. Journal of Computational Chemistry, 14(2):246–250, 1993.
[13] M. C. Lawrence and P. Bourke. CONSCRIPT: a program for generating electron density isosurfaces for presentation in protein crystallography. Journal of Applied Crystallography, 33:990–991, 2000.
[14] B. Lee and F. Richards. The interpretation of protein structures: Estimation of static accessibility. Journal of Molecular Biology, 55(3):379–400, February 1971.
[15] C. Levinthal. Molecular model-building by computer. Scientific American, 214(6):42–52, June 1966.
[16] W. Lorensen and H. Cline. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, July 1987.
[17] I. Merelli, L. Milanesi, D. D'Agostino, A. Clematis, M. Vanneschi, and M. Danelutto. Marching cubes: A high resolution 3d surface construction algorithm. In Proceedings of the 1st International Conference on Distributed Frameworks for Multimedia Applications (DFMA'05), pages 288–294, Washington, DC, USA, 2005. IEEE Computer Society.
[18] T. Newman, J. Byrd, P. Emani, A. Narayanan, and A. Dastmalchi. High performance SIMD marching cubes isosurface extraction on commodity computers. Computers & Graphics, 28:213–233, 2004.
[19] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.
[20] F. Richards. Areas, volumes, packing, and protein structure. Annual Review of Biophysics and Bioengineering, 6:151–176, 1977.
[21] N. Tang and T. Newman. A vector-parallel realization of the marching cubes. Technical Report TR-UAH-CS-1996-01, Computer Science Department, The University of Alabama in Huntsville, USA, 1996.
[22] Y. Uralsky. DX10: Practical metaballs and implicit surfaces. In Game Developers Conference, 2006.
[23] Y. Vorobjev and J. Hermans. SIMS: Computation of a smooth invariant molecular surface. Biophysical Journal, 73(2):722–732, 1997.
[24] G. Wyvill, C. McPheeters, and B. Wyvill. Data structure for soft objects. The Visual Computer, 2(4):227–234, February 1986.
[25] Z. Yu, M. Holst, Y. Cheng, and J. McCammon. Feature-preserving adaptive mesh generation for molecular shape modeling and simulation. Journal of Molecular Graphics and Modelling, 26:1370–1380, 2008.

Figure 6: Examples of molecular surfaces displayed using the multi-threaded Marching Cubes algorithm: (a) PDB ID 110d, (b) PDB ID 200d, (c) PDB ID 1ql1, (d) PDB ID 4pti, (e) PDB ID 1bk2, (f) PDB ID 2qzf, (g) PDB ID 2qzd, (h) PDB ID 2ot5, (i) PDB ID 1hh0, (j) PDB ID 1hgv, (l) PDB ID 1hgz, (m) PDB ID 2ins, (n) PDB ID 1ql2, (o) PDB ID 1izh, (p) PDB ID 1gt0, (q) PDB ID 1bij.