CUDA-based Triangulations of Convolution Molecular Surfaces
Sérgio Dias
Universidade da Beira Interior, Departamento de Informática, 6201-001 Covilhã, Portugal
sdias@ubi.pt

Kuldeep Bora
Indian Institute of Technology, Department of Computer Science and Engineering, 781039 Guwahati, Assam, India
kuldeep@iitg.ernet.in

Abel Gomes
Universidade da Beira Interior, Departamento de Informática, 6201-001 Covilhã, Portugal
agomes@di.ubi.pt
ABSTRACT
Computing molecular surfaces is important for measuring areas and volumes of molecules, as well as for inferring useful information about interactions with other molecules. Over the years, many algorithms have been developed to triangulate and to render molecular surfaces. However, triangulation algorithms are usually very expensive in terms of memory storage and time performance, and thus far from real-time performance. Fortunately, the massive computational power of the new generation of low-cost GPUs opens up an opportunity to solve these problems: real-time performance on cheap computing commodities. This paper presents a GPU-based algorithm to speed up the triangulation and rendering of molecular surfaces using CUDA. Our triangulation algorithm for molecular surfaces is based on a multi-threaded, parallel version of the Marching Cubes (MC) algorithm. However, the input of our algorithm is not the volume dataset of a given molecule, as usual for Marching Cubes, but the atom centers provided by the PDB file of such a molecule. We also carry out a study that compares a serial version (CPU) and a parallel version (GPU) of the MC algorithm in triangulating molecular surfaces, as a way to understand how real-time rendering of molecular surfaces can be achieved in the future.
Categories and Subject Descriptors

I.3.5 [Computer Graphics]: Computational Geometry and Object Modeling - Curve, surface, solid, and object representations; Geometric algorithms, languages, and systems; J.3 [Life and Medical Sciences]: Biology and genetics; J.2 [Physical Sciences and Engineering]: Chemistry; I.3.1 [Computer Graphics]: Hardware Architecture - Graphics processors
General Terms
Algorithms, Experimentation, Measurement, Performance
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
HPDC'10, June 20–25, 2010, Chicago, Illinois, USA.
Copyright 2010 ACM 978-1-60558-942-8/10/06 ...$10.00.
1. INTRODUCTION

Roughly speaking, a molecule is a set of atoms. Computer-generated representations of molecules go back to Levinthal's work in 1966, which first used the stick model (i.e., a line segment for each bond) to display the 3D structure of molecules on screen [15]. But, as Levinthal noted, to understand the interactions between biomolecules, the location of the molecular surface plays a more important role than the location of the bonds. In order to better understand the properties of biomolecules and their interactions, we need more sophisticated 3D models of biomolecules capable of providing information concerning volumes, areas, shape-matching data between biomolecules, and so forth. There are three popular mathematical surface models for biomolecules in computational biology and biochemistry, namely:
• The van der Waals surface. The van der Waals (VDW)
surface is just the boundary of the union of (solid)
balls, each representing an atom.
• The Lee-Richards Surface. The Lee-Richards surface is
also known as the solvent-accessible surface (SAS) [14].
This surface considers the effect of the solvent (e.g.
water) on the molecule. For water, we frequently use
a solvent sphere radius of 1.4 Å. The Lee-Richards
surface is the VDW surface with the van der Waals
radii of atoms inflated by the solvent sphere radius.
• The Richards-Connolly Surface. This surface is also
known as the solvent-excluded surface (SES), or simply the Connolly surface [7], and also considers the
effect of the solvent probe sphere. In 1977, Richards
defined this molecular surface as being made up of
two parts: the contact surface and the reentrant surface [20]. The contact surface is composed of the spherical surface patches of the atoms that are accessible to or in contact with the probe sphere; consequently, these patches are convex. The reentrant surface comes
from the inward-facing surface of the solvent probe
sphere when it is simultaneously in contact with two
or more atoms [6]. The reentrant surface includes
saddle toroidal patches and concave spherical patches.
Toroidal patches are fillets tangent to two atom spheres
formed when the probe sphere rolls in contact with the
two atoms over their circular intersections. The concave spherical patches are the regions where the probe
is in contact with three atoms simultaneously. Following this definition of Richards’ molecular surface, Connolly presented a method for calculating it [6]; hence
the Richards-Connolly surface.
The precise definition of the outer surface of a macromolecule
is important for studying the structure and function of proteins
and nucleic acids because it is the surface of the molecule
that binds ligands and other macromolecules. For example,
drug-nucleic acid interactions require a formal geometrical
representation of molecular surfaces. For small molecules,
both the van der Waals and Lee-Richards surfaces provide adequate representations of the outer surface. But, for
large molecules, part of these surfaces is buried in the interior. This fact led Richards and Connolly to define and construct a new surface model for biomolecules, the so-called
Connolly surface [6].
Another problem with the van der Waals surface and its inflated variant, the Lee-Richards surface, is that they both present sharp crevices where atoms intersect, which poses some difficulties in predicting the shape and possible functions of molecules. The Connolly surface constitutes a first attempt to construct a smooth molecular surface. However, Connolly's method of computing the molecular surface suffers from two deficiencies. First, it produces a self-intersecting surface [6], from which we get the Connolly surface after removing the redundant self-intersecting patches.
Second, the Connolly surface generated in this manner is
not smooth because it has abrupt changes between positive
and negative curvature, and some points of the surface are
singular [23].
The first smooth (i.e., infinitely differentiable) molecular surface was proposed by Blinn [3]. At that time, rendering the Blinn surface was carried out on a pixel-by-pixel basis, that is, for each pixel, the nearest point on the surface was calculated using a Newton iterator, with the normal vector determined from the gradient of the surface-defining function. In other words, Blinn used a kind of ray caster to render molecular surfaces. In essence, the Blinn surface is a convolution surface of the atom centers of the molecule.
The purpose of this paper is to describe a new GPU-specific algorithm to triangulate and render Blinn surfaces.
This algorithm has led to a CUDA-based parallel implementation of Marching Cubes (MC) for molecules. The main
novelty of this algorithm lies in the design of a parallelized
MC algorithm specifically for molecules. Besides, we carry
out a comparative study of the time performance between
CPU- and GPU-based triangulation algorithms to render
molecular surfaces.
This paper is organized as follows. Section 2 briefly describes the related work. Section 3 introduces the mathematical formulation of smooth molecular surfaces. Section 4
briefly presents the CUDA technology. Section 5 describes
our CUDA-based triangulation algorithm for Blinn surfaces.
Section 6 carries out a time performance analysis and comparison between CPU- and GPU-based marching cubes algorithms. Finally, Section 7 summarizes the main results
obtained with our CPU- and GPU-based algorithms, and
points out directions for future work.
2. RELATED WORK
Depending on the input data, there are two categories of
MC-based triangulation algorithms for molecular surfaces:
• Volume datasets. This is the classical input for the MC
algorithm, which consists of a sliced grid of 3D pixels
or voxels. In this case, given a threshold, we simply
apply the MC algorithm to every single cube defined by
8 voxels, i.e., the digital vertices of an imaginary cube
of the volume dataset, as usual for biomedical data
sets (e.g., a skull data set) in order to triangulate the
unknown surface within such a cube. Unexpectedly, we
have only found two representative algorithms of this
category for molecular surfaces in the literature; the
first is due to Heiden et al. [12], whereas the second was written by Tang and Newman [21]. See Lawrence and Bourke [13] for a similar approach to generating electron density isosurfaces using MC.
• Atom centers. Here, we start with the set of atom centers of a molecule, not with a discretized volume data
set. This means that the regularly spaced 3D grid of
voxels (now vertices) has to be calculated before entering the MC algorithm. Merelli et al. use this technique to extract and render van der Waals surfaces, as well as Lee-Richards surfaces [17], while D'Agostino et al. use a similar approach to extract Connolly surfaces [8]. Can et al. propose an MC-based method that provides a framework for computing van der Waals, Lee-Richards, and Connolly surfaces using a distance field based on level sets [5]. Yu et al. also use
centers and atom radii as inputs for their algorithm
to render Blinn’s molecular surfaces with a Gaussian
kernel [25].
Obviously, there are several parallel implementations of the MC algorithm. To the best of our knowledge, there are only a few implementations of MC using shaders that run completely on the GPU side; see Geiss [10], Uralsky [22], and Dyken et al. [9]. The first two use Shader Model 4 (SM4). This was made possible because, unlike the previous shader models, SM4 admits the generation of triangles on the GPU side.
Interestingly, most parallelized isosurfacing algorithms use
the SIMD (Single Instruction, Multiple Data) architecture
and also the MIMD (Multiple Instructions, Multiple Data)
architecture. See, for example, Newman et al. [18] for a
wealth of references on this topic.
Most of these parallelized MC algorithms fall into the volume dataset category. The only parallelized algorithms whose input is a set of ball (or atom) centers are the ones due to Uralsky [22] and D'Agostino et al. [8]. The first uses SM4, while the second uses SIMD. Our SIMD-based MC algorithm also takes the set of atoms of a given molecule as input but, unlike others found in the literature, it uses the CUDA programming model from Nvidia. It renders Blinn's molecular surfaces using different atomic kernels. Note that Nvidia already provides a CUDA SDK code sample for Marching Cubes that extracts an isosurface from a volume dataset; consequently, it does not work for extracting a molecular surface from its set of atom centers. However, these two CUDA-based MC algorithms share the second, third and fourth kernels described in Section 5.
3. SMOOTH MOLECULAR SURFACES
A molecule is a set of atoms that can be mathematically
described as a union of balls, each representing an atom. In
this paper, we are interested in analytical formulations of
molecular surfaces that result from summing up local functions that describe the electrical field of atoms. This is so
because each atom has its own electrical field, which can
be described by an analytical function that decays with the
distance. It is clear that there are several functions that can
describe the electrical field of an atom, namely:
• Gaussian function [3].

$$f_i(x, y, z) = a\, e^{-b r^2} \qquad (1)$$

where b is the standard deviation of the Gaussian curve, a is the height of the Gaussian curve, and r is the distance to the atom center.
• Soft object function [24].

$$f_i(x, y, z) = \begin{cases} a\left(1 - \dfrac{4r^6}{9b^6} + \dfrac{17r^4}{9b^4} - \dfrac{22r^2}{9b^2}\right) & r \le b \\ 0 & r > b \end{cases} \qquad (2)$$

where b is the standard deviation of the curve, a is the height of the curve, and r is the distance to the center of atom i.
• Inverse squared function.

$$f_i(x, y, z) = \frac{C}{r_i} \qquad (3)$$

where C stands for the smoothness or blobbiness parameter and $r_i = (x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2$ is the squared distance from the center $(x_i, y_i, z_i)$ of atom i to a generic point (x, y, z). Typically, the parameter C takes on a value in the interval ]0, 1].
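For concreteness, the three local field functions above translate directly into code. The following C sketch is illustrative only, not the paper's implementation; each function takes the distance (or squared distance) to the atom center plus the shape parameters of Equations (1)-(3).

```c
#include <math.h>

/* Gaussian kernel, Equation (1): r is the distance to the atom center. */
static float gaussian_field(float r, float a, float b)
{
    return a * expf(-b * r * r);
}

/* Soft object kernel, Equation (2): vanishes beyond the radius b. */
static float soft_object_field(float r, float a, float b)
{
    if (r > b) return 0.0f;
    float r2 = r * r, b2 = b * b;
    float r4 = r2 * r2, b4 = b2 * b2;
    float r6 = r4 * r2, b6 = b4 * b2;
    return a * (1.0f - (4.0f * r6) / (9.0f * b6)
                     + (17.0f * r4) / (9.0f * b4)
                     - (22.0f * r2) / (9.0f * b2));
}

/* Inverse squared kernel, Equation (3): r2 is the squared distance r_i. */
static float inverse_squared_field(float r2, float C)
{
    return (r2 > 0.0f) ? C / r2 : 0.0f;
}
```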
Thus, the electrical field of each atom in the molecule is
formulated as a distance function, i.e. an implicit scalar
function. The molecular surface results from the summation of the distance functions associated with all atoms, i.e., the summation of the electric fields of all the atoms, as follows:
$$F(x, y, z) = \sum_{i=1}^{N} f_i(x, y, z) \qquad (4)$$
where N is the total number of atoms of the molecule. The
atomic local functions fi work as blending functions of the
resulting molecular surface, which will hopefully be smooth.
The smoothness of the molecular surface depends on the
smoothness of the blending functions. Thus, the blending
functions must be differentiable. A surface with these characteristics is referred to as a convolution surface [4].
A molecular surface formulated in this manner is an implicit surface defined by a function F with domain R^3 and range R. This implicit molecular surface is given by all the points in R^3 that satisfy F(x, y, z) = T, where T is the isovalue; that is, it corresponds to the level set defined by the constant T. Since the field functions decay with distance, F < T for points outside the surface, F > T for points inside the surface, and F = T for points on the surface.
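A minimal sketch of this level-set formulation follows, using hypothetical names (Atom, field_F) that are not part of the paper's code; any of the local kernels sketched above, e.g. the Gaussian one, can be plugged in through the function pointer.

```c
#include <stddef.h>
#include <math.h>

typedef struct { float x, y, z; } Atom;               /* atom center */
typedef float (*FieldFn)(float r, float a, float b);  /* local kernel f_i(r) */

/* F(x, y, z) = sum_i f_i(r_i), Equation (4), with r_i the distance to atom i. */
static float field_F(float x, float y, float z,
                     const Atom *atoms, size_t n, FieldFn f, float a, float b)
{
    float F = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float dx = x - atoms[i].x, dy = y - atoms[i].y, dz = z - atoms[i].z;
        F += f(sqrtf(dx * dx + dy * dy + dz * dz), a, b);
    }
    return F;
}

/* Level-set test against the isovalue T: since the field decays with distance,
   F > T inside the surface, F < T outside, and F = T on the surface. */
static int inside_surface(float F, float T)
{
    return F > T;
}
```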
4. GPU PROGRAMMING USING CUDA
We decided to run our program on the GPU because our algorithm fits the parallel computing model very well. In fact, the voxelization of the axis-aligned bounding box that encloses the molecule allows us to speed up the program by assigning a thread to each cube of the bounding box. So, before proceeding any further, let us first review some important details of CUDA technology and GPU programming.
4.1 CUDA Architecture
CUDA (Compute Unified Device Architecture) gives data-intensive applications access to the tremendous processing power of Nvidia graphics processing units (GPUs) through a massively parallel computing architecture [19]. In our work, we have used an Nvidia GeForce GTX 280 graphics card. This graphics card includes 10 thread processing clusters (TPCs), where each TPC consists of 3 streaming multiprocessors (SMs), with each SM having 8 stream processors, which yields 240 cores in total. Each core processes threads; hence the parallel computing model on the GPU. To control and distribute the massive computing load across the 240 cores, a hardware thread scheduler guarantees nearly full utilization of the GPU [2].
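As a side note, the figures quoted above (number of multiprocessors, compute capability) can be queried at runtime through the CUDA runtime API. The following host-side sketch is not part of the paper's code; it simply prints these properties for device 0.

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    /* Query the properties of device 0 (e.g., a GeForce GTX 280). */
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }
    printf("%s: %d multiprocessors, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}
```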
4.2 CUDA Programming Model
When programming in CUDA, we must take care of how a program is going to be executed. In this paper, a CUDA program is written in C and contains CPU-specific functions and GPU-specific functions. A CPU-specific function runs on the CPU (host), while a GPU-specific function is executed on the GPU (device), as shown in Figure 1. There are three important GPU-specific qualifiers. The first, __global__, acts as a prefix of functions that are GPU kernels, that is, GPU functions that are invoked from CPU-specific code. The second, __device__, qualifies a GPU function that can only be called from a GPU kernel. Finally, the third qualifier, __shared__, acts as a prefix of a variable allocated in the shared memory of a streaming multiprocessor. Note that GPU-specific functions cannot be recursive.
There are three other important CUDA functions to (dis)allocate memory on the GPU side and to exchange data between the CPU and the GPU. The cudaMalloc function is used to allocate memory on the GPU, while the cudaFree function releases memory on the GPU. The cudaMemcpy function serves to transfer data from the CPU side to the GPU side, and vice-versa. Note that a kernel is a GPU-specific function that launches threads, hierarchically organized into grids, blocks, and threads [1, 19], as illustrated in Figure 1.
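The following minimal CUDA C example (illustrative only, not taken from the paper's code) puts these pieces together: a __global__ kernel launched from the host, device memory allocated with cudaMalloc, data exchanged with cudaMemcpy, and memory released with cudaFree.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* GPU kernel: each thread scales one element of the array. */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    float h_data[1024];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));                       /* allocate on the GPU   */
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 127) / 128, 128>>>(d_data, 2.0f, n);                      /* blocks of 128 threads */

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);                                                      /* release GPU memory    */

    printf("h_data[10] = %f\n", h_data[10]);                               /* expect 20.0 */
    return 0;
}
```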
The compilation of a CUDA program is done in three stages. The first stage extracts the CPU code from the CUDA file into an intermediate file that is then passed to the standard C compiler. Next, the remaining code, that is, the GPU code, is converted into a PTX file (i.e., a kind of assembly language). The third stage translates this PTX file into GPU-specific commands and encapsulates them in an executable file.

Figure 1: GPU architecture [2].
5. TRIANGULATING BLINN SURFACES
Amongst all the available triangulation algorithms used to render molecular surfaces, the Marching Cubes algorithm is possibly the most adequate for GPU processing. This is justified by the fact that MC leads to a uniform partition of the bounding box into cubes (defined by 8 vertices or voxel centers), which is ideal for the design of a parallel implementation on the GPU.
5.1 Marching Cubes: Overview
This algorithm starts by dividing the space that contains the molecule into a uniform grid of cubes of appropriate size. The next step is the calculation of the electrical field intensity F at all cube vertices (or voxels) of the grid (Equation 4). This involves very time-consuming computations, because molecules may have long chains of atoms, and the spatial distribution of these atoms often requires very large grids, even for molecules with a small number of atoms. To deal with this problem, our algorithm calculates the intensity in a per-atom fashion, instead of calculating it at each grid vertex directly. Since the influence of the field can be neglected beyond some appropriate distance, this optimization can be used in general for intensity calculations. Next, we calculate the position of the cube that contains each atom, and then we define a sub-grid centered on such an atom/cube. The intensity contribution of this atom is added to the current intensity of each sub-grid vertex. Having finished the computation of the intensity F for all grid vertices, we have to determine the 8-bit flag of each cube. This flag characterises the surface inside each cube. The flag bits form an index to a lookup table to determine which edges of each cube intersect the isosurface. If F ≥ T at a grid vertex, the corresponding flag bit is set to 1; otherwise it is set to 0. After setting the flag bits of each cube, the algorithm uses the lookup table to determine which of the 256 surface cases fits inside the cube. When this is done, the algorithm goes a step further and interpolates the surface vertices, for each cube along the appropriate edges, using the lookup tables. After computing the surface vertices for all cubes, we are ready to generate the triangles inside each cube, which together form the final mesh that approximates the molecular surface. It is clear that rendering this surface mesh requires the calculation of the triangle normals. The reader is referred to Lorensen and Cline [16] for further details on MC.
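As a concrete illustration of the flag computation just described, the sketch below builds the 8-bit index of a cube from the field values at its eight voxels; the standard MC lookup tables (referred to here as edgeTable) are assumed to be available from Lorensen and Cline's construction and are not reproduced.

```c
/* Build the Marching Cubes case index of one cube from the field values F at
   its 8 corner voxels and the isovalue T: bit i is set when F[i] >= T. */
static unsigned int cube_index(const float F[8], float T)
{
    unsigned int idx = 0;
    for (int i = 0; i < 8; ++i)
        if (F[i] >= T)
            idx |= (1u << i);
    return idx;   /* 0..255, one of the 256 MC cases */
}

/* The index is then used against the classic MC lookup tables, e.g.
 *   unsigned int edges = edgeTable[cube_index(F, T)];  // which edges are crossed
 * A cube with index 0 or 255 lies entirely outside or inside the surface and
 * produces no triangles. */
```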
5.2 GPU Implementation

We use CUDA to implement a variant of the Marching Cubes (MC) algorithm for molecular surfaces. All MC computations are carried out on the GPU. The algorithm can be described as follows:
• Bounding Box Computation (CPU side). We first determine the axis-aligned bounding box that encloses a given molecule while reading the atom centers from a PDB file into a 1-dimensional array of size M, where M is the number of atoms of the molecule. Computing the bounding box here means computing the locations of the two opposite vertices of its diagonal.
• Cube Decomposition (CPU side). Next, we determine
the number N = I × J × K of cubes with a predefined size, where I, J, and K stand for the number
of cubes along the x-, y-, and z-axes, respectively, that fit
the bounding box.
• Allocation and Copying of Data Structures to GPU. This step involves the allocation of GPU memory for ten 1-dimensional arrays, including the array of M atomic centers and an array of N integers, one integer (identifier) per cube, used later to associate each cube with a single thread. These allocation and copy operations are carried out by calling the cudaMalloc and cudaMemcpy functions from the CPU side. These two functions, together with the cudaBindTexture function, are also used to allocate and copy the lookup tables of the MC algorithm to GPU texture memory, because these tables remain unchanged. The aforementioned array of N integers serves the purpose of storing the number of vertices of the triangulation of the molecular surface patch inside each cube, as needed in a later step.
• Computation of Electric Field Intensities at Voxels (1st GPU kernel). This kernel computes the intensity of the electric field at each voxel (i.e., cube vertex); this value is initialized to zero at each voxel. Because the number of cubes largely exceeds the number of atoms, the intensity computations are carried out per atom rather than per cube. This is reinforced by the fact that some biological molecules have long separate chains of carbon atoms (see Figure 6(n)), which implies that the spatial distribution of these atoms gives rise to very large bounding boxes even for molecules with a small number of atoms. The per-atom intensity computation only occurs within a boxed neighborhood around each atom, i.e., at the voxels surrounding each atom center. The local box centered at each atom usually has 20 × 20 × 20 voxels. Thus, taking into consideration the Gaussian-like behavior of the local function fi of atom i, we are assuming that fi takes on the value 0 beyond the surrounding box of the atom. In other words, the intensity contribution fi of atom i at each voxel of its surrounding box is added to the current intensity value F of that voxel. The parallelization of this intensity computation is accomplished by running a single-thread block for each atom, so we end up with M threads for M atoms running simultaneously (a sketch of this kernel is given after this list).
• Computation of MC Configurations for Cubes (2nd GPU kernel). After computing the intensities F for all voxels, we determine the 8-bit flag (a bit per voxel) that corresponds to 1 out of the 256 MC surface patterns we may find inside each cube. A bit has the value 0 if the corresponding voxel has an intensity under the pre-defined isovalue, and 1 if the intensity is over the isovalue. This 8-bit flag works as an index to a lookup table for determining which edges of the cube intersect the molecular isosurface. The parallelization of this process is done by running a number of simultaneous blocks of 128 threads, where each thread computes the 8-bit flag of a single cube. The number of blocks depends on the number of cubes of the bounding box enclosing the molecule. The flags associated with the cubes are stored into a specific 1-dimensional array. Each flag is used to fill in the aforementioned array of N integers, which stores the number of vertices of the triangulation inside the corresponding cube.
• Allocation of VBO Array for Posterior Rendering of the Surface Mesh (3rd GPU kernel). By performing a Parallel Prefix Sum (scan) [11] of the number of mesh vertices in the array of N integers of the previous step, we determine the total number of vertices that sample the molecular surface, from which we triangulate the surface. Recall that this triangulation will be done locally inside each cube, as usual in the MC algorithm. This Parallel Prefix Sum operation is carried out by calling the function cudppScan, which is essentially another GPU kernel. The number of vertices output by cudppScan is then used to allocate a VBO (Vertex Buffer Object) array for the mesh vertices, from which we can later generate triangles to render the mesh that approximates the molecular surface.
• Setting up a Mapping Array Between the Array of N Integers and the VBO Array (4th GPU kernel). This kernel creates an intermediate array of N elements between the array of N integers (recall that each integer yields the number of vertices of the surface crossing each cube) and the corresponding VBO array, which allows us to map a given cube (and its surface vertices) to its first surface vertex in the VBO array. That is, for each cube, the mapping array stores the index of the VBO array element where we will later store the first surface vertex associated with that cube. This creates an association between the occupied cubes (i.e., cubes crossed by the molecular surface) and the correct positions in the VBO array. Note that the intermediate mapping array is necessary because the threads run in parallel.

• Computation of Surface Vertices by Linear Interpolation (5th GPU kernel). This GPU kernel uses linear interpolation along the cube edges that intersect the molecular surface in order to sample the surface. The lookup tables are useful here to tell us which edges intersect the surface. The resulting sampling points of the surface will be the triangle vertices of the mesh that approximates the surface. These surface points are computed for each cube and are stored into the VBO array using the previous mapping array.

• Computation of Surface Normals at the Vertices in the VBO Array (6th GPU kernel). Taking into account that the vertices stored in the VBO array are points of the molecular surface, the normal at a given surface point is given by the gradient vector as follows:
$$\nabla F = \left( \frac{\partial F}{\partial x}, \frac{\partial F}{\partial y}, \frac{\partial F}{\partial z} \right) \qquad (5)$$

where

$$\frac{\partial F}{\partial x} = \sum_{i=1}^{N} \frac{-2\, C\, (x - x_i)}{\left[(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2\right]^2},$$

$$\frac{\partial F}{\partial y} = \sum_{i=1}^{N} \frac{-2\, C\, (y - y_i)}{\left[(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2\right]^2},$$

and

$$\frac{\partial F}{\partial z} = \sum_{i=1}^{N} \frac{-2\, C\, (z - z_i)}{\left[(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2\right]^2}.$$
These normals are stored in a separate array.
• Rendering the Molecular Surface (CPU side). Taking
into consideration that the VBO array in the GPU can
be accessed from the CPU side, we only have to display the VBOs to render the surface. We use the Gouraud shading provided by OpenGL.
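To make the first kernel concrete, here is a simplified CUDA sketch of the per-atom field accumulation described in the list above, using the inverse squared kernel of Equation (3). The kernel name, the grid-geometry parameters, and the use of atomicAdd are illustrative assumptions rather than the paper's actual code; in particular, atomicAdd on floats requires compute capability 2.0 or later, so a GTX 280 (compute capability 1.3) would need a different accumulation strategy for overlapping atom boxes.

```cuda
#include <cuda_runtime.h>

/* Grid geometry: I x J x K voxels spaced h apart, origin at orig.
   One thread per atom; each thread updates the voxels of a local box
   (HALF = 10, i.e. roughly 20 x 20 x 20 voxels) around its atom. */
__global__ void addAtomField(const float3 *atoms, int nAtoms,
                             float *F, int I, int J, int K,
                             float3 orig, float h, float C)
{
    int a = blockIdx.x * blockDim.x + threadIdx.x;
    if (a >= nAtoms) return;

    const int HALF = 10;
    float3 p = atoms[a];
    /* Voxel indices of the cube containing the atom. */
    int ci = (int)((p.x - orig.x) / h);
    int cj = (int)((p.y - orig.y) / h);
    int ck = (int)((p.z - orig.z) / h);

    for (int i = max(ci - HALF, 0); i <= min(ci + HALF, I - 1); ++i)
        for (int j = max(cj - HALF, 0); j <= min(cj + HALF, J - 1); ++j)
            for (int k = max(ck - HALF, 0); k <= min(ck + HALF, K - 1); ++k) {
                float dx = orig.x + i * h - p.x;
                float dy = orig.y + j * h - p.y;
                float dz = orig.z + k * h - p.z;
                float r2 = dx * dx + dy * dy + dz * dz;   /* squared distance */
                if (r2 > 0.0f)
                    /* Overlapping atom boxes would race on F without atomics;
                       atomicAdd on float needs compute capability >= 2.0. */
                    atomicAdd(&F[(i * J + j) * K + k], C / r2);
            }
}
```

A launch such as addAtomField<<<(nAtoms + 127) / 128, 128>>>(...) mirrors the one-thread-per-atom strategy described above; the remaining kernels (classification, scan-based compaction, interpolation, and normals) follow the same grid/block pattern.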
6. RESULTS
In our tests, we have used a Windows XP PC equipped with a Quad Core Q9550 running at 2.83 GHz and 4 GB of RAM, and an Nvidia GeForce GTX 280 CUDA-programmable graphics card. We have written two programs for the same algorithm:
• CPU-based Sequential Program. This program was written in C++ and makes use of the STL (Standard Template Library).

• CUDA-based Parallel Program. This program was written in C and makes use of the CUDA API v2.0.
Both programs use single precision floating-point numbers.
We have used 16 different molecules, read in from the corresponding .pdb files, to compare the time performance of
both algorithms, as shown in Tables 1 and 2. These .pdb
files are ASCII files and were taken from the Protein Data
Bank (PDB) at www.rcsb.org, which is a repository for 3D
structural data of molecules. Each molecule has a unique
ID (see the first column of Tables 1 and 2).
The time performance results output by both programs are
presented in Tables 1 and 2. In both tables we have columns
for the number of atoms (# Atoms) and the number of cubes
(# Cubes). The remaining columns concern the time performance (in seconds) of the algorithm using 3 different local
blending functions.
6.1 CPU Performance
As shown in Table 1, the inverse squared function is computationally less expensive than the other two blending functions because it involves a smaller number of arithmetic operations. In addition, the soft object function performs better than the Gaussian function because it is a truncated Taylor expansion of the exponential used in the Gaussian function.
ID      # Atoms   Gaussian        Soft Object     Inverse Squared   # Cubes
                  Function (sec)  Function (sec)  Function (sec)
110D    120       0.68            0.45            0.43              346580
200D    259       1.90            1.40            1.20              603520
1QL1    322       2.79            2.14            1.90              1581370
4PTI    381       4.89            3.70            3.00              1043464
1BK2    468       4.89            3.70            3.00              650160
2QZF    479       5.55            4.30            3.60              3536664
2QZD    507       6.10            4.70            4.00              3860800
2OT5    545       6.00            4.80            4.14              1154032
1HH0    691       6.40            5.50            4.46              1632000
1HGV    691       6.56            5.50            4.57              1970752
1HGZ    691       6.30            5.45            4.36              1605760
2INS    781       11.10           9.12            7.30              1316250
1QL2    966       16.10           13.90           11.50             4619920
1IZH    1521      37.40           31.20           25.10             2578680
1GT0    2746      102.00          94.14           67.87             5681590
1BIJ    4387      250.50          240.20          179.10            7229120

Table 1: CPU times (in seconds).
ID      # Atoms   Gaussian        Soft Object     Inverse Squared   # Cubes
                  Function (sec)  Function (sec)  Function (sec)
110D    120       0.13            0.13            0.13              352000
200D    259       0.27            0.27            0.26              610176
1QL1    322       0.38            0.37            0.37              1600896
4PTI    381       0.41            0.40            0.40              1053440
1BK2    468       0.36            0.35            0.35              657920
2QZF    479       0.88            0.72            0.72              3562624
2QZD    507       0.94            0.93            0.94              3901824
2OT5    545       0.56            0.53            0.51              1167360
1HH0    691       0.73            0.69            0.67              1652608
1HGV    691       0.89            0.74            0.70              1994752
1HGZ    691       0.68            0.65            0.64              1626112
2INS    781       0.84            0.82            0.82              1326976
1QL2    966       2.10            2.10            2.00              4654208
1IZH    1521      2.40            2.22            2.15              2601472
1GT0    2746      7.52            7.54            6.90              5720832
1BIJ    4387      15.00           14.40           14.20             7267456

Table 2: GPU times (in seconds).
6.2 GPU Performance
The time performance results obtained using the GPU-based algorithm for molecular surfaces are presented in Table 2. As shown in Table 2, there are no significant differences in rendering a molecular surface using distinct blending functions. This is a consequence of the parallelization taking place in the GPU kernels.
Figure 2: CPU runtime performance.
Comparing the CPU (Table 1) and GPU (Table 2) results, we also note slight differences in the number of cubes. The reasons behind these differences are:
• Distribution of the GPU Processing Load. The number
of threads per block —in a grid of thread blocks— is a
multiple of 2; we have used 128 threads per block. This
means that the number of cubes of the bounding box
that encloses the molecule must also be a multiple of
two because each cube is processed by a single thread.
• Cube Indexing. For the CPU implementation we use 3D arrays of size Bx × By × Bz. However, taking into account that GPU 3D arrays were not available at the starting time of this project, we had to use GPU 1D arrays to store the cube data, with x × (By × Bz) + y × Bz + z as the corresponding indexing formula for cube (x, y, z), where Bx, By, and Bz are the sizes of the bounding box along each axis (see the snippet right after this list).
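A minimal sketch of this 3D-to-1D cube indexing, assuming the usual row-major layout implied by the formula above (the function name is hypothetical):

```c
/* Linear index of cube (x, y, z) in a 1D array laid out as Bx slabs of
   By rows of Bz cubes each (row-major order). */
static inline int cubeIndex(int x, int y, int z, int By, int Bz)
{
    return x * (By * Bz) + y * Bz + z;
}

/* Example: with By = 100 and Bz = 50, cube (2, 3, 4) maps to
   2 * 5000 + 3 * 50 + 4 = 10154. */
```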
As expected, and looking at Tables 1 and 2, we note that the GPU-based implementation is far faster than the CPU implementation because all cubes are processed at the same time, with each cube being processed by a single thread. As a result, we obtained a maximum gain of 14x for the 1BIJ molecule (last entry in Tables 1 and 2), depicted in Figure 6(q).
Figure 3: GPU runtime performance.
Figures 2 and 3 show the experimental time complexity of the CPU- and GPU-based algorithms, respectively. The CPU-based algorithm seems to have quadratic running time because its graph approximates a parabola. On the other hand, the GPU-based algorithm seems to have linear running time because, apart from the unstable behaviour for small molecules (a consequence of the non-uniform distribution of atoms), its graph approximates a line.
These results are also due to the kernel optimization enabled by the occupancy calculator provided by Nvidia. This calculator allows us to know in advance the multiprocessor occupancy of a GPU for a certain kernel. This way, we can maximize the number of threads that are supported by a multiprocessor, which has a set of W registers available for each kernel. This optimization ensures that our program does not fail to launch a kernel, simply because we have the guarantee that no more than W registers will be used, thus getting the maximum performance out of the GPU.
To use the occupancy calculator, we must fill in some parameters in a spreadsheet provided by Nvidia. These parameters are obtained from the code compilation and are stored in an output file. By analyzing the content of this file, we know how many registers and how much local and shared memory a kernel will actually require at runtime. Putting these values into the occupancy calculator shows us the occupancy rate of each multiprocessor. For example, in Figure 4, we have the parameters for the 5th GPU kernel.
Figure 5: Graphs of the occupancy calculation.
Figure 4: The parameters for the 5th GPU kernel.
The parameter configuration shown in Figure 4 involves a multi-step procedure. First, we must choose the correct compute capability version for our GPU (1.3 in our case) in the green box. Then, we introduce the three resource usage parameters in the orange box; the first field concerns the number of threads per block, the second indicates the number of registers per thread, while the third stands for the amount of shared memory (in bytes) per block. This concludes the configuration procedure, with the relevant data appearing automatically in the remaining four boxes, as illustrated in Figures 4 and 5.
With respect to the 5th kernel, we got a maximum multiprocessor occupancy of 63% by varying the three parameters mentioned above in the orange box. The other kernels have about 100% multiprocessor occupancy, as shown in Table 3.
Kernel       Threads     Registers    Shared Memory        Occupancy of each
             per Block   per Thread   per Block (bytes)    Multiprocessor
1st kernel   1           18           140                  100%
2nd kernel   128         8            100                  100%
3rd kernel   128         8            116                  100%
4th kernel   128         3            60                   100%
5th kernel   128         25           150                  63%
6th kernel   128         16           148                  100%

Table 3: CUDA GPU occupancy calculator.

As Table 3 shows, the occupancy of each multiprocessor for
our algorithm depends on the number of registers that each kernel uses. This is so because each kernel has its own processing load, which depends on the type of code instructions present in each kernel. This means that, when programming with the CUDA API, we must be careful when using loop instructions (i.e., for, do-while, and while) because these instructions occupy many registers, decreasing the efficiency of a kernel. Thus, it is good practice to reduce the number of variables to a minimum and to avoid loop instructions as much as possible.
7. CONCLUSIONS
As is well known, biologists and biochemists have long used rigid three-dimensional models to clarify the structure of molecules, which are commonly modeled either as stick diagrams that emphasise the covalent bonds among atoms, or as space-filling diagrams that describe the space they occupy. In this respect, surfaces play a fundamental role in molecular function, because chemical-physical actions are driven by their mechanical and electrostatic properties.
In this paper we have presented a CUDA-based method to render smooth molecular surfaces on the GPU. The main contribution of this work is a multi-threaded GPU-based implementation of the Marching Cubes algorithm to triangulate and render molecular surfaces, with all the particularities of computing the surface locally in the neighborhood of each atom. Also, a comparison with a CPU-based implementation of Marching Cubes has been carried out for a number of different molecules.
Moreover, taking into consideration the experimental linear time complexity of the GPU-based algorithm, we hope in the near future to design and implement a scalable parallel algorithm using a cluster of GPUs in order to process and render large molecules (i.e., with hundreds of thousands of atoms) in real time.
The source code is available at:
http://www.di.ubi.pt/~agomes/software/molcuda.zip
8. REFERENCES
[1] CUDA Programming Guide 2. Available at:
http://developer.download.nvidia.com, 2008.
[2] Technical Brief: NVIDIA Geforce GTX 200 GPU
Architectural Overview. Available at:
http://developer.download.nvidia.com, 2008.
[3] J. F. Blinn. A generalization of algebraic surface
drawing. ACM Trans. Graph., 1(3):235–256, July
1982.
538
[4] J. Bloomenthal and K. Shoemake. Convolution
surfaces. Computer Graphics, 25(4):251–256, 1991.
[5] T. Can, C.-I. Chen, and Y.-F. Wang. Efficient
molecular surface generation using level-set methods.
Journal of Molecular Graphics and Modelling,
25:442–454, 2006.
[6] M. Connolly. Analytical molecular surface calculation.
Journal of Applied Crystallography, 16(5):548–558,
October 1983.
[7] M. Connolly. Solvent-accessible surfaces of proteins
and nucleic acids. Science, 221(4612):709–713, August
1983.
[8] D. D’Agostino, I. Merelli, A. Clematis, L. Milanesi,
and A. Orro. A parallel workflow for the
reconstruction of molecular surfaces. In C. Bischof,
M. Bücker, P. Gibbon, G. Joubert, T. Lippert,
B. Mohr, and F. Peters, editors, Parallel Computing:
Architectures, Algorithms and Applications, volume 15
of Advances in Parallel Computing, pages 147–154.
IOS Press, 2008.
[9] C. Dyken, G. Ziegler, C. Theobalt, and H.-P. Seidel.
High-speed marching cubes using histopyramids.
Computer Graphics Forum, 27(8):2028–2039, 2008.
[10] R. Geiss. Generating complex procedural terrains
using the GPU. In H. Nguyen, editor, GPU Gems 3.
Addison-Wesley Professional, New Jersey, USA, 2007.
[11] M. Harris, S. Sengupta, and J. D. Owens. Parallel prefix sum (scan) with CUDA. In H. Nguyen, editor, GPU Gems 3. Addison-Wesley Professional, Upper Saddle River, New Jersey, Aug 2007.
[12] W. Heiden, T. Goetze, and J. Brickmann. Fast generation of
molecular surfaces from 3D data fields with an
enhanced “marching cube” algorithm. Journal of
Computational Chemistry, 14(2):246–250, 1993.
[13] M. C. Lawrence and P. Bourke. CONSCRIPT: a program
for generating electron density isosurfaces for
presentation in protein crystallography. Journal of
Applied Crystallography, 33:990–991, 2000.
[14] B. Lee and F. Richards. The interpretation of protein
structures: Estimation of static accessibility. Journal
of Molecular Biology, 55(3):379–400, February 1971.
[15] C. Levinthal. Molecular model-building by computer.
Scientific American, 214(6):42–52, June 1966.
[16] W. Lorensen and H. Cline. Marching cubes: A high
resolution 3d surface construction algorithm. ACM
SIGGRAPH Computer Graphics, 21(4):163–169, July
1987.
[17] I. Merelli, L. Milanesi, D. D’Agostino, A. Clematis,
M. Vanneschi, and M. Danelutto. Marching cubes: A
high resolution 3d surface construction algorithm. In
Proceedings of the 1st International Conference on
Distributed Frameworks for Multimedia Applications
(DFMA’05), pages 288–294, Washington, DC, USA,
2005. IEEE Computer Society.
[18] T. Newman, J. Byrd, P. Emani, A. Narayanan, and
A. Dastmalchi. High performance simd marching
cubes isosurface extraction on commodity computers.
Computers & Graphics, 28:213–233, 2004.
[19] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E.
Stone, and J. C. Phillips. GPU computing.
Proceedings of the IEEE, 96(5):879–899, 2008.
[20] F. Richards. Areas, volumes, packing, and protein
structure. Annual Review of Biophysics and
Bioengineering, 6(3):151–176, February 1977.
[21] N. Tang and T. Newman. A vector-parallel realization
of the marching cubes. Technical Report
TR-UAH-CS-1996-01, Computer Science Department,
The University of Alabama in Huntsville, USA, 1996.
[22] Y. Uralsky. DX10: Practical metaballs and implicit
surfaces. In Game Developers Conference, 2006.
[23] Y. Vorobjev and J. Hermans. Sims: Computation of a
smooth invariant molecular surface. Biophysical
Journal, 73(2):722–732, 1997.
[24] G. Wyvill, C. McPheeters, and B. Wyvill. Data
Structure for Soft Objects. The Visual Computer,
2(4):227–234, February 1986.
[25] Z. Yu, M. Holst, Y. Cheng, and J. McCammon.
Feature-preserving adaptive mesh generation for
molecular shape modeling and simulation. Journal of
Molecular Graphics and Modelling, 26:1370–1380,
2008.
Figure 6: Examples of molecular surfaces displayed using the multi-threaded Marching Cubes algorithm: (a - PDB ID: 110d), (b - PDB ID: 200d), (c - PDB ID: 1ql1), (d - PDB ID: 4pti), (e - PDB ID: 1bk2), (f - PDB ID: 2qzf), (g - PDB ID: 2qzd), (h - PDB ID: 2ot5), (i - PDB ID: 1hh0), (j - PDB ID: 1hgv), (l - PDB ID: 1hgz), (m - PDB ID: 2ins), (n - PDB ID: 1ql2), (o - PDB ID: 1izh), (p - PDB ID: 1gt0), (q - PDB ID: 1bij).