Solving Molecular Distance Geometry Problems in OpenCL

István Lőrentz
Electronics and Computers Department,
Transilvania University, Braşov, Romania
Email: isti spl@yahoo.com

Răzvan Andonie
Computer Science Department,
Central Washington University, Ellensburg, WA, USA
and Electronics and Computers Department,
Transilvania University, Braşov, Romania
Email: andonie@cwu.edu

Levente Fabry-Asztalos
Department of Chemistry,
Central Washington University, Ellensburg, WA, USA
Email: FabryL@cwu.edu
Abstract—We focus on the following computational chemistry
problem: Given a subset of the exact distances between atoms,
reconstruct the three-dimensional position of each atom in the
given molecule. The distance matrix is generally sparse. This
problem is both important and challenging. Our contribution is
a novel combination of two known techniques (parallel breadth-first search and geometric buildup) and its parallel OpenCL
implementation. The approach has the potential to speed up
computation of three-dimensional structures of molecules - a
critical process in computational chemistry. From experiments
on multi-core CPUs and graphics processing units, we conclude that, for sufficiently large problems, our implementation shows moderate scalability.
I. INTRODUCTION
Knowing the three-dimensional structures of molecules is critical in many scientific fields, especially in chemical biology and medicinal chemistry. These structures play a central role in molecular interactions (e.g., between two biologically significant large proteins, or between an enzyme and its therapeutically important small-molecule inhibitors) and strongly contribute to chemical and biological properties.
Experimentally, structural characteristics can be determined using X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy, while theoretically they can be determined using various potential-energy minimization and bioinformatics techniques. In chemistry, distance geometry problems often arise, especially when determining the exact structures of large molecules (e.g., proteins). Furthermore, many
chemical structures (large and small) cannot be crystallized;
therefore, their structures cannot be determined using X-ray
crystallography. In addition, when structures are determined
using NMR spectroscopy, especially of very large molecules,
often only partial structural information can be determined due
to the experimental limitations of the method.
The Molecular Distance Geometry Problem (MDGP) aims
to reconstruct the three-dimensional position (x, y, z coordinates) of each atom from the pairwise distances given. More
formally, given a set of n atoms, and the D = {dij } set
of Euclidean distances between atoms, the task is to find
positions x1 , ..., xn ∈ R3 of the atoms in the molecule such
that ||xi − xj || = dij .
Besides its chemistry applications, the distance geometry problem arises in other fields, such as graph drawing [1]–[3] and the placement of wireless sensors. In graph theory, the problem corresponds to finding an embedding of an undirected graph into R3, by considering the atoms as the graph's vertices and the known interatomic distances as the weights of the edges. If all pairwise distances are accurately known (a complete graph), classical multidimensional scaling solves the problem by eigendecomposition of the squared, normalized distance matrix [4]–[6].
The problem can also be regarded as a nonlinear global optimization problem: find the conformation X that minimizes the stress function

\sigma^2(X) = \sum_{i<j\le n} w_{ij} (\|x_i - x_j\| - \delta_{ij})^2          (1)

where w_ij is a weight associated with the (i, j) pair (w_ij = 1 if the distance between i and j is known and 0 otherwise) and δ_ij is the measured, known distance.
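As an illustration, the stress function (1) can be evaluated directly from a candidate conformation and the sparse set of known distances. The following sketch is our own illustration (not part of the paper's implementation) and assumes the known distances are stored in a dictionary keyed by atom index pairs:

import numpy as np

def stress(X, known):
    """Stress (1): sum of squared deviations between current and known distances.

    X     : (n, 3) array of candidate coordinates.
    known : dict {(i, j): delta_ij} with i < j, the measured distances.
    """
    total = 0.0
    for (i, j), delta in known.items():
        d = np.linalg.norm(X[i] - X[j])
        total += (d - delta) ** 2        # w_ij = 1 for every known pair
    return total

# Tiny example: three atoms on a line with two known distances.
X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.5, 0.0, 0.0]])
print(stress(X, {(0, 1): 1.0, (1, 2): 1.5}))    # exact conformation -> 0.0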
However, real-life data is often characterized by:
• Incomplete distances. Usually only distances between
certain types of atoms closer than a given threshold are
known.
• Imprecise distances, affected by measurement error. In
some cases each distance is given as lower and upper
bound.
Based on these input data, the problem to be solved falls into
one of the following types:
a. Interatomic distances between all pairs are known precisely (corresponding to a complete graph with n(n − 1)/2
edges).
b. All interatomic distances are known, but affected by error.
An upper and a lower bound is given for each.
c. Only a subset of the exact distances is known. The resulting distance matrix is sparse.
d. A subset of distances affected by measurement error is
known.
We focus on Type c problems. Our contribution is a novel
combination of two known techniques (parallel breadth-first
search and geometric buildup) for solving such problems. We
have implemented and tested our model on a multi-core system
using OpenCL.
After an overview of related work (Section II) and of the OpenCL platform (Section III), we describe our method in Section IV. Section V contains the numerical results of several experiments, and Section VI presents our final remarks.
II. RELATED WORK
Good overviews of existing methods can be found in [7]–[9]. For Type a problems, when all distances are known, a classical approach is multidimensional scaling (CMDS) [4], [10]. CMDS uses eigendecomposition to compute a three-dimensional representation of the data given by the distance matrix D[n×n]:
1: procedure CMDS(D)
2:   D^(2) ← {d²_ij}                      ⊲ squared distances
3:   J ← I − {1/n}[n×n]
4:   B ← −(1/2) J D^(2) J
5:   compute the eigendecomposition QΛQ′ = B
6:   let λ1 ≥ λ2 ≥ λ3 > 0                 ⊲ the largest 3 positive eigenvalues of B
7:   Q+ ← [q1, q2, q3]                    ⊲ the matrix formed by the associated eigenvectors
8:   return X ← Q+ Λ^(1/2)
9: end procedure
In the above algorithm, D^(2) represents the distance matrix with each element squared, I the [n × n] identity matrix, and B the double-centered matrix (the sums of its rows and columns are zero). The resulting matrix X[n×3] contains the x, y, z coordinates of all n atoms.
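A compact NumPy sketch of the CMDS procedure above (our own illustration; the paper's version runs inside OpenCL) recovers coordinates, up to rotation, reflection, and translation, from a complete distance matrix:

import numpy as np

def cmds(D):
    """Classical multidimensional scaling: 3-D coordinates from a full distance matrix."""
    n = D.shape[0]
    D2 = D ** 2                                   # squared distances
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ D2 @ J                         # double-centered matrix
    w, Q = np.linalg.eigh(B)                      # eigendecomposition (ascending order)
    idx = np.argsort(w)[::-1][:3]                 # three largest eigenvalues
    return Q[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Round trip on random 3-D points: the recovered distances match the input.
P = np.random.rand(6, 3)
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
X = cmds(D)
Drec = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.allclose(D, Drec, atol=1e-8))            # True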
When some of the entries of the D matrix are missing or contain errors (Type b problems), another algorithm, EMBED [11], uses as inputs lower-bound and upper-bound distance matrices, so that the real (unknown) distance d_ij satisfies l_ij ≤ d_ij ≤ u_ij. Preprocessing is performed to ensure that these bounds do not violate the triangle and tetrangle inequalities. After preprocessing (distance smoothing), the coordinates are computed using classical scaling. As a final step, the EMBED algorithm contains a coordinate refinement based on gradient descent. The distance smoothing based on the triangle inequality can be solved by the Floyd-Warshall all-pairs shortest-paths algorithm. A parallel implementation on graphics processing units (GPU) can be found in [12]. However, the drawback of using non-sparse distance matrices is the O(n²) memory complexity, with O(n³) time complexity for triangle smoothing and O(n⁴) for the tetrangle inequality.
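As a sketch of the bound-smoothing idea (our simplified illustration, not the EMBED code): tightening the upper bounds with the triangle inequality is exactly an all-pairs shortest-path computation on the upper-bound matrix, which Floyd-Warshall performs in O(n³):

import numpy as np

def smooth_upper_bounds(U):
    """Triangle-inequality smoothing of an upper-bound matrix (Floyd-Warshall).

    U[i, j] is the current upper bound on the distance between atoms i and j
    (use a large value when the distance is unknown).
    """
    U = U.copy()
    n = U.shape[0]
    for k in range(n):
        # u_ij cannot exceed u_ik + u_kj
        U = np.minimum(U, U[:, [k]] + U[[k], :])
    return U

U = np.array([[0.0, 1.0, 9.0],
              [1.0, 0.0, 1.5],
              [9.0, 1.5, 0.0]])
print(smooth_upper_bounds(U))    # the 9.0 bound tightens to 2.5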
Another class of algorithms solves the geometry problem as a global optimization problem. Methods used include stress majorization [13], [14], global continuation on Gaussian smoothing of the error function [15] (the DGSOL program), monotonic basin hopping [16], data box searching [8], simulated annealing [17], and genetic algorithms [18], [19].
Another approach is the geometric build-up procedure, presented in [5], [20]–[24], which places the atoms sequentially: given 4 non-planar atoms, the location of a 5th atom can be determined using sphere intersections.
In this paper we present the parallelized build-up using
breadth-first search, whose sequential version is described in
our previous work [25].
A similar technique is used for sparse multidimensional
scaling using a subset of landmark points which are embedded
using the classical MDS. The remaining points are determined
by triangulation relative to the landmarks [26].
In certain cases, the problem can be discretized: given a sequence of atoms such that the interatomic distances for 3 consecutive atoms are known, the coordinates of the 4th atom can be determined up to a symmetry ambiguity. For each atom, the algorithm must take a binary placement decision, and the solution can be found by a binary tree traversal. A branch-and-prune (B&P) algorithm was developed in [27], [28] (the MD-jeep program), and is used for protein backbones.
In general, the complexity of the B&P algorithm grows exponentially with the number of atoms; however, for special cases, the solution can be found in polynomial time, as shown in [29]. The authors of that work suggest, based on empirical results, that on real protein conformations the B&P algorithm has polynomial complexity. In an earlier work, a "linearized embedding" algorithm for tree graphs was studied in [30].
A comprehensive study based on graph rigidity and graph
reduction can be found in [31].
III. OPENCL
OpenCL (Open Computing Language) [32] is an open industry standard parallel computing framework, designed to address a heterogeneous computing environment, across devices
containing general-purpose GPUs (Graphics Processing Units)
and multi-core CPUs. OpenCL uses a subset of the ISO C99
language, with extensions for parallelism. An OpenCL platform consists of a host computer, which controls one or more
OpenCL devices. While OpenCL supports task parallelism, the primary model is data-parallel programming. This model is related to the stream processing paradigm and to a relaxed Single Instruction Multiple Data (SIMD) model.
The application, running on the host computer, controls
the execution flow. The host communicates with devices by
OpenCL command queues. A device contains one or more
compute units (CU), which in turn are composed of one or
more processing elements (PE), and local memory. Work items
are lightweight threads grouped into work groups. A work
group executes on a single compute unit.
The programmer partitions the data into work-items to be executed in parallel and defines a kernel procedure. For data partitioning, OpenCL gives the choice of uni-, bi-, and three-dimensional blocks called NDRanges. Kernels are queued for parallel execution on an OpenCL device. Each work-item has an associated N-dimensional (1, 2, or 3) global index, accessible inside the kernels through the get_global_id(dimension) function.
The OpenCL memory access model is hybrid. Compute units in a device have shared access to the device memory, so CUs are programmed using shared-memory parallel techniques; however, communication with the host memory or with other devices requires explicit memory block transfers, similar to distributed systems.
IV. ALGORITHMS
Our approach consists of a parallel version of CMDS followed by a geometric buildup. We determine the vertex coordinates by parallel breadth-first graph traversal. Below, we briefly present the stages and then explain each of them in detail.
The problem is given as a weighted undirected graph G(V, E), where the vertices represent the atoms and the edges connect the pairs of atoms for which we know the distance in advance. For simplicity, we also use the term atoms when referring to the graph's vertices.
First, we search for a clique of at least 4 atoms whose pairwise distances are all known. Atoms in the clique are positioned using the CMDS algorithm. If no such clique can be found, the algorithm stops and reports failure; otherwise, we continue with a parallel breadth-first graph traversal, starting from the clique. At each step, one atom is placed, using the distances to 4 already determined atoms. The buildup process can stop before traversing the entire graph, in which case the currently determined partial solution is saved and a new buildup is started from a previously unexplored region of the graph. Partial solutions are merged together if they have at least 4 non-coplanar atoms in common. In the following sections we detail each step:
Fig. 1. Breadth-first traversal and buildup.

A. Clique search

The problem of finding a k-clique in a graph is known to be NP-complete [33]. However, we are interested only in finding candidate small cliques of 4-16 vertices, in the subgraph induced by a starting vertex and its neighborhood. We implement in the OpenCL framework the algorithm based on bit-level operations presented by D. Knuth in the Maxcliques program [34], [35]. The subgraph formed by the starting vertex and its neighbor vertices is represented by an adjacency matrix of size at most 16 × 16 bits, so that each row fits into a machine word. The maximal clique is found using the bitwise intersection of combinations of the rows. We assign an OpenCL work-item to each (previously unexplored) candidate vertex, to find multiple cliques in parallel in different areas of the initial graph. We choose the limit of 16 for the clique size for both computation time and storage constraints.
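The bitwise representation makes the clique test cheap: with each neighborhood row stored as a machine-word bitmask, a candidate set is a clique exactly when every member's mask covers the other members. The brute-force sketch below (our simplified illustration over at most 16 candidate vertices, not Knuth's Maxcliques code) conveys the idea:

from itertools import combinations

def find_clique(adj_masks, size=4):
    """Return the first clique of the given size, or None.

    adj_masks[i] is a bitmask of the neighbors of vertex i inside the
    (at most 16-vertex) candidate subgraph.
    """
    n = len(adj_masks)
    for combo in combinations(range(n), size):
        bits = 0
        for v in combo:
            bits |= 1 << v
        # every vertex of the combination must be adjacent to all the others
        if all(adj_masks[v] & (bits & ~(1 << v)) == bits & ~(1 << v) for v in combo):
            return combo
    return None

# 5-vertex example: vertices 0-3 form a 4-clique, vertex 4 hangs off vertex 3.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
masks = [0] * 5
for a, b in edges:
    masks[a] |= 1 << b
    masks[b] |= 1 << a
print(find_clique(masks))    # (0, 1, 2, 3)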
B. Clique placement

The distance matrix D[n×n] of a clique is a complete, symmetric matrix, where n is the size of the clique. We use the CMDS algorithm presented in Section II, implemented in OpenCL. Since we limited the clique size to a small value, the D matrix is loaded into the fast local memory of the OpenCL compute unit. Matrix operations and eigendecompositions are done entirely in OpenCL.

C. Parallel breadth-first build-up

After the coordinates of a base of 4 (or more) atoms have been determined, we place the remaining atoms successively, by breadth-first graph traversal.

The key computation here is to determine the coordinates x of an atom, based on knowing the exact distances d_i to at least b = 4 atoms v_1, v_2, ..., v_b that were already placed. This method computes the intersections of spheres (analogous to the ruler-and-compass method in 2D), by solving the following system of equations using linear algebra methods:

\|x - v_1\| = d_1
\|x - v_2\| = d_2
...
\|x - v_b\| = d_b          (2)

By squaring the equations and subtracting the first equation from the others, we obtain a linear system:

2(v_1 - v_2)x' = \|v_1\|^2 - \|v_2\|^2 + d_2^2 - d_1^2
2(v_1 - v_3)x' = \|v_1\|^2 - \|v_3\|^2 + d_3^2 - d_1^2
...
2(v_1 - v_b)x' = \|v_1\|^2 - \|v_b\|^2 + d_b^2 - d_1^2          (3)

where x is the 3-component row vector of the unknowns and v_1, ..., v_b are the row vectors of the known coordinates. The system can be written in matrix form as

Ax' = B          (4)
The system of equations is overdetermined for b > 4. We solve it in the least-squares sense, such that ||Ax′ − B|| is minimal, by computing the pseudoinverse of A. If the singular value decomposition (SVD) is A = UΣV′, then the pseudoinverse is A⁺ = VΣ⁺U′, where Σ⁺ is computed by reciprocating the nonzero diagonal entries and transposing Σ.
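A NumPy sketch of this placement step (our own illustration; the paper solves the same system inside an OpenCL work-group): given b ≥ 4 already placed neighbors and the distances to them, build the linear system (3) and solve it through the SVD-based pseudoinverse. The function name place_atom is ours, not the paper's.

import numpy as np

def place_atom(V, d):
    """Place one atom from b >= 4 known neighbor positions V (b x 3) and distances d (b,)."""
    A = 2.0 * (V[0] - V[1:])                                # rows 2(v1 - vk), k = 2..b
    B = (np.sum(V[0] ** 2) - np.sum(V[1:] ** 2, axis=1)
         + d[1:] ** 2 - d[0] ** 2)                          # right-hand side of (3)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T                  # pseudoinverse (assumes s > 0)
    return A_pinv @ B

# Example: recover a point from its distances to 4 non-coplanar reference points.
V = np.array([[0.0, 0, 0], [1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]])
x_true = np.array([0.3, -0.2, 0.7])
d = np.linalg.norm(V - x_true, axis=1)
print(place_atom(V, d))    # approximately [0.3, -0.2, 0.7]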
The following algorithm is a novel combination of the
parallel breadth-first graph search described in [36], [37] and
the geometric buildup from [5], [22]–[24].
1: procedure BFS-BUILDUP(G, d, P, Q)
Input:
    G(V, E)       an undirected graph
    d_ij          the lengths of the edges
    P             the set of vertices that have already been determined
    Q             a queue, initialized with the vertices from P and the vertices adjacent to them
    status        the state of each vertex; we use 3 states: unvisited, visited, and determined (after the coordinates were computed)
Output:
    P             the set of vertices whose coordinates were determined
    x_v, v ∈ P    the list of three-dimensional coordinates of the vertices
    errQ          the set of vertices that were not determined, due to lack of basis vectors or numerical instability
2:   repeat
3:     errQ ← ∅
4:     while Q ≠ ∅ do
5:       outQ ← ∅
6:       for each vertex a ∈ Q do in parallel
7:         v_1, ..., v_b ← vertices adjacent to a such that v_k ∈ P
8:         B_a ← {x[v_1], ..., x[v_b]}             ⊲ build the basis matrix
9:         d_1, ..., d_b ← the known distances from a to v_k
10:        construct Eq. (2)
11:        convert to Eq. (3)
12:        if b ≥ 4 and STABLE(B_a) then
13:          x[a] ← SOLVE Eq. (3)
14:          P ← P ∪ {a}
15:          status[a] ← determined
16:          for each u adjacent to a do in parallel
17:            if status[u] = unvisited then        ⊲ atomic compare-exchange
18:              status[u] ← visited
19:              outQ ← outQ ∪ {u}
20:            end if
21:          end parallel for
22:        else
23:          errQ ← errQ ∪ {a}                      ⊲ vertex placed in the error queue
24:        end if
25:      end parallel for
26:      Q ← outQ
27:    end while
28:    Q ← errQ                                     ⊲ re-iterate over the error queue
29:  until no placement was done in the inner loop
30:  return P, errQ
31: end procedure
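For reference, the control flow of BFS-BUILDUP can be sketched sequentially (our single-threaded illustration only; in the OpenCL version each frontier vertex is handled by a separate work-group and the queues are maintained with atomic operations). The data layout and the condition-number threshold below are our own assumptions:

import numpy as np

def bfs_buildup(adj, dist, coords):
    """adj[v]: set of neighbors of v; dist[(u, v)] with u < v: known distance;
    coords: dict of already placed vertices (e.g. the initial clique)."""
    placed = dict(coords)
    queue = {u for v in placed for u in adj[v] if u not in placed}
    progress = True
    while progress:
        progress, err_queue = False, set()
        while queue:
            frontier, queue = queue, set()
            for a in frontier:                          # one work-group per vertex in OpenCL
                basis = [u for u in adj[a] if u in placed]
                if len(basis) < 4:
                    err_queue.add(a)
                    continue
                V = np.array([placed[u] for u in basis])
                d = np.array([dist[tuple(sorted((a, u)))] for u in basis])
                A = 2.0 * (V[0] - V[1:])                # linear system (3)
                B = np.sum(V[0]**2) - np.sum(V[1:]**2, axis=1) + d[1:]**2 - d[0]**2
                if np.linalg.cond(A) > 1e6:             # unstable basis: defer this vertex
                    err_queue.add(a)
                    continue
                placed[a] = np.linalg.lstsq(A, B, rcond=None)[0]
                progress = True
                queue |= {u for u in adj[a] if u not in placed}
            queue -= set(placed)
        queue = err_queue - set(placed)                 # re-iterate over the error queue
    return placed

# Demo: complete graph on 6 random points, the first 4 given as the starting clique.
P = np.random.rand(6, 3)
adj = {i: {j for j in range(6) if j != i} for i in range(6)}
dist = {(i, j): float(np.linalg.norm(P[i] - P[j])) for i in range(6) for j in range(i + 1, 6)}
out = bfs_buildup(adj, dist, {i: P[i] for i in range(4)})
print(max(np.linalg.norm(out[i] - P[i]) for i in range(6)))   # close to machine precision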
The evolution of the previous algorithm is depicted in Fig. 1. The inner region with vertices 1, ..., 4 represents the initial clique, while the outer shells represent successive frontiers of the traversal. Vertices 5 and 6 are in the same frontier and are evaluated in parallel. After each iteration, the output queue contains the next frontier, which is used as input for the next iteration (Fig. 2).
We use the two-level parallelism offered by OpenCL: at the workgroup level, each workgroup is assigned one vertex from the input queue. Inside a workgroup, the work-items cooperate in the linear algebra routines that solve the system of equations determining the vertex coordinates. The output queue is written concurrently by the workgroups at the location provided by a global queue pointer, which is incremented using OpenCL's intrinsic atom_inc function. Another atomic function, atom_xchg, is used to avoid placing the same vertex twice in the queue. Although atomic functions can be a bottleneck for parallel algorithms, in our case they are outweighed by the computation time spent in the linear algebra routines. As stated before, we use a singular value decomposition based solver. The heart of the algorithm is the eigendecomposition of AA′ and A′A. We choose the iterative power method to find the largest eigenvalues and their corresponding eigenvectors, due to its simplicity of implementation.

Fig. 2. Mapping of the algorithm to OpenCL.
1: procedure EIGEN-POWER(A)
2:   x_1 ← random row vector
3:   for k ← 1, 2, ... do
4:     y_{k+1} ← A x′_k
5:     λ ← x_k A x′_k
6:     x_{k+1} ← y′_{k+1} / \|y_{k+1}\|
7:   end for
8:   return λ, x
9: end procedure
In the algorithm above, x_k converges to the eigenvector corresponding to the largest eigenvalue λ of the matrix A. The procedure is repeated for every eigenvalue. The matrix × vector, vector × scalar, and vector norm operations are parallelized.
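A NumPy sketch of the power iteration (our own illustration of the EIGEN-POWER procedure; the paper parallelizes the matrix-vector products across OpenCL work-items). A fixed iteration count replaces a convergence test for brevity:

import numpy as np

def eigen_power(A, iters=200):
    """Dominant eigenpair of a symmetric matrix A by power iteration."""
    x = np.random.rand(A.shape[0])
    lam = 0.0
    for _ in range(iters):
        y = A @ x                         # matrix-vector product (parallelized in OpenCL)
        lam = (x @ A @ x) / (x @ x)       # Rayleigh quotient estimate of the eigenvalue
        x = y / np.linalg.norm(y)         # normalize
    return lam, x

M = np.array([[4.0, 1.0], [1.0, 3.0]])
lam, v = eigen_power(M)
print(lam, np.allclose(M @ v, lam * v, atol=1e-6))   # largest eigenvalue ~4.618, True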
A major drawback of the buildup procedure is that numerical errors accumulate. Fig. 3 plots the accumulation of the placement errors (in Å) over several levels of breadth-first traversal. The displayed data are the mean errors collected from running the algorithm over the first 5 structures from Table I, until reaching an error of 1 Å; this requires approximately 40 buildup steps.

Fig. 3. Accumulation of errors during breadth-first search (placement error in Å versus breadth-first traversal level).

To prevent unlimited growth of the errors, we stop expanding the vertices which cannot be accurately determined during BFS-BUILDUP. Unstable systems are detected by computing the condition number κ = |ρ_max| / |ρ_min|, where ρ_max and ρ_min are the maximal and minimal singular values of the matrix A from Eq. (4). If κ exceeds a given threshold, we do not solve the system; instead, we put the problematic vertex in a special error queue (errQ) and continue the breadth-first traversal with the remaining vertices in the input queue. The heuristic is that, in a second pass over errQ, the system may become solvable thanks to the additional vertices determined in the meantime.
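The stability test can be expressed directly with the singular values of A; the sketch below is our own illustration (the threshold value 1e6 is a hypothetical choice, the paper does not state one) and mirrors the decision of deferring ill-conditioned placements to the error queue:

import numpy as np

def is_stable(A, kappa_max=1e6):
    """Accept the basis matrix A only if its condition number is below kappa_max."""
    s = np.linalg.svd(A, compute_uv=False)    # singular values, largest first
    return s[0] < kappa_max * s[-1]           # equivalent to cond(A) < kappa_max

good = 2.0 * (np.array([[0, 0, 0]]) - np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]]))
bad = 2.0 * (np.array([[0, 0, 0]]) - np.array([[1.0, 0, 0], [2.0, 0, 0], [3.0, 0, 0]]))
print(is_stable(good), is_stable(bad))        # True False (collinear basis is degenerate)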
D. Merging of partial solutions
Steps A, B, and C are reiterated until all vertices are visited. Typically, each breadth-first exploration ends when the accumulated numerical errors exceed a given threshold or when connectivity is missing in the graph. We store the partial solutions and start a new buildup from a different starting clique. It is possible that two partial solutions share some vertices. If at least 4 non-planar vertices are shared, their positions can be aligned unambiguously. To align two partial solutions, one must be translated and rotated into the other's system of coordinates, using the Kabsch algorithm [38]: let X and Y be the n × 3 matrices of the coordinates of the two structures, both centered around the origin. Then compute Z = XR, where R is the optimal rotation matrix that aligns X to Y so that the RMSE E is minimized:
E = \sum_{i=1}^{n} \|(XR)_i - y_i\|^2          (5)
1: procedure KABSCH(X, Y)
Input: X[s×3], Y[s×3] matrices of coordinates corresponding to s common vertices found in two different buildups
Output: R[3×3] the rotation matrix that minimizes Eq. (5)
2:   C ← X′Y              ⊲ compute the covariance matrix
3:   UΣV′ ← C             ⊲ singular value decomposition
4:   R ← UV′              ⊲ compute the rotation matrix
5:   Z ← XR               ⊲ rotate the X coordinate set
6:   return R, Z
7: end procedure
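A NumPy sketch of the alignment step (our own illustration, with the centering of both coordinate sets made explicit). Note that, unlike the pseudocode above, the sketch adds the standard determinant correction used in Kabsch implementations so that a reflection is never returned:

import numpy as np

def kabsch(X, Y):
    """Optimal rotation R (3x3) aligning point set X onto Y (both s x 3, s >= 3)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)    # center both coordinate sets
    C = Xc.T @ Yc                                      # covariance matrix
    U, S, Vt = np.linalg.svd(C)
    sign = np.sign(np.linalg.det(U @ Vt))              # standard correction: avoid reflections
    R = U @ np.diag([1.0, 1.0, sign]) @ Vt
    return R, Xc @ R + Y.mean(axis=0)                  # rotated and translated copy of X

# Demo: Y is a randomly rotated and shifted copy of X; kabsch recovers the overlap.
rng = np.random.default_rng(0)
X = rng.random((5, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))           # random orthogonal matrix
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]                                 # make it a proper rotation
Y = X @ Q + np.array([1.0, -2.0, 0.5])
R, Z = kabsch(X, Y)
print(np.allclose(Z, Y, atol=1e-8))                    # True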
Once two partial solutions are aligned, they are merged to form one structure. The process is repeated for every partial solution. The algorithm might fail to merge all partial solutions when the distance matrix is too sparse; in this case, the output will be a list of separate portions of the determined conformation. The merging process is shown in Fig. 4.
Finally, we present the complete algorithm which describes
our method, consisting of exploring and combining different
regions of the graph:
1: procedure COMPLETE-BUILDUP(V, E)
2:   F ← ∅
3:   W ← V                                    ⊲ the set of unexplored vertices
4:   while W ≠ ∅ do
5:     choose any v ∈ W                       ⊲ choose an unexplored vertex
6:     W ← W − {v}
7:     C ← FINDCLIQUE(v, neighbors(v))
8:     if |C| ≥ 4 then
9:       P ← PLACECLIQUE(C)
10:      Q ← P ∪ neighbors(P)
11:      P ← BFS-BUILDUP(V, E, P, Q)          ⊲ partial solution
12:      F ← KABSCH(F, P)                     ⊲ try to align and merge
13:      W ← W − P
14:    end if
15:  end while
16:  return F                                 ⊲ final solution
17: end procedure
E. Analysis
The time complexity of the algorithm depends largely on the structure of the graph G(V, E). We consider a graph to be 4-connected if we cannot find 3 vertices whose removal disconnects the graph. The BFS-BUILDUP algorithm stops when the frontier of a 4-connected subgraph is encountered, and is restarted by the COMPLETE-BUILDUP routine from a previously unexplored region. In general, the breadth-first search takes O(|E| + |V|) time.
Aligning two partial solutions S_i, S_j by the Kabsch routine requires time in O(s · max(|S_i|, |S_j|)), where s is the number of atoms used to compute the rotation matrix (atoms which are part of both S_i and S_j). We choose 4 ≤ s ≤ 16, independently of n = |V|, the total number of atoms in the structure. Therefore, the alignment routine's time complexity is bounded by O(n). However, depending on the graph structure (sparsity), the algorithm is re-iterated. The linear algebra routines are performed only on small matrices, whose sizes are independent of |V|, so the asymptotic complexity of BFS-BUILDUP is linear.
V. EXPERIMENTAL RESULTS
We create a data set artificially from a random set of structures from the Protein Data Bank [39], with sizes ranging from 516 to 150,720 atoms (listed in Table I), by building the graph associated with each molecule, with atoms as vertices and edges connecting atoms that are less than 5 Å apart, corresponding to the distances observable by the Nuclear Overhauser Effect [40]. We also use the structures tested in the related works [22], [24], [25].
We determine the graph density and clustering coefficient
for the data structures. The density of an undirected graph
G(V, E) is defined as:
\rho = \frac{2|E|}{|V|\,(|V|-1)}          (6)
while the average local clustering coefficient is:
c = \frac{1}{|V|} \sum_{i=1}^{|V|} \frac{2|E_i|}{|V_i|\,(|V_i|-1)}          (7)
where V_i and E_i are the vertices and edges of the subgraph induced by the vertex i and its adjacent nodes [41]. The local clustering coefficient shows how close the neighborhood of a vertex is to being a clique (i.e., how well the build-up algorithm can reconstruct the coordinates).
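These two statistics can be computed directly from the adjacency structure; the sketch below is our own illustration (not the code used to produce Table I) and assumes the graph is given as a dictionary of neighbor sets:

def density_and_clustering(adj):
    """Graph density (6) and average local clustering coefficient (7).

    adj: dict mapping each vertex to the set of its neighbors.
    """
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2           # |E|
    rho = 2.0 * m / (n * (n - 1))
    c_sum = 0.0
    for v, nb in adj.items():
        sub = nb | {v}                                     # V_i: v together with its neighbors
        k = len(sub)
        e = sum(len(adj[u] & sub) for u in sub) // 2       # |E_i|: edges inside the subgraph
        if k > 1:
            c_sum += 2.0 * e / (k * (k - 1))
    return rho, c_sum / n

# 4-cycle example: density 4/6; each induced neighborhood has 3 vertices and 2 edges.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(density_and_clustering(adj))    # (0.666..., 0.666...)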
TABLE II
RESULTS OF THE BUILD-UP ALGORITHM. DURATIONS IN SECONDS, ERRORS IN Å; "DBL"/"SGL" DENOTE THE DOUBLE- AND SINGLE-PRECISION OPENCL IMPLEMENTATIONS, "JAVA" THE REFERENCE JAVA IMPLEMENTATION; "WIDTH" IS THE TRAVERSAL WIDTH.

PDB Id   Atoms     Placed    Dur dbl   Err dbl    Width    Dur sgl   Err sgl    Dur Java
1RDG     516       511       0.19      4.54e-04   171      0.09      4.41e-04   0.12
1HOE     581       574       0.31      1.03e-06   136      0.11      1.90e-04   0.14
1LFB     641       628       0.34      9.28e-07   173      0.13      8.48e-04   0.10
1PHT     988       949       0.19      9.20e-07   235      0.10      1.91e-03   0.21
1POA     1,067     1,038     0.39      8.51e-07   243      0.15      4.86e-03   0.26
1AX8     1,074     1,059     0.56      1.16e-06   288      0.22      9.21e-03   0.27
2KXD     1,142     1,140     0.30      1.02e-06   296      0.11      5.82e-05   0.14
1VMP     1,166     1,161     0.18      1.70e-06   353      0.06      3.90e-04   0.28
1HAA     1,310     1,306     0.35      1.29e-06   314      0.12      1.49e-04   0.28
1F39     1,653     1,612     0.63      1.62e-06   325      0.35      1.45e-02   0.32
1GPV     1,842     1,797     0.56      1.79e-06   258      0.26      5.75e-02   0.48
1RGS     2,059     1,986     0.86      1.87e-06   168      0.39      2.79e-04   0.46
1BPM     3,673     3,627     1.24      1.84e-06   923      0.61      1.61e-01   0.53
2G33     4,658     4,501     1.76      1.32e-03   210      0.76      7.01e-03   1.05
3R1C     6,865     6,824     3.35      1.99e-06   305      1.37      1.90e-01   3.56
2F8V     7,409     7,173     3.11      4.43e-06   495      1.38      7.15e-03   3.06
2XTL     7,974     7,895     2.73      2.41e-06   1,387    1.28      3.43e+00   2.96
3FXI     12,500    12,106    6.17      9.75e-04   818      2.77      6.87e-01   9.76
1HMV     29,596    14,300    17.66     9.82e-04   581      8.83      4.14e-02   27.22
2VZ8     30,281    29,226    18.18     2.56e-01   1,732    8.75      1.17e-01   35.67
2VZ9     31,949    30,795    18.04     4.06e-05   1,965    9.74      1.25e+00   54.89
1AON     58,870    57,999    34.43     1.69e-04   1,839    16.98     4.04e+00   146.91
3OAA     99,573    97,744    101.84    1.78e-02   1,328    53.71     5.20e+00   226.65
3K1Q     101,798   93,266    101.87    1.75e-01   2,495    51.95     2.66e+01   179.53
1VRI     150,720   148,452   158.69    5.16e+01   27,815   75.53     2.31e+02   541.81
Table II depicts the results of the build-up algorithm. We compare the execution times and errors of two OpenCL implementations (double and single precision) to a reference Java implementation (in double precision) from our previous work [25] in Fig. 5. The tests (Java and OpenCL) were run on an Intel Core i7-2600K CPU @ 3.4 GHz, with 4 hyper-threading cores. We measure the execution time in seconds, the number of atoms successfully determined, the traversal width (the maximum number of vertices placed in parallel, which measures the parallelism of the BFS algorithm), and the maximum vertex error, defined as

E_{max} = \max_i \sqrt{\frac{1}{n_i} \sum_j (\|x_i - x_j\| - \delta_{i,j})^2}          (8)

where the summation is taken over the n_i neighbors of vertex i.

Next, we concentrate only on the BFS-Buildup kernel. To test the scalability of our implementation, we note that the execution time can be expressed as T_{n,p,h}, where n is the problem size (correlated with the number of atoms and known edges), p the number of threads (OpenCL work-items), and h the chosen hardware (multi-core CPU or GPU).

We vary the problem size from molecules containing 1,074 to 58,870 atoms and, independently, the maximum number of parallel launched threads from p = 1 to p = 4096. We run each set of tests both on the Intel Core i7 CPU and the Nvidia GeForce GTX 560Ti GPU.
Fig. 4. Merging of partial solutions: (a) Partial Solution 1, (b) Partial Solution 2, (c) Merged. The common atoms that serve as a basis for the Kabsch algorithm are highlighted. Note that the first solution is rotated in 3D for alignment in the merged solution. Dashed lines link pairs of atoms with known distances, while solid lines show chemical bonds.
TABLE I
THE MOLECULES USED IN THE EXPERIMENTS, CUTOFF DISTANCE 5 Å.

PDB ID   # of Atoms   Min deg   Avg deg   # known distances   Density (%)   Clustering coeff. c
1RDG     516          1         22.51     5,808               4.37          0.578
1HOE     581          3         22.72     6,600               3.92          0.594
1LFB     641          5         21.76     6,974               3.40          0.619
1PHT     988          3         27.77     13,720              2.81          0.597
1POA     1,067        2         23.48     12,525              2.20          0.574
1AX8     1,074        2         23.28     12,502              2.17          0.591
2KXD     1,142        9         39.29     22,436              3.44          0.608
1VMP     1,166        1         41.20     24,020              3.54          0.614
1HAA     1,310        8         40.92     26,801              3.13          0.603
1F39     1,653        4         23.17     19,154              1.40          0.575
1GPV     1,842        7         25.91     23,863              1.41          0.609
1RGS     2,059        3         23.04     23,720              1.12          0.589
1BPM     3,673        3         24.40     44,818              0.66          0.567
2G33     4,658        4         22.67     52,807              0.49          0.617
3R1C     6,865        3         27.30     93,699              0.40          0.605
2F8V     7,409        1         22.29     82,587              0.30          0.578
2XTL     7,974        4         25.03     99,805              0.31          0.553
3FXI     12,500       3         22.95     143,426             0.18          0.564
1HMV     29,596       3         23.37     345,757             0.08          0.586
2VZ8     30,281       3         23.14     350,358             0.08          0.570
2VZ9     31,949       3         23.14     369,726             0.07          0.568
1AON     58,870       3         24.18     711,737             0.04          0.585
3OAA     99,573       3         23.54     1,171,764           0.02          0.566
3K1Q     101,798      3         24.40     1,241,974           0.02          0.580
1VRI     150,720      3         25.32     1,907,977           0.02          0.569
For a given problem size and hardware architecture, we measure the speedup of the algorithm as

S_{n,p,h} = \frac{T_{n,1,h}}{T_{n,p,h}}          (9)

where T_{n,1,h} is the execution time of the OpenCL kernel on a single thread. The OpenCL framework allows launching a much larger number of work-items than the physical execution units found on the device; in this case, some of the threads are executed sequentially. We launch p threads, then wait for completion and repeat the operation until all items in the input queue of algorithm BFS-Buildup are processed:

while (input queue not empty)
    clEnqueueNDRangeKernel(global_work_size = p)
    clFinish()

The results for the large problem size are plotted in Fig. 6. We also compare the execution times on the two devices; on the graph we notice that the OpenCL kernel running on the GPU starts to outperform the kernel running on the multi-core CPU (T_CPU / T_GPU > 1) starting from about p > 128 threads. The figure also reveals that the speedup saturates starting at a number of threads related to the physical number of cores on the device: the GTX 560Ti GPU has 8 Streaming Multiprocessors (each containing 48 scalar processors), while the tested Core i7 processor has 4 hyper-threading cores (each core is capable of running 2 threads). We notice a small increase of speedup even when the number of launched threads exceeds the number of physical execution units; we attribute this to the overhead associated with launching OpenCL kernels.

For the small problem size (Fig. 7), the speedup saturates more quickly, due to Amdahl's law: at 64 threads and above, the speedup remains at about 16 times on the GPU, even if there are more computing resources available. Also, the graph reveals that the GPU does not offer any speedup over the CPU in this case.

Fig. 5. Execution times for 3 different implementations.

Fig. 6. Speedup of the BFS kernel, in the case of a large molecule (≈ 50,000 atoms).

Fig. 7. Speedup of the BFS kernel, in the case of a smaller molecule (≈ 1,000 atoms).
VI. CONCLUSIONS
We have presented a novel parallel solution for the molecular distance geometry problem. From experiments on parallel systems (multi-core CPU and GPU), we conclude that our implementation shows moderate scalability. For very large molecules and datasets, these improvements may become critical. Future improvements of the algorithm will include optimization of the linear algebra OpenCL routines and parallel computation of different partial solutions. We also plan to further improve scalability.
REFERENCES
[1] T. Kamada and S. Kawai, “An algorithm for drawing general undirected
graphs,” Inf. Process. Lett., vol. 31, pp. 7–15, April 1989.
[2] U. Brandes and C. Pich, “Eigensolver methods for progressive multidimensional scaling of large data,” in Graph Drawing, ser. Lecture Notes
in Computer Science, M. Kaufmann and D. Wagner, Eds. Springer
Berlin / Heidelberg, 2007, vol. 4372, pp. 42–53.
[3] R. Davidson and D. Harel, “Drawing graphs nicely using simulated
annealing,” ACM Trans. Graph., vol. 15, pp. 301–331, October 1996.
[4] W. Torgerson, “Multidimensional scaling: I. Theory and method,” Psychometrika, vol. 17, no. 4, pp. 401–419, Dec. 1952.
[5] I. Borg and P. Groenen, Modern Multidimensional Scaling: Theory and
Applications. Springer, 2005.
[6] G. Seber, Multivariate observations, ser. Wiley series in probability and
statistics. Wiley-Interscience, 2004.
[7] T. F. Havel, “Distance geometry: Theory, algorithms and chemical
applications,” in Encyclopedia of Computational Chemistry. Wiley,
1998, pp. 723–742.
[8] W. Glunt, T. L. Hayden, and M. Raydan, “Molecular conformations from
distance matrices,” J. Comput. Chem., vol. 14, pp. 114–120, January
1993.
[9] C. Lavor, L. Liberti, and N. Maculan, “Molecular distance geometry
problem,” in Encyclopedia of Optimization, C. A. Floudas and P. M.
Pardalos, Eds. Springer US, 2009, pp. 2304–2311.
[10] L. M. Blumenthal, Theory and applications of distance geometry.
Bronx, NY: Chelsea, 1970.
[11] T. Havel, I. Kuntz, and G. Crippen, “The theory and practice of distance
geometry,” Bulletin of Mathematical Biology, vol. 45, pp. 665–720,
1983.
[12] M. Pharr and R. Fernando, GPU Gems 2: programming techniques
for high-performance graphics and general-purpose computation.
Addison-Wesley Professional, 2005. [Online]. Available: http://http.
developer.nvidia.com/GPUGems2/gpugems2 chapter43.html
[13] J. de Leeuw, “Applications of convex analysis to multidimensional
scaling,” in Recent Developments in Statistics, J. Barra, F. Brodeau,
G. Romier, and B. V. Cutsem, Eds.
Amsterdam: North Holland
Publishing Company, 1977, pp. 133–146.
[14] E. R. Gansner, Y. Koren, and S. North, “Graph drawing by stress
majorization,” in Graph Drawing. Springer, 2004, pp. 239–250.
[15] J. J. Moré and Z. Wu, “Distance geometry optimization for protein
structures,” J. of Global Optimization, vol. 15, pp. 219–234, October
1999.
[16] A. Grosso, M. Locatelli, and F. Schoen, “Solving molecular distance
geometry problems by global optimization algorithms,” Computational
Optimization and Applications, vol. 43, pp. 23–37, 2009.
[17] M. Nilges, A. M. Gronenborn, A. T. Brunger, and M. G. Clore,
“Determination of three-dimensional structures of proteins by simulated
annealing with interproton distance restraints. application to crambin,
potato carboxypeptidase inhibitor and barley serine proteinase inhibitor
2,” Protein Eng., vol. 2, no. 1, pp. 27–38, Apr. 1988.
[18] A. H. C. Kampen, L. M. C. Buydens, C. B. Lucasius, and M. J. J. Blommers, “Optimisation of metric matrix embedding by genetic algorithms,”
Journal of Biomolecular NMR, vol. 7, pp. 214–224, 1996.
[19] R. Leardi, Nature-inspired methods in chemometrics: genetic algorithms
and artificial neural networks, ser. Data handling in science and technology. Elsevier, 2003.
[20] Q. Dong and Z. Wu, “A linear-time algorithm for solving the molecular
distance geometry problem with exact inter-atomic distances,” Journal
of Global Optimization, vol. 22, pp. 365–375, 2002.
[21] ——, “A geometric build-up algorithm for solving the molecular
distance geometry problem with sparse distance data,” J. of Global
Optimization, vol. 26, pp. 321–333, July 2003.
[22] D. Wu and Z. Wu, “An updated geometric build-up algorithm for solving
the molecular distance geometry problems with sparse distance data,” J.
of Global Optimization, vol. 37, pp. 661–673, April 2007.
[23] R. Davis, C. Ernst, and D. Wu, “Protein structure determination via
an efficient geometric build-up algorithm,” BMC Structural Biology,
vol. 10, no. Suppl 1, p. S7, 2010.
[24] A. Sit, Z. Wu, and Y. Yuan, “A geometric buildup algorithm for
the solution of the distance geometry problem using Least-Squares
approximation,” Bulletin of Mathematical Biology, 2007.
[25] L. Fabry-Asztalos, I. Lőrentz, and R. Andonie, “Molecular Distance
Geometry Optimization using Geometric Build-up and Evolutionary
Techniques on GPU,” in Proc. of the IEEE Symposium on Computational
Intelligence in Bioinformatics and Computational Biology CIBCB, 2012,
to be published.
[26] V. D. Silva and J. B. Tenenbaum, “Sparse multidimensional scaling using
landmark points,” Stanford, Tech. Rep. 6, 2004.
[27] C. Lavor, L. Liberti, N. Maculan, and A. Mucherino, “Recent advances
on the discretizable molecular distance geometry problem,” European
Journal of Operational Research, 2011.
[28] A. Mucherino, C. Lavor, L. Liberti, and E.-G. Talbi, “A parallel version
of the branch and prune algorithm for the molecular distance geometry
problem,” in Proceedings of the ACS/IEEE International Conference on
Computer Systems and Applications, ser. AICCSA. Washington, DC,
USA: IEEE Computer Society, 2010, pp. 1–6.
[29] L. Liberti, C. Lavor, B. Masson, and A. Mucherino, “Polynomial cases
of the discretizable molecular distance geometry problem,” CoRR, vol.
abs/1103.1264, 2011, informal publication.
[30] G. M. Crippen, “Linearized embedding: a new metric matrix algorithm for calculating molecular conformations subject to geometric
constraints,” J. Comput. Chem., vol. 10, pp. 896–902, October 1989.
[31] B. A. Hendrickson, “The molecular problem: Determining conformation
from pairwise distances,” Cornell University, Ithaca, NY, USA, Tech.
Rep., 1990.
[32] Khronos OpenCL Working Group, The OpenCL Specification, version
1.1, 2010. [Online]. Available: http://www.khronos.org/registry/cl/specs/
opencl-1.1.pdf
[33] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide
to the Theory of NP-Completeness (Series of Books in the Mathematical
Sciences). W. H. Freeman, 1979.
[34] D. Knuth. (2008) MAXCLIQUES algorithm for listing all maximal
cliques. [Online]. Available: http://www-cs-staff.stanford.edu/∼uno/
programs.html
[35] N. Jardine and R. Sibson, Mathematical taxonomy, ser. Wiley series
in probability and mathematical statistics. Probability and mathematical
statistics. Wiley, 1971.
[36] S. Hong, T. Oguntebi, and K. Olukotun, “Efficient Parallel Graph
Exploration on Multi-Core CPU and GPU,” in International Conference
on Parallel Architectures and Compilation Techniques (PACT), 2011.
[37] D. Merrill, M. Garland, and A. Grimshaw, “High Performance and
Scalable GPU Graph Traversal,” Department of Computer Science,
University of Virginia, Tech. Rep., 2011.
[38] W. Kabsch, “A solution for the best rotation to relate two sets of vectors.”
Acta Crystallographica, vol. 32, pp. 922–923, 1976.
[39] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat,
H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The Protein Data
Bank,” Nucleic Acids Research, vol. 28, pp. 235–242, 2000. [Online].
Available: http://www.rcsb.org/pdb/
[40] K. Wüthrich, "Protein structure determination in solution by NMR spectroscopy," Journal of Biological Chemistry, vol. 265, no. 36, pp. 22059–22062, Dec. 1990.
[41] R. D. Luce and A. D. Perry, “A method of matrix analysis of group
structure,” Psychometrika, 1949.