Solving Molecular Distance Geometry Problems in OpenCL István Lőrentz Răzvan Andonie Levente Fabry-Asztalos Electronics and Computers Department, Transilvania University, Braşov, Romania Email: isti spl@yahoo.com Computer Science Department, Central Washington University, Ellensburg, WA, USA and Electronics and Computers Department, Transilvania University, Braşov, Romania Email: andonie@cwu.edu Department of Chemistry, Central Washington University, Ellensburg, WA, USA, Email: FabryL@cwu.edu Abstract—We focus on the following computational chemistry problem: Given a subset of the exact distances between atoms, reconstruct the three-dimensional position of each atom in the given molecule. The distance matrix is generally sparse. This problem is both important and challenging. Our contribution is a novel combination of two known techniques (parallel breadthfirst search and geometric buildup) and its OpenCL parallel implementation. The approach has the potential to speed up computation of three-dimensional structures of molecules - a critical process in computational chemistry. From experiments on multi-core CPUs and graphic processing units, we conclude that, for sufficient large problems, our implementation shows a moderate scalability. I. I NTRODUCTION Knowing the three-dimensional structures of molecules is critical in many scientific fields, especially in chemical biology and medicinal chemistry. They play a critical role in molecular interactions (e.g. between two biologically significant large proteins and between an enzyme and its therapeutically important small-molecule inhibitors) and strongly contribute to chemical and biological properties. Experimentally structural characteristics can be determined using X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy, while theoretically they can be determined using various potential energy minimization and bioinformatics techniques. In chemistry distance geometry problems often arise, especially when determining the exact structures of large molecules (e.g. proteins). Furthermore, many chemical structures (large and small) cannot be crystallized; therefore, their structures cannot be determined using X-ray crystallography. In addition, when structures are determined using NMR spectroscopy, especially of very large molecules, often only partial structural information can be determined due to the experimental limitations of the method. The Molecular Distance Geometry Problem (MDGP) aims to reconstruct the three-dimensional position (x, y, z coordinates) of each atom from the pairwise distances given. More formally, given a set of n atoms, and the D = {dij } set of Euclidean distances between atoms, the task is to find 978-1-4673-1653-8/12/$31.00 '2012 IEEE positions x1 , ..., xn ∈ R3 of the atoms in the molecule such that ||xi − xj || = dij . The geometry problem, besides the chemistry applications, has applications in other fields, like graph drawing [1]–[3] and placement of wireless sensors. In graph theory, the geometric distance problem corresponds to finding the embedding of an undirected graph to R3 , by considering the atoms as a graph’s vertices and the known interatomic distances as the weights of the edges. If all pairwise distances are accurately known (in a complete graph), the classical multidimensional scaling solves the problem by eigendecomposition of the squared, normalized distance matrix [4]–[6]. The problem can be also regarded as a nonlinear global optimization problem - to find the conformation X that minimizes the given stress function: X 2 σ 2 (X) = wij (kxi − xj k − δij ) (1) i<j≤n where wij is a positive weight associated to the (i, j) pair, wij = 1 if the distance (i, j) is known and 0 otherwise and δij is the measured, known distance. However, real-life data is often characterized by: • Incomplete distances. Usually only distances between certain types of atoms closer than a given threshold are known. • Imprecise distances, affected by measurement error. In some cases each distance is given as lower and upper bound. Based on these input data, the problem to be solved falls into one of the following types: a. Interatomic distances between all pairs are known precisely (corresponding to a complete graph with n(n − 1)/2 edges). b. All interatomic distances are known, but affected by error. An upper and a lower bound is given for each. c. Only a subset of the exact distances are known. The resulting distance matrix is sparse. d. A subset of distances affected by measurement error is known. 1421 We focus on Type c problems. Our contribution is a novel combination of two known techniques (parallel breadth-first search and geometric buildup) for solving such problems. We have implemented and tested our model on a multi-core system using OpenCL. After an overview of related work (Section II) and the OpenCL platform (Section III), in Section IV we describe our method. Section V contains the numerical results of several experiments and, in Section VI, we have our final remarks. II. R ELATED WORK Good overviews of existing methods can be found in [7]– [9]. For Type a problems, when all distances are known, a classical approach is multidimensional scaling (C MDS) [4], [10]. C MDS uses eigendecomposition to compute a threedimensional representation of the data given by the distance matrix D[n×n] : procedure C MDS(D) D(2) ← {d2ij } ⊲ squared distances J ← I − {1/n}[n×n] 4: B ← − 21 JD(2) J 5: compute eigendecomposition: QΛQ′ = B 6: let λ1 ≥ λ2 ≥ λ3 > 0 ⊲ the largest 3 positive ⊲ eigenvalues of B 7: Q+ ← [q1 , q2 , q3 ] ⊲ the matrix formed by the ⊲ associated eigenvectors 8: return X ← Q+ Λ1/2 9: end procedure In the above algorithm, D(2) represents the distance matrix, with each element squared, I the [n × n] identity matrix, and B is the double-centered matrix (sums of rows and sums of columns are zero). The resulting matrix, X[n×3] contains the x, y, z coordinates of all the n atoms. When some of the entries of the D matrix are missing or contain errors (Type b problems), another algorithm, E MBED [11] uses as inputs lower bound and upper bound distance matrices, so the real (unknown) distance dij satisfies lij ≤ dij ≤ uij . A preprocessing is performed to ensure that these bounds do not violate the triangle and tetrangle inequality. After preprocessing (distance smoothing), the coordinates are computed using the classical scaling. As a final step, the E MBED algorithm contains a coordinate refinement based on gradient descent. The distance smoothing based on triangle inequality can be solved by Floyd-Warshall’s all-pairs shortest paths algorithm. A parallel implementation on graphics processing units (GPU) can be found in [12]. However, the drawback of using non-sparse distance matrices is the O(n2 ) memory complexity, O(n3 ) time complexity for triangle smoothing and O(n4 ) for tetrangle inequality. Another class of algorithms solves the geometry problem as a global optimization problem. The used methods are: stress majoration [13], [14], global continuation on Gaussian smoothing of the error function [15] (the DGSOL program), monotonic basin hopping [16], data box searching [8], simulated annealing [17], and genetic algorithms [18], [19]. 1: 2: 3: Another approach is a geometric build-up procedure, presented in [5], [20]–[24], which places the atoms sequentially. Given 4 non-planar atoms, the location of the 5-th can be determined using sphere intersections. In this paper we present the parallelized build-up using breadth-first search, whose sequential version is described in our previous work [25]. A similar technique is used for sparse multidimensional scaling using a subset of landmark points which are embedded using the classical MDS. The remaining points are determined by triangulation relative to the landmarks [26]. In certain cases, the problem can be discretized: given a sequence of atoms, so that the interatomic distances for 3 consecutive atoms are known, the 4-th atom’s coordinate can be determined, by a symmetry ambiguity. For each atom, the algorithm must take a binary decision of placement, and the solution can be found by a binary tree traversal. A branchand-prune (B&P) algorithm was developed by [27], [28] (the MD-jeep program), used for protein backbones. In general, the complexity of the B&P algorithm grows exponentially with the number of atoms, however for special cases, the solution can be found in polynomial time, as shown in [29]. The authors of the mentioned work suggest that on real protein conformations, the B&P algorithm has polynomial complexity (result found empirically). In an earlier work, a “linearized embedding algorithm” for tree-graph had been studied in [30]. A comprehensive study based on graph rigidity and graph reduction can be found in [31]. III. O PEN CL OpenCL (Open Computing Language) [32] is an open industry standard parallel computing framework, designed to address a heterogeneous computing environment, across devices containing general-purpose GPUs (Graphics Processing Units) and multi-core CPUs. OpenCL uses a subset of the ISO C99 language, with extensions for parallelism. An OpenCL platform consists of a host computer, which controls one or more OpenCL devices. While OpenCL supports task-parallelism, the primary model is the data-parallel programming. This model is related to the Stream Processing paradigm, and to a relaxed Single Instruction Multiple Data (SIMD) model. The application, running on the host computer, controls the execution flow. The host communicates with devices by OpenCL command queues. A device contains one or more compute units (CU), which in turn are composed of one or more processing elements (PE), and local memory. Work items are lightweight threads grouped into work groups. A work group executes on a single compute unit. The programmer partitions the data into work-items to be executed in parallel and defines a kernel procedure. For data partitioning, OpenCL gives the choice of uni-, bi- and threedimensional blocks called NDRange. Kernels are queued for parallel execution on an OpenCL device. Each work-item has associated an N (1, 2 or 3)-dimensional global index, accessi- 1422 ble through the get_global_id(dimension) function, inside the kernels. The OpenCL memory access model is hybrid. Compute Units in a device have shared access to the device memory, thus CUs are programmed using shared-memory parallel techniques, however communication with the host memory or with other devices requires explicit memory block transfers, similar to distributed systems. the D matrix is loaded into the fast local memory of OpenCL compute unit. Matrix operations and eigendecompositions are done entirely in OpenCL. IV. A LGORITHMS Our approach consists of a parallel version of C MDS followed by geometric buildup. We determine the vertex coordinates by parallel breadth-first graph traversal. We will present briefly the stages, and later the detailed explanation of each. We aim to describe this method below. The problem is given as a weighted undirected graph G(V, E) where the vertices represents the atoms and the edges connect the pairs for which we know the distance in advance. For simplicity we use the term atoms also when referring to the graph’s vertices. First, we search for a clique of at least 4 atoms whose pairwise distances are all known. Atoms in the clique are positioned using the C MDS algorithm. If no such clique can be found, the algorithm stops and reports failure, otherwise, we continue with a parallel breadth-first graph traversal, starting from the clique. At each step one atom is placed, using the distances to already determined 4 atoms. The buildup process can stop before traversing the entire graph, in which case the currently determined partial solution is saved, and a new buildup is started from a previously unexplored region of the graph. Partial solutions are merged together, if they have at least 4 non co-planar atoms in common. In the following sections we detail each step: Fig. 1. C. Parallel breadth-first build-up After coordinates of a base of 4 (or more) atoms had been determined, we will place the remaining atoms successively, by breadth-first graph traversal. The key computation here is to determine the x coordinates of an atom, based on knowing the exact distances di to at least b = 4 atoms v1 , v2 , . . . , vb that were already placed. This method computes the intersections of spheres (analogous to the ruler-and-compass method in 2D), by solving the system of equations, using linear algebra methods. A. Clique search The problem of finding a k-clique in a graph is known as being NP-complete [33]. However, we are interested only in finding candidate small cliques of 4-16 vertices, in the subgraph induced by a starting vertex and its neighborhood. We implement in the OpenCL framework the algorithm based on bit-level operations presented by D. Knuth in the Maxcliques program [34], [35]. The subgraph formed by the starting vertex and its neighbor vertices are represented by an adjacency matrix, of size at most of 16 × 16 bits, so that each row fits into a machine word. The maximal clique is found using the bitwise intersection of the combinations of the rows. We assign to each (previously unexplored) candidate vertex an OpenCL work-item to find multiple cliques in parallel, in different areas of the initial graph. We choose the limit of 16 for a clique size for both computation time and storage constrains. Breadth-first traversal and buildup. kx − v1 k = kx − v2 k = .. . d1 d2 kx − vb k = db By squaring the equations and subtracting the first equation from the others we obtain a linear system: 2(x1 − v2 )x′ = kv1 k2 − kv2 k2 + d22 − d21 2(v1 − v3 )x′ = .. . kv1 k2 − kv3 k2 + d23 − d21 2(v1 − vb )x′ = kv1 k2 − kvb k2 + d2b − d21 (3) where x is the 3-component row vector of the unknowns and v1 . . . vb are the row vectors of the known coordinates. The system can be written in matrix form as Ax′ = B B. Clique placement The distance matrix D[n×n] of a clique is a complete, symmetric matrix, where n is the size of the clique. We use the C MDS algorithm presented in the introduction, implemented in OpenCL. Since we limited the clique size to a small value, (2) (4) The system of equations is overdetermined for b > 4. We solve it in the least squares sense such as kAx′ − Bk is minimal, by computing the pseudoinverse of A. If the singular value decomposition (SVD) is A = UΣV′ , then 1423 the pseudoinverse A+ = VΣ+ U′ where Σ+ is computed by reciprocating the nonzero diagonals and transposing Σ. The following algorithm is a novel combination of the parallel breadth-first graph search described in [36], [37] and the geometric buildup from [5], [22]–[24]. 1: procedure B FS -B UILDUP(G, d, P, Q) Input: G(E, V ) an undirected graph dij the lengths of the edges P the set of vertices that had been already determined Q a queue, initialized with the vertices from P and the vertices adjacent to them status of each vertex. We use 3 states: unvisited, visited and determined (after the coordinates were computed). Output: P the set of determined coordinates xv v ∈ P the list of three-dimensional coordinates of vertices errQ the set of vertices that were not determined due to lack of basis vectors or numerical instability 2: repeat 3: errQ ← ∅ 4: while Q 6= ∅ do 5: outQ ← ∅ 6: for each vertex a ∈ Q do in parallel 7: v1 , . . . , vb ← vertices adjacent to a that vk ∈ P 8: Ba ← {x[v1 ], . . . , x[vb ]} ⊲ build the basis matrix 9: d1 , . . . , db ← the known distances from a to vk 10: construct Eq. (2) 11: convert to Eq. (3) 12: if b ≥ 4 and STABLE(Ba ) then 13: x[a] =SOLVE Eq. (3) 14: P ← P ∪ {a} 15: status[a] ← determined 16: for each u adjacent to a do in parallel 17: if status[u] = unvisited then 18: status[u] ← visited 19: outQ ← P ∪ {u} 20: end if ⊲ Atomic compare-exchange 21: end parallel for 22: else 23: errQ ← errQ ∪ {a} 24: ⊲ vertex placed in the error queue 25: end if 26: end parallel for 27: Q ← outQ 28: end while 29: Q ← errQ ⊲ Re-iterate over error queue 30: until no placement was done in the inner loop 31: return P, errQ 32: end procedure The evolution of the previous algorithm is depicted in Fig. 1. The inner region with vertices 1, . . . , 4 represents the initial clique, while the outer shells represents successive frontiers of the traversal. Vertices 5 and 6 are in the same frontier and are evaluated in parallel. After each iteration, the output queue contains the next frontier, which is used as input for the next iteration (Fig. 2). We use the two-level parallelism offered by OpenCL: On the workgroup level, each workgroup gets assigned one vertex from the input queue. Inside a workgroup, the work-items are cooperating in the linear algebra routines at solving the system of equations to determine the vertex coordinates. The Fig. 2. Mapping of the algorithm to OpenCL. output queue is written concurrently by the workgroups at the location provided by a global queue pointer which is incremented using OpenCL’s intrinsic atom_inc function. Another atomic function atom_xchg is used to avoid placing the same vertex twice in the queue. Although atomic functions can be a bottleneck to parallel algorithms, in our case they are outweighed by the computation time spent in the linear algebra routines. As stated before, we use a singular value decomposition based solver. The heart of the algorithm is the eigendecomposition of AA′ and A′ A. We choose the iterative power method to find the largest eigenvalues and their corresponding eigenvectors, due to it’s simplicity of implementation. 1: procedure E IGEN -P OWER(A) 2: x1 = random row vector 3: for k ← 1, 2, . . . do 4: yk+1 ← Ax′k 5: λ ← xk Ax′k ′ /kyk+1 k 6: xk+1 ← yk+1 7: end for 8: return λ, x 9: end procedure The algorithm above converges in xk to the eigenvector corresponding to the largest eigenvalue λ of the matrix A. The procedure is repeated for every eigenvalue. The matrix × vector, vector × scalar and vector norm operations are parallelized. A major drawback of the buildup procedure is that numerical errors are accumulating. We plot in Fig. 3. the accumulation of the placement errors (in Å) over several levels of breadth-first traversal. The displayed data are the mean errors collected from running the algorithm over the first 5 structures from Table I, until reaching an error of 1 Å. It requires approx. 40 buildup steps to plot Fig. 3. To prevent unlimited growth of the errors, we stop expanding the vertices which cannot be accurately determined during B FS B UILDUP. Unstable systems are detected by computing the condition number κ = |ρmax |/|ρmin |, where ρmax , ρmin are the maximal and minimal singular values of the matrix A 1424 10 Placement Error (A) 10 0 10 −10 10 −20 10 Fig. 3. 0 10 20 30 40 Iteration (breadth−first traversal level) Accumulation of errors during breadth-first search. from Eq. (4). If κ exceeds a given threshold, we don’t solve the system, instead we put the problematic vertex in a special error queue (errQ) and continue breadth-first traversal with the remaining vertices in the input queue. The heuristic is that in a second pass over errQ, it is possible that the system becomes solvable by the additional vertices determined. D. Merging of partial solutions Steps A,B,C are reiterated until all vertices are visited. Typically, each breadth-first exploration ends when the accumulated numerical errors exceeds a given threshold or missing connectivity in the graph. We store the partial solutions and start a new buildup from a different starting clique. It is possible that two partial solutions share some vertices. If at least 4 non-planar vertices are shared, their positions can be aligned unambiguously. To align two partial solutions, one must be translated and rotated to the other system of coordinates, using the Kabsch algorithm [38]: Let X and Y be the n × 3 matrices of the coordinates of the two structures, both structures are centered around the origin. Then, compute Z = XR, where R is the optimal rotation matrix that aligns X to Y in a way that the RMSE E is minimized E= n X ||(XR)i − yi ||2 (5) i=1 1: procedure K ABSCH(X, Y) Input: X[s×3] , Y[s×3] matrices of coordinates corresponding to s common vertices found in two different buildups. Output: R[3×3] the rotation matrix that minimizes eq. 5 2: C ← X′ Y ⊲ compute the covariance matrix 3: UΣV′ ← C ⊲ singular value decomposition 4: R ← UV′ ⊲ compute the rotation matrix 5: Z ← XR ⊲ rotate the X coordinate set 6: return R, Z 7: end procedure Once two partial solutions are aligned, they are merged to form one structure. The process is repeated for every partial solution. The algorithm might fail to merge all partial solutions, in case when the matrix is too sparse, in this case, the output will be a list of separate portions of the determined conformation. The merging process is shown in Fig. 4. Finally, we present the complete algorithm which describes our method, consisting of exploring and combining different regions of the graph: 1: procedure C OMPLETE -B UILDUP(V,E) 2: F ←∅ 3: W ←V ⊲ the set of unexplored vertices 4: while W 6= ∅ do 5: choose any v ∈ W ⊲ choose an unexplored vertex 6: W ← W − {v} 7: C ←F IND C LIQUE(v, neighbors(v)) 8: if |C| ≥ 4 then 9: P ← P LACE C LIQUE(C) 10: Q ← P ∪ neighbors(P ) 11: P ←B FS -B UILDUP(V, E, P, Q) ⊲ partial solution 12: F ←K ABSCH(F, P ) ⊲ try to align and merge 13: W ←W −P 14: end if 15: end while 16: return F ⊲ final solution 17: end procedure E. Analysis The time complexity of the algorithm depends largely on the structure of the graph G(V, E). We consider a graph to be 4-connected if we can’t find 3 vertices whose removal disconnects the graph. The BFB algorithm stops when a frontier of a 4-connected subgraph is encountered, and is restarted by the B UILDUP routine from a previously unexplored region. In general, the breadth-first search takes O(|E| + |V |) time. Aligning the two partial solutions Si , Sj by the Kabsch routine requires time in O(s max(|S1 |, |S2 |)), where s is the number of atoms that are used to compute the rotation matrix (and which are part of both S1 and S2 ). We choose 4 ≤ s ≤ 16, independently of n = |V | the number of total atoms in the structure. Therefore, the alignment routine’s time complexity is bounded by O(n). However, depending on the graph structure (sparsity), the algorithm is re-iterated. The linear algebra routines are performed only on small matrices, which are independent of |V |, so the asymptotic complexity of the B FS - BUILDUP is linear. V. E XPERIMENTAL RESULTS We create a data set artificially, from a random set of structures from the Protein Data Bank [39], with sizes ranging from 516 atoms to 150,720, listed in Table I, by building the graph associated to the molecule, with atoms as vertices and edges connecting atoms that are less than 5 Å apart, corresponding to the distances observable by the Nuclear Overhauser Effect [40]. We also use the structures tested in the related works [22], [24], [25]. We determine the graph density and clustering coefficient for the data structures. The density of an undirected graph G(V, E) is defined as: ρ= 2|E| |V | · (|V | − 1) (6) while the average local clustering coefficient is: c= |V | X i 2|Ei | |Vi | · (|Vi | − 1) (7) where Vi and Ei are the vertices and edges of the subgraph induced by the vertex i and its adjacenct nodes [41]. 1425 TABLE II R ESULTS OF THE BUILD - UP ALGORITHM . Name PDB Id Atoms Placed 1RDG 1HOE 1LFB 1PHT 1POA 1AX8 2KXD 1VMP 1HAA 1F39 1GPV 1RGS 1BPM 2G33 3R1C 2F8V 2XTL 3FXI 1HMV 2VZ8 2VZ9 1AON 3OAA 3K1Q 1VRI 516 581 641 988 1,067 1,074 1,142 1,166 1,310 1,653 1,842 2,059 3,673 4,658 6,865 7,409 7,974 12,500 29,596 30,281 31,949 58,870 99,573 101,798 150,720 511 574 628 949 1,038 1,059 1,140 1,161 1,306 1,612 1,797 1,986 3,627 4,501 6,824 7,173 7,895 12,106 14,300 29,226 30,795 57,999 97,744 93,266 148,452 OpenCL double prec. Duration Error [sec] [Å] 0.19 0.31 0.34 0.19 0.39 0.56 0.30 0.18 0.35 0.63 0.56 0.86 1.24 1.76 3.35 3.11 2.73 6.17 17.66 18.18 18.04 34.43 101.84 101.87 158.69 Traversal Width 4.54e-04 1.03e-06 9.28e-07 9.20e-07 8.51e-07 1.16e-06 1.02e-06 1.70e-06 1.29e-06 1.62e-06 1.79e-06 1.87e-06 1.84e-06 1.32e-03 1.99e-06 4.43e-06 2.41e-06 9.75e-04 9.82e-04 2.56e-01 4.06e-05 1.69e-04 1.78e-02 1.75e-01 5.16e+01 OpenCL single prec. Duration Error [sec] [Å] 171 136 173 235 243 288 296 353 314 325 258 168 923 210 305 495 1,387 818 581 1,732 1,965 1,839 1,328 2,495 27,815 0.09 0.11 0.13 0.10 0.15 0.22 0.11 0.06 0.12 0.35 0.26 0.39 0.61 0.76 1.37 1.38 1.28 2.77 8.83 8.75 9.74 16.98 53.71 51.95 75.53 Java reference Duration [sec] 4.41e-04 1.90e-04 8.48e-04 1.91e-03 4.86e-03 9.21e-03 5.82e-05 3.90e-04 1.49e-04 1.45e-02 5.75e-02 2.79e-04 1.61e-01 7.01e-03 1.90e-01 7.15e-03 3.43e+00 6.87e-01 4.14e-02 1.17e-01 1.25e+00 4.04e+00 5.20e+00 2.66e+01 2.31e+02 0.12 0.14 0.10 0.21 0.26 0.27 0.14 0.28 0.28 0.32 0.48 0.46 0.53 1.05 3.56 3.06 2.96 9.76 27.22 35.67 54.89 146.91 226.65 179.53 541.81 The local clustering coefficient shows how close the neighbors of a vertex are related to a clique (i.e., how well the build-up algorithm reconstructs the coordinates). maximum vertex error defined as s 1 X (kxi − xj k − δi,j )2 Emax = max i ni j Table II depicts the results of the build-up algorithm. We compare the execution times and errors of two OpenCL implementations (double and single precision) to a reference Java implementation (in double precision) from our previous work [25] in Fig. 5. The tests (Java and OpenCL) were run on an Intel Core i7-2600K CPU @ 3.4GHz, with 4 hyperthreading cores. We measure the execution time in seconds, the number of atoms successfully determined, the traversal width (the maximum number of vertices placed in parallel, and it measures the parallelism of the B FS algorithm), and the where the summation is taken over the ni neighbors of vertex i. Next, we concentrate only of the BFS-Buildup kernel. To test the scalability of our implementation, we note that the execution time can be expressed as Tn,p,h where n is the problem size (correlated with the number of atoms and known edges), p the number of threads (OpenCL work-items), and h the chosen hardware (multi-core CPU or GPU). We vary the problem size from molecules containing 1074 to 58,870 atoms and, independently, the maximum number of (8) 12 11 14 6 4 14 9 16 13 6 7 5 4 2 2 11 12 9 7 11 16 13 12 0 2 3 0 15 15 1 3 (a) Partial Solution 1 4 5 0 1 (b) Partial Solution 2 (c) Merged Fig. 4. Merging of partial solutions. The common atoms that serve as basis for the Kabsch algorithm are highlighted. Note that the first solution is rotated in 3D for alignment in the merged solution. Dashed lines link pairs of atoms with known distances, while solid lines show chemical bonds. 1426 TABLE I T HE MOLECULES USED IN THE EXPERIMENTS , CUTOFF DISTANCE 5 Å. Name PDB ID # of Atoms Min deg Avg deg # known distances Density (%) Clustering coeff, c 1RDG 1HOE 1LFB 1PHT 1POA 1AX8 2KXD 1VMP 1HAA 1F39 1GPV 1RGS 1BPM 2G33 3R1C 2F8V 2XTL 3FXI 1HMV 2VZ8 2VZ9 1AON 3OAA 3K1Q 1VRI 516 581 641 988 1,067 1,074 1,142 1,166 1,310 1,653 1,842 2,059 3,673 4,658 6,865 7,409 7,974 12,500 29,596 30,281 31,949 58,870 99,573 101,798 150,720 1 3 5 3 2 2 9 1 8 4 7 3 3 4 3 1 4 3 3 3 3 3 3 3 3 22.51 22.72 21.76 27.77 23.48 23.28 39.29 41.20 40.92 23.17 25.91 23.04 24.40 22.67 27.30 22.29 25.03 22.95 23.37 23.14 23.14 24.18 23.54 24.40 25.32 5,808 6,600 6,974 13,720 12,525 12,502 22,436 24,020 26,801 19,154 23,863 23,720 44,818 52,807 93,699 82,587 99,805 143,426 345,757 350,358 369,726 711,737 1,171,764 1,241,974 1,907,977 4.37 3.92 3.40 2.81 2.20 2.17 3.44 3.54 3.13 1.40 1.41 1.12 0.66 0.49 0.40 0.30 0.31 0.18 0.08 0.08 0.07 0.04 0.02 0.02 0.02 0.578 0.594 0.619 0.597 0.574 0.591 0.608 0.614 0.603 0.575 0.609 0.589 0.567 0.617 0.605 0.578 0.553 0.564 0.586 0.570 0.568 0.585 0.566 0.580 0.569 this to the overhead associated to launching OpenCL kernels. For the small problem size (Fig.7), the speedup saturates quicker, due to Amdahl’s law: at 64 threads and above the speedup remains at about 16 times on GPU, even if there are more computing resources available. Also, the graph reveals that the GPU doesn’t offer any speedup over the CPU in this case. Fig. 5. 64 parallel launched threads from p=1 to p=4096. We run each set of tests both on the Intel Core i7 CPU and the Nvidia GeForce GTX 560Ti GPU. For the given problem size and a hardware architecture, we measure the speedup of the algorithm as Sn,p,h GPU (T1 / Tp) CPU (T1 / Tp) TCPU / TGPU 32 16 Speedup Tn,1,h = Tn,p,h Execution times for 3 different implementations. (9) 8 4 2 1 where Tn,1,h is the execution time of the OpenCL kernel on a single thread. The OpenCL framework allows launching a much larger number of work-items than the physical execution units found on the device, in this case some of the threads are executed sequentially. We launch p threads then wait for completion and repeat the operation until all items in the input queue of algorithm BFS-Buildup are processed: 0.5 0.25 1 4 16 64 256 1024 4096 Number of threads (p) Fig. 6. atoms) Speedup of the BFS kernel, in case of a large molecule (≈ 50,000 while(input queue not empty) clEnqueueNDRangeKernel(global_work_size=p) clFinish() 64 GPU (T1 / Tp) CPU (T1 / Tp) TCPU/TGPU 32 16 Speedup The results for the large problem size are plotted in Fig. 6. We also compare the execution times on the two devices; on the graph we notice that the OpenCL kernel running on the GPU starts to outperform the kernel running on multi-core CPU (TCP U /TGP U > 1) starting from about p > 128 threads. The figure also reveals that the speedup saturates starting at a given number of threads related to the physical number of cores on the device: the GTX 560Ti GPU has 8 Streaming Multiprocessors (each containing 48 scalar processors), while the tested Core i7 processor has 4 hyper-threading cores (each core is capable of running 2 threads). We notice a small increase of speedup even when the number of launched threads exceeds the number of physical execution units. We attribute 8 4 2 1 0.5 0.25 1 4 16 64 256 1024 4096 Number of threads (p) Fig. 7. atoms) 1427 Speedup of the BFS kernel, in case of a smaller molecule (≈ 1000 VI. C ONCLUSIONS We have presented a novel parallel solution for the molecular distance problem. From experiments on parallel systems (multi-core CPU and GPU), we conclude that our implementation show moderate scalability. For very large molecules and datasets, these improvements may become critical. Future improvements of the algorithms will include optimization of the linear algebra OpenCL routines and parallel computation of different partial solutions. We also plan to improve scalability. R EFERENCES [1] T. Kamada and S. Kawai, “An algorithm for drawing general undirected graphs,” Inf. Process. Lett., vol. 31, pp. 7–15, April 1989. [2] U. Brandes and C. Pich, “Eigensolver methods for progressive multidimensional scaling of large data,” in Graph Drawing, ser. Lecture Notes in Computer Science, M. Kaufmann and D. Wagner, Eds. Springer Berlin / Heidelberg, 2007, vol. 4372, pp. 42–53. [3] R. Davidson and D. Harel, “Drawing graphs nicely using simulated annealing,” ACM Trans. Graph., vol. 15, pp. 301–331, October 1996. [4] W. Torgerson, “Multidimensional scaling: I. Theory and method,” Psychometrika, vol. 17, no. 4, pp. 401–419, Dec. 1952. [5] I. Borg and P. Groenen, Modern Multidimensional Scaling: Theory and Applications. Springer, 2005. [6] G. Seber, Multivariate observations, ser. Wiley series in probability and statistics. Wiley-Interscience, 2004. [7] T. F. Havel, “Distance geometry: Theory, algorithms and chemical applications,” in Encyclopedia of Computational Chemistry. Wiley, 1998, pp. 723–742. [8] W. Glunt, T. L. Hayden, and M. Raydan, “Molecular conformations from distance matrices,” J. Comput. Chem., vol. 14, pp. 114–120, January 1993. [9] C. Lavor, L. Liberti, and N. Maculan, “Molecular distance geometry problem,” in Encyclopedia of Optimization, C. A. Floudas and P. M. Pardalos, Eds. Springer US, 2009, pp. 2304–2311. [10] L. M. Blumenthal, Theory and applications of distance geometry. Bronx, NY: Chelsea, 1970. [11] T. Havel, I. Kuntz, and G. Crippen, “The theory and practice of distance geometry,” Bulletin of Mathematical Biology, vol. 45, pp. 665–720, 1983. [12] M. Pharr and R. Fernando, GPU Gems 2: programming techniques for high-performance graphics and general-purpose computation. Addison-Wesley Professional, 2005. [Online]. Available: http://http. developer.nvidia.com/GPUGems2/gpugems2 chapter43.html [13] J. de Leeuw, “Applications of convex analysis to multidimensional scaling,” in Recent Developments in Statistics, J. Barra, F. Brodeau, G. Romier, and B. V. Cutsem, Eds. Amsterdam: North Holland Publishing Company, 1977, pp. 133–146. [14] E. R. Gansner, Y. Koren, and S. North, “Graph drawing by stress majorization,” in Graph Drawing. Springer, 2004, pp. 239–250. [15] J. J. Moré and Z. Wu, “Distance geometry optimization for protein structures,” J. of Global Optimization, vol. 15, pp. 219–234, October 1999. [16] A. Grosso, M. Locatelli, and F. Schoen, “Solving molecular distance geometry problems by global optimization algorithms,” Computational Optimization and Applications, vol. 43, pp. 23–37, 2009. [17] M. Nilges, A. M. Gronenborn, A. T. Brunger, and M. G. Clore, “Determination of three-dimensional structures of proteins by simulated annealing with interproton distance restraints. application to crambin, potato carboxypeptidase inhibitor and barley serine proteinase inhibitor 2,” Protein Eng., vol. 2, no. 1, pp. 27–38, Apr. 1988. [18] A. H. C. Kampen, L. M. C. Buydens, C. B. Lucasius, and M. J. J. Blommers, “Optimisation of metric matrix embedding by genetic algorithms,” Journal of Biomolecular NMR, vol. 7, pp. 214–224, 1996. [19] R. Leardi, Nature-inspired methods in chemometrics: genetic algorithms and artificial neural networks, ser. Data handling in science and technology. Elsevier, 2003. [20] Q. Dong and Z. Wu, “A linear-time algorithm for solving the molecular distance geometry problem with exact inter-atomic distances,” Journal of Global Optimization, vol. 22, pp. 365–375, 2002. [21] ——, “A geometric build-up algorithm for solving the molecular distance geometry problem with sparse distance data,” J. of Global Optimization, vol. 26, pp. 321–333, July 2003. [22] D. Wu and Z. Wu, “An updated geometric build-up algorithm for solving the molecular distance geometry problems with sparse distance data,” J. of Global Optimization, vol. 37, pp. 661–673, April 2007. [23] R. Davis, C. Ernst, and D. Wu, “Protein structure determination via an efficient geometric build-up algorithm,” BMC Structural Biology, vol. 10, no. Suppl 1, p. S7, 2010. [24] A. Sit, Z. Wu, and Y. Yuan, “A geometric buildup algorithm for the solution of the distance geometry problem using Least-Squares approximation,” Bulletin of Mathematical Biology, 2007. [25] L. Fabry-Asztalos, I. Lőrentz, and R. Andonie, “Molecular Distance Geometry Optimization using Geometric Build-up and Evolutionary Techniques on GPU,” in Proc. of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology CIBCB, 2012, to be published. [26] V. D. Silva and J. B. Tenenbaum, “Sparse multidimensional scaling using landmark points,” Stanford, Tech. Rep. 6, 2004. [27] C. Lavor, L. Liberti, N. Maculan, and A. Mucherino, “Recent advances on the discretizable molecular distance geometry problem,” European Journal of Operational Research, 2011. [28] A. Mucherino, C. Lavor, L. Liberti, and E.-G. Talbi, “A parallel version of the branch and prune algorithm for the molecular distance geometry problem,” in Proceedings of the ACS/IEEE International Conference on Computer Systems and Applications, ser. AICCSA. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–6. [29] L. Liberti, C. Lavor, B. Masson, and A. Mucherino, “Polynomial cases of the discretizable molecular distance geometry problem,” CoRR, vol. abs/1103.1264, 2011, informal publication. [30] G. M. Crippen, “Linearized embedding: a new metric matrix algorithm for calculating molecular conformations subject to geometric constraints,” J. Comput. Chem., vol. 10, pp. 896–902, October 1989. [31] B. A. Hendrickson, “The molecular problem: Determining conformation from pairwise distances,” Cornell University, Ithaca, NY, USA, Tech. Rep., 1990. [32] Khronos OpenCL Working Group, The OpenCL Specification, version 1.1, 2010. [Online]. Available: http://www.khronos.org/registry/cl/specs/ opencl-1.1.pdf [33] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness (Series of Books in the Mathematical Sciences). W. H. Freeman, 1979. [34] D. Knuth. (2008) MAXCLIQUES algorithm for listing all maximal cliques. [Online]. Available: http://www-cs-staff.stanford.edu/∼uno/ programs.html [35] N. Jardine and R. Sibson, Mathematical taxonomy, ser. Wiley series in probability and mathematical statistics. Probability and mathematical statistics. Wiley, 1971. [36] S. Hong, T. Oguntebi, and K. Olukotun, “Efficient Parallel Graph Exploration on Multi-Core CPU and GPU,” in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2011. [37] D. Merrill, M. Garland, and A. Grimshaw, “High Performance and Scalable GPU Graph Traversal,” Department of Computer Science, University of Virginia, Tech. Rep., 2011. [38] W. Kabsch, “A solution for the best rotation to relate two sets of vectors.” Acta Crystallographica, vol. 32, pp. 922–923, 1976. [39] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The Protein Data Bank,” Nucleic Acids Research, vol. 28, pp. 235–242, 2000. [Online]. Available: http://www.rcsb.org/pdb/ [40] K. Wüthrich, “Protein structure determination in solution by NMR spectroscopy.” Journal of Biological Chemistry, vol. 265, no. 36, pp. 22 059–22 062, dec 1990. [41] R. D. Luce and A. D. Perry, “A method of matrix analysis of group structure,” Psychometrika, 1949. 1428