GPU Accelerated RNA Folding Algorithm
Guillaume Rizk (1) and Dominique Lavenier (2)

(1) Univ-Rennes 1/IRISA
(2) ENS-Cachan/IRISA
IRISA - Symbiose, Campus universitaire de Beaulieu,
35042 Rennes Cedex, France
{guillaume.rizk,dominique.lavenier}@irisa.fr
Abstract. Many bioinformatics studies require the analysis of RNA or DNA structures. More specifically, extensive work is done to elaborate efficient algorithms able to predict the 2-D folding structures of RNA or DNA sequences. However, the high computational complexity of these algorithms, combined with the rapid increase of genomic data, triggers the need for faster methods. Current approaches focus on parallelizing these algorithms on multiprocessor systems or on clusters, yielding good performance but at a relatively high cost. Here, we explore the use of computer graphics hardware to speed up these algorithms, which, theoretically, provides both high performance and low cost. We use the CUDA programming language to harness the power of NVIDIA graphics cards for general computation in a C-like environment. Performance on recent graphics cards reaches a ×17 speed-up.
Keywords: GPGPU, RNA, secondary structure, minimum free energy.
1 Introduction
The computation of the secondary structure folding of RNA or single-stranded DNA is a key element in many bioinformatics studies and, as such, has been extensively studied for many years. The first to propose algorithms to predict the folding structure of RNA or DNA sequences were Waterman and Smith, and Nussinov et al. [1,2]. These algorithms were based on dynamic programming with a complexity of O(n³), yet their approaches had several issues.
Following this pioneering work, several improvements have been made, leading to different kinds of dynamic programming algorithms. We can cite: (1) the computation of the most stable structure through energy minimization, running in O(n³), introduced by Zuker and Stiegler [3], which outputs a single optimal structure and its corresponding energy; (2) the computation of a partition function over all possible structures for deriving additional properties of the thermodynamic ensemble, such as the base pairing probabilities of any base pair, introduced by McCaskill [4]; (3) the computation of suboptimal structures [5], which generates all structures within a given energy range of the optimal one. Implementations of these algorithms are found in two major packages, ViennaRNA and Unafold [6,7].
Despite huge efforts to reduce the computational complexity of these algorithms, execution times are steadily increasing due to the fast growth of genomic databases and, in recent years, the relative stagnation of microprocessor frequencies. One solution is to use multi-core systems or clusters, which can yield good performance but at a high cost. Another approach is the use of computer graphics hardware, which potentially exhibits a higher performance/cost ratio than clusters.

Indeed, the raw power of graphics processing units (GPUs) increases at a faster rate than that of traditional microprocessors. Moreover, recent improvements in the programmability of GPUs have opened the way to new applications for which GPUs were not initially designed. General purpose computation on GPU (GPGPU) is now a field of research investigated in many domains requiring high performance. Among many others, successful applications include bioinformatics with Smith-Waterman sequence alignment [8,9].
In this paper, we investigate how GPUs can be used to accelerate the computation of the minimum free energy of RNA or DNA sequence folding. We use the implementation of the Unafold package given in the function hybrid-ss-min [7]. This function is used intensively by different programs of the Unafold package and represents the most time-consuming part. We show that adding a graphics board can speed up the whole program by a factor of ×17 compared to a sequential execution on a single microprocessor core.
Although the RNA folding algorithm studied here uses dynamic programming just like the Smith-Waterman algorithm, the two should not be confused. The algorithms are very different; thus, previous GPU implementations of the Smith-Waterman algorithm [8,9] did not prefigure the feasibility of an efficient GPU implementation here. On the contrary, its complexity in terms of memory access patterns and parallelization issues makes it a real challenge.
The rest of the paper is organized as follows. In Section 2, we introduce the folding algorithm. In Section 3, the GPU implementation of the folding algorithm is explained. Finally, Section 4 gives the performance results obtained on different platforms.
2 Folding Algorithm
This section briefly presents the principles of the folding algorithm as implemented in the Unafold package in the function hybrid-ss-min [7].
2.1 RNA Structure
RNA, or ribonucleic acid, is a chain of nucleotide units. There are four different nucleotides, also called bases: adenine (A), cytosine (C), guanine (G) and uracil (U). Two nucleotides can form a bond, thus forming a base pair, according to the Watson-Crick complementarity: A with U, G with C; but also the less stable combination G with U, called a wobble base pair. All the base pairs of a sequence force the nucleotide chain to fold into a system of different recognizable domains
like hairpin loops, bulges, interior loops or stacked regions. This is called the secondary structure of the sequence. The different loop types are introduced in Fig. 1. The secondary structure can also form complex patterns like pseudoknots, which consist of two base pairs i·j and k·l that cross, i.e. i < k < j < l, violating the nesting property. The secondary structure is often decisive for the functional role of the RNA molecule.
2.2 Energy Model
The algorithm is designed to find the most stable structure of an RNA sequence. It is used in many bioinformatics pipelines, such as the search for microRNAs, where the stability of the secondary structure is an important feature.

A secondary structure is described by a list of base pairs i·j where each base forms at most one pair. The algorithm is based on a decomposition of the secondary structure into its constituent loops. Each loop is associated with an experimentally measured energy according to its sequence, length and type. The stability (free energy) of a structure is the sum of the energies of all its loops.

In the dot-bracket representation given in Fig. 1, an unpaired base is depicted by a dot, and a pair by a matching pair of parentheses. In the model used, matching pairs of parentheses have to be well nested, i.e. there are no pseudoknots. This restriction is a requirement to allow a relatively fast dynamic programming approach such as the one developed by Zuker and Stiegler. Indeed, it ensures that the secondary structure of each subsequence i, j can be computed independently from the rest of the sequence, a required feature for dynamic programming.
2.3 Algorithm
The dynamic programming algorithm uses three tables: Q′i,j is the minimum energy of folding of a subsequence i, j given that bases i and j form a base pair; Qi,j and QMi,j are the minimum energies of folding of the subsequence i, j assuming that this subsequence is inside a multiloop and that it contains at least one and at least two base pairs, respectively. A simplified model of the recursion relations can be written as:
$$
Q'_{i,j} =
\begin{cases}
\min\left\{
\begin{array}{l}
Eh(i,j) \\
Es(i,j) + Q'_{i+1,j-1} \\
\min\limits_{(k,l)\,\in\,]i;j[^2} \left( Ei(i,j,k,l) + Q'_{k,l} \right) \\
QM_{i+1,j-1}
\end{array}
\right\} & \text{if pair } i \cdot j \text{ is allowed} \\[2ex]
\infty & \text{if pair } i \cdot j \text{ is not allowed}
\end{cases}
\tag{1}
$$

$$
QM_{i,j} = \min_{i<k<j} \left( Q_{i,k} + Q_{k+1,j} \right)
\tag{2}
$$

$$
Q_{i,j} = \min\left( QM_{i,j},\; \min(Q_{i+1,j},\, Q_{i,j-1}),\; Q'_{i,j} \right)
\tag{3}
$$

Eh(i, j), Ei(i, j, k, l) and Es(i, j) are respectively the energies of:
[Figure: RNA secondary structure with loop regions labeled 1-6 for the sequence below, shown with its dot-bracket representation]
AAAAAAGGGAAAAGAACAAAGGAGACUCUUCUCCUUUUUCAAAGGAAGAGGAGACUCUUUCAAAAAUCCCUCUUUU
((((.(((((...(((.((((((((....)))))))))))...(((((((....))))))).....))))).)))) (-24.5)
Fig. 1. Secondary structure. The secondary structure begins at 1 with stacked base pairs (two closing base pairs with both sides of the loop of length zero). 2 is an interior loop (two closing base pairs with both sides non null). 3 shows a multiloop (several closing base pairs). 4 is a bulge loop (two closing base pairs with one loop side of length zero and the other greater than zero). 5 and 6 are hairpin loops (one closing base pair). The structure can also be written in a dot-bracket representation where an unpaired base is a dot and a base pair is a matching pair of parentheses. The free energy of the structure (−24.5) is the sum of the energies of its constituent loops.
– Eh(i, j): a hairpin loop closed by the pair i · j.
– Ei(i, j, k, l): an interior loop formed by the two base pairs i · j, k · l.
– Es(i, j): two stacked base pairs i · j and (i + 1) · (j − 1).
These functions compute energies through the use of lookup tables containing
energy parameters according to the size and sequence of the loop.
Ej being the minimum free energy of subsequence 1 . . . j, the minimum free energy En of the whole sequence is then obtained through the recursion:

$$
E_j = \min\left( E_{j-1},\; \min_{1<k<j} \left( E_{k-1} + Q'_{k,j} \right) \right)
\tag{4}
$$

Dynamic programming using this recursion computes the minimum free energy of a sequence of length n in O(n² · L² + n³) by restricting the size of interior loops to L. The corresponding secondary structure is then obtained by a trace-back procedure.
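As a point of reference for the GPU version discussed in Section 3, recursions (1) to (3) translate into a straightforward fill of the upper triangle, diagonal by diagonal. The following C sketch is a simplified model only: eh, es, ei and can_pair are hypothetical stubs standing in for Unafold's lookup-table-based energy functions, and Qp denotes Q′.

/* Minimal sequential sketch of recursions (1)-(3); not Unafold's code.
   eh/es/ei and can_pair are hypothetical stubs. */
#include <limits.h>

#define INF  (INT_MAX / 4)            /* large but overflow-safe "infinity" */
#define LMAX 30                       /* interior-loop size bound L         */

static int min2(int a, int b) { return a < b ? a : b; }

int eh(int i, int j);                      /* hairpin closed by i.j          */
int es(int i, int j);                      /* stacked pairs i.j, (i+1).(j-1) */
int ei(int i, int j, int k, int l);        /* interior loop i.j / k.l        */
int can_pair(const char *s, int i, int j); /* A-U, G-C or G-U wobble         */

/* Qp, QM, Q are n*n arrays assumed pre-initialized to INF by the caller. */
void fill(const char *s, int n, int *Qp, int *QM, int *Q)
{
    for (int d = 1; d < n; d++)           /* diagonal = subsequence length   */
        for (int i = 0; i + d < n; i++) {
            int j = i + d, v = INF, m = INF;
            if (can_pair(s, i, j)) {      /* recursion (1): Q'               */
                v = min2(eh(i, j), es(i, j) + Qp[(i+1)*n + (j-1)]);
                /* interior loops of bounded size (the stack case k=i+1,
                   l=j-1 is already covered by es above) */
                for (int k = i + 1; k < j && k - i - 1 <= LMAX; k++)
                    for (int l = j - 1; l > k && (k-i-1) + (j-l-1) <= LMAX; l--)
                        v = min2(v, ei(i, j, k, l) + Qp[k*n + l]);
                v = min2(v, QM[(i+1)*n + (j-1)]);   /* multiloop closing     */
            }
            Qp[i*n + j] = v;
            for (int k = i + 1; k < j; k++)         /* recursion (2): QM     */
                m = min2(m, Q[i*n + k] + Q[(k+1)*n + j]);
            QM[i*n + j] = m;
            /* recursion (3): Q */
            Q[i*n + j] = min2(min2(QM[i*n + j], Qp[i*n + j]),
                              min2(Q[(i+1)*n + j], Q[i*n + (j-1)]));
        }
}

Recursion (4) is then a single O(n²) pass over the Q′ table, and is comparatively cheap.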
3 GPU Implementation

3.1 Architecture and Programming
GPUs are massively parallel architectures providing cheap high-performance computing. In this work we chose CUDA, as it combines high performance with
the ease of use of a C-like environment [10]. The latest NVIDIA GPU, the GT200, is divided into 30 multiprocessors, each being a SIMD unit of 8 32-bit processors. A GPU procedure is a kernel called on a set of threads, divided into a grid of blocks, each block running on a single multiprocessor. Furthermore, blocks are divided into warps of 32 threads that must execute the same instruction simultaneously. Thus, branching (if-then-else control flow instructions) does not impact performance as long as all threads within a warp take the same code path. Moreover, only threads within a block can be synchronized and can share the fast on-chip shared memory. One key difference with a traditional CPU implementation is that the programmer has to explicitly handle several memory spaces of different performance, size, scope and lifetime: global, texture, constant and shared memory, as well as registers.
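As a concrete illustration of this programming model (not code from the implementation), the following minimal CUDA example shows a kernel, its thread indexing, and a launch configuration; the block size of 256 threads, i.e. 8 warps, is illustrative.

/* Minimal CUDA example of the model described above: a kernel run by
   a grid of blocks, each block a group of threads. Illustrative only. */
#include <cuda_runtime.h>

__global__ void add_one(int *data, int n)
{
    /* global thread index: block offset + position inside the block */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)              /* uniform branch: whole warps exit together */
        data[idx] += 1;
}

int main(void)
{
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));
    add_one<<<(n + 255) / 256, 256>>>(d_data, n);  /* 256 = 8 warps/block */
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}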
3.2 Parallelization Scheme
Algorithm 1 shows the main loops of the computation along with the several ways to expose parallelism. We chose a mixed approach: we compute the minimum free energy of folding of several sequences in parallel, each one being itself parallelized. According to this parallel scheme, we can provide the GPU with many independent tasks together with a low memory consumption. The number of sequences computed simultaneously is adapted to their length: one large sequence can provide enough independent tasks to the GPU, whereas small ones have to be computed in groups. We also implemented a multi-GPU algorithm by dividing work among GPUs at the coarse-grained level: each GPU computes a different group of sequences.
Fig. 2. Left: Data dependency relationship. Each cell of the matrix contains the three values Q′, QM and Q. As subsequence i, j is the same as subsequence j, i, only the upper half of the matrix is needed. The computation of cell i, j needs the lower left dashed triangle and the two vertical and horizontal dotted lines. Right: Parallelization. According to the data dependencies, all cells along a diagonal can be computed in parallel from all previous diagonals.
Figure 2 shows the data dependencies arising from recursions (1) to (3). They imply that, given all previous diagonals, all cells of a diagonal can be processed independently. Three kernels are designed for the computation of Q′i,j, QMi,j and Qi,j, according to equations (1) to (3). Each one computes one diagonal of several sequences. The whole matrix is then processed sequentially through a loop over all diagonals. The next step, corresponding to equation (4), is a combination of reductions (searches for the minimum of an array) which is parallelized in another kernel. The final step, the traceback procedure computing the secondary structure, is currently left on the CPU as its execution time is far lower.
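A host-side sketch of this organization could look as follows. The kernel names and parameter lists are ours and the bodies are elided; the point is the sequencing: kernels launched on the same stream execute in order, so within a diagonal Q′ and QM are complete before Q, as equation (3) requires.

/* Sketch of the host-side orchestration described above. Kernel names,
   signatures and bodies are ours, not the implementation's. */
#include <cuda_runtime.h>

__global__ void kernel_Qp(int *Qp, int *QM, int n, int d) {}        /* eq. (1) */
__global__ void kernel_QM(int *Q, int *QM, int n, int d) {}         /* eq. (2) */
__global__ void kernel_Q(int *Q, int *Qp, int *QM, int n, int d) {} /* eq. (3) */
__global__ void kernel_E(int *Qp, int *E, int n) {}     /* eq. (4): reductions */

void fold_batch(int *dQ, int *dQp, int *dQM, int *dE, int n, int num_seq)
{
    const int T = 128;                            /* threads per block        */
    for (int d = 1; d < n; d++) {                 /* diagonals run in order   */
        dim3 grid(num_seq, (n - d + T - 1) / T);  /* all sequences at once    */
        kernel_Qp<<<grid, T>>>(dQp, dQM, n, d);
        kernel_QM<<<grid, T>>>(dQ, dQM, n, d);
        kernel_Q <<<grid, T>>>(dQ, dQp, dQM, n, d);
    }
    kernel_E<<<num_seq, T>>>(dQp, dE, n);         /* final reduction kernel   */
    cudaDeviceSynchronize();                      /* traceback runs on CPU    */
}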
3.3 Optimization Key Points
Memory accesses are the bottleneck of the implementation. Here, the data are divided into three groups: the base sequence, the three tables Q′, QM, Q, and the energy parameters needed for the computation of loop energies. Maximum performance is obtained when available memory resources are used to their fullest and when the best match between the different memory spaces and types of data is found. Here, texture memory is used for the sequence and for parts of the tables, which both show some spatial locality in their access patterns; for instance, equation (2) shows that the computation of one cell QMi,j accesses all elements of a line and a column of matrix Q. For energy parameters, the best choice is constant memory. However, its small size compels us to also employ global memory for the least used ones. Lastly, shared memory is kept for storage of intermediate results in the computation.
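The following declarations illustrate, under our own naming and sizing assumptions, how such a mapping is typically expressed with the texture-reference API of CUDA at the time (since deprecated); this is a sketch, not the implementation's code.

/* Illustrative mapping of data to memory spaces; names and sizes are ours. */
#include <cuda_runtime.h>

__constant__ int c_params[2048];        /* most-used energy parameters       */
__device__   int g_params_rare[65536];  /* least-used ones: global memory    */
texture<int, 1> t_Q;                    /* table reads with spatial locality */

void bind_tables(const int *h_params, int *d_Q, size_t q_bytes)
{
    /* parameters copied once into the small, cached constant space */
    cudaMemcpyToSymbol(c_params, h_params, sizeof(c_params));
    /* table Q bound to a 1-D texture: device code reads it with
       tex1Dfetch(t_Q, idx) and benefits from the texture cache */
    cudaBindTexture(NULL, t_Q, d_Q, q_bytes);
}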
Another important issue of the implementation comes from equation (1), which shows that the computation of table Q′ is not the same for all cells: if the pair i·j is forbidden, then cell Q′i,j is set to ∞. This hurts the SIMD model of the GPU which, as stated in Section 3.1, requires all threads of a warp to execute the same instruction path for full performance. To solve this issue, our implementation computes on the CPU an index of all the cell positions whose base pairs are allowed, which is then handed to the GPU. This increases the amount of data transferred between the CPU and the GPU but decreases branching in GPU kernels. Moreover, the CPU computation can be overlapped with GPU computation, allowing us to make better use of all available resources.
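A minimal sketch of this CPU-side preprocessing, assuming the same hypothetical can_pair stub as in the earlier sketch: for one diagonal d, the allowed cells are compacted into an index so that GPU threads map one-to-one onto valid cells and warps stay on a single code path.

/* Sketch of the allowed-pair index described above; can_pair is a stub. */
int can_pair(const char *s, int i, int j);

int build_pair_index(const char *s, int n, int d, int *idx)
{
    int count = 0;
    for (int i = 0; i + d < n; i++)
        if (can_pair(s, i, i + d))
            idx[count++] = i;       /* compact list of allowed cells       */
    return count;                   /* launch only `count` threads for Q'  */
}

The index for diagonal d+1 can be built on the CPU while the GPU processes diagonal d, which realizes the CPU/GPU overlap mentioned above.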
We found that for maximum efficiency the parallelization has to be carried down to the finest grain achievable, to ensure the GPU reaches its maximum potential while using as little memory as possible. Different levels of parallelization are exploited: parallelization across several sequences, across several cells of a diagonal, and across the tasks required for the computation of a single cell itself: the search for a minimum is parallelized over several threads of a same block, sharing intermediate results through shared memory.
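A sketch of this fine-grained level for recursion (2), assuming a power-of-two block size: each block handles one cell (i, j) of one sequence, threads scan the candidate split points k in strides, and a tree reduction in shared memory produces the cell's minimum.

/* Sketch of the fine-grained minimization of recursion (2); ours, not the
   implementation's kernel. One block per cell of diagonal d. */
#include <limits.h>

__global__ void qm_cell(const int *Q, int *QM, int n, int d)
{
    extern __shared__ int part[];        /* one partial minimum per thread   */
    int i = blockIdx.x, j = i + d, t = threadIdx.x;

    int best = INT_MAX / 4;
    for (int k = i + 1 + t; k < j; k += blockDim.x)   /* strided scan of k   */
        best = min(best, Q[i * n + k] + Q[(k + 1) * n + j]);
    part[t] = best;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    /* tree reduction      */
        if (t < s) part[t] = min(part[t], part[t + s]);
        __syncthreads();
    }
    if (t == 0) QM[i * n + j] = part[0];              /* cell result         */
}
/* launch: qm_cell<<<n - d, T, T * sizeof(int)>>>(dQ, dQM, n, d); */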
Algorithm 1. Main function and parallelizable loops
 1: Input: N sequences of length L
 2: Output: minimal energy of the N sequences
 3: Coarse-grained level: parallelization over multiple sequences
 4: for sequence s in [1; N] do
 5:   for diagonal d in [1; L] do
 6:     Medium-grained level: parallelization over multiple cells of a diagonal
 7:     for i in [1; L − d] do
 8:       j ← i + d
 9:       Fine-grained level: parallelization over the minimization computation
10:       compute Q′(i, j, s), QM(i, j, s), Q(i, j, s)
11:     end for
12:   end for
13:   compute EL(s)
14: end for

4 Results
GPU and CPU implementations are compared on different graphics cards and processors. The main testing platform is an octo-core Xeon E5430 2.66 GHz (4 × 6 MB L2 cache) with 8 GB RAM and two NVIDIA Tesla C870 cards, each having 16 multiprocessors. We also test older processors, a Pentium 4 3 GHz (1 MB L2 cache) and a Core2 6700 2.66 GHz (4 MB L2 cache), as well as the latest high-end graphics card, the NVIDIA GTX280 with 30 multiprocessors.
4.1 Analysis on 120-Base-Long Sequences
Problem specifications. A typical use of the algorithm is the computation of the secondary structures of many small RNA sequences. The search for microRNAs in a whole genome requires, for example, knowing the secondary structure of millions of sequences of length approximately 120 [11]. Therefore we first test the algorithm on sequences of this length, here with 40000 randomly generated sequences.
Figure 3 reports running times in seconds and the corresponding speed-ups achieved by different combinations of cards versus one or eight CPU cores. Our CPU multi-core implementation works at the coarse-grained level by parallelizing the work over multiple sequences, corresponding to line 4 of Algorithm 1.
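Such a coarse-grained CPU parallelization can be sketched as below; OpenMP and the fold_one helper are our assumptions, as the paper does not name the threading API used.

/* Sketch of coarse-grained CPU parallelization (line 4 of Algorithm 1).
   OpenMP and fold_one are assumptions, not the implementation's code. */
#include <omp.h>

double fold_one(const char *seq, int len);      /* hypothetical: full DP   */

void fold_all_cpu(char **seqs, int num_seq, int len, double *energy)
{
    #pragma omp parallel for schedule(dynamic)  /* one sequence per task   */
    for (int s = 0; s < num_seq; s++)
        energy[s] = fold_one(seqs[s], len);     /* sequences independent   */
}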
Results. We achieve a speed-up of about ×10 for one Tesla card versus one core of a Xeon. An interesting point is that although the algorithm was originally developed on the Tesla, it scales well with the latest graphics card. The GTX280 is 70% faster than the Tesla, with a speed-up of ×17 versus one core of a Xeon, which roughly corresponds to the increase in memory bandwidth between the two cards. With two Tesla cards the speed-up becomes ×19, and two GTX280 cards reach ×33.1, which shows that the processing power of the cards adds up well when used together.
Fig. 3. Left: Execution time. Time spent in seconds for the computation of the minimum free energy of 40000 randomly generated sequences of length 120, energy only (option -E of the hybrid-ss-min function). Processors are: P4, a Pentium 4 3.0 GHz (1 MB cache); C2, one core of a Core2 2.66 GHz (4 MB cache); Xeon and Xeon*8, respectively one and eight cores of a Xeon 2.66 GHz (6 MB cache). Graphics cards are NVIDIA Tesla C870, GTX280, bi-Tesla C870, and bi-GTX280. Right: Corresponding speed-up. Acceleration ratio of graphics cards versus the Xeon processor, in one-core or octo-core configuration.
Accuracy. Our GPU implementation uses exactly the same algorithms and thermodynamic rules as Unafold; thus the results and accuracy obtained on GPU are exactly the same as those of the standard CPU Unafold function.
Performance/cost analysis. When using a parallelized version of the algorithm on the eight CPU cores, speed-ups are much smaller (Figure 3), yet the performance/cost ratio is clearly in favor of the GPU implementation. Indeed, our results show that a system with two GTX280 cards, easily assembled for 2500 euros, would be roughly equivalent to four octo-core computers costing a total of more than 8000 euros.

As for the standard computers at everyone's disposal, the advantage of GPUs is also obvious: considering that most systems are now dual-core, adding a GTX280 would yield at least ×8 performance even if both CPU cores are used, at a cost of about 400 euros.
4.2 Analysis Across Varying Sequence Lengths
The algorithm is then tested on various sequence lengths. Speed-ups of the Tesla and GTX280 versus one Xeon processor core are shown in Figure 4. It should first be noted that the GTX 280 is always at least 50% faster than the Tesla, except for very long sequences, where it begins to lack memory (the Tesla has 1.5 GB whereas the GTX 280 has 1.0 GB). We see that performance is good for short
[Figure: speed-up (y-axis, 0 to 20) versus sequence length (x-axis, 10 to 5000, log scale)]
Fig. 4. Speed-up comparison. Speed-up of the Tesla C870 and GTX 280 graphics cards versus one core of a 2.66 GHz Xeon for randomly generated sequences of different lengths. The solid line is the Tesla C870, the dashed line the GTX 280.
sequences (the Tesla gets a ×10 speed-up); it then drops to a minimum for 1000-base-long sequences (the Tesla gets ×7) and rises again for very long sequences (×12 for the Tesla with sequences of length 9000). This comes from the fact that different portions of the code do not have the same computational complexity and GPU efficiency. With n the length of a sequence, the QM computation is in O(n³) whereas the Q′ computation is in O(n²). The efficiency of the O(n²) part decreases when n increases due to different memory access patterns, which explains the drop in performance. The O(n³) part of the algorithm is always very efficient on GPU but only becomes the preponderant part of the algorithm for long sequences, which explains the overall speed-up increase we observe for long sequences.
4.3 Comparison Against GTfold
A. Mathuriya et al. implemented a multicore CPU algorithm for RNA secondary structure prediction which uses what we call in Algorithm 1 the medium-grained level [12]. In their study they compute the folding of the HIV-1 sequence and a set of 11 Picornaviral sequences on a 32-core IBM P5-570 server. Table 1 compares their running times against our GPU implementation on one Tesla C870 card. It shows that an expensive 32-core server gets only ×1.6 the performance of a single GPU.
Table 1. Running times on the HIV-1 sequence (9781 nucleotides) and a set of 11 Picornaviral sequences (7124 to 8214 nucleotides); cf. [12] for sequence accession numbers.

                   GTfold 32-core IBM P5-570   GPU Tesla C870   Unafold 1-core Xeon
HIV-1              84 s                        133 s            1876 s
11 Picornavirus    480 s                       765 s            7902 s
5 Future Work
This work is a first step in parallelizing the RNA folding algorithm on GPU. It shows that GPUs can deliver significant speed-ups even on algorithms with complex memory access patterns.

However, although GPUs have recently become easier to use, an efficient GPU implementation remains a lengthy process. For years programmers have developed purely sequential algorithms, yet it appears that future systems will increasingly be highly parallel architectures. Thus, a future challenge will be to find ways to facilitate the implementation of algorithms for parallel execution: on multi-core chips using the MIMD paradigm, on GPUs using the SIMD paradigm, and, the trickiest task, on a combination of both.
References
1. Waterman, M., Smith, T.: RNA secondary structure: a complete mathematical
analysis. Math. Biosci. 42, 257–266 (1978)
2. Nussinov, R., Pieczenik, G., Griggs, J., Kleitman, D.: Algorithms for loop matchings. SIAM J. Appl. Math. 35(1), 68–82 (1978)
3. Zuker, M., Stiegler, P.: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9(1), 133–148
(1981)
4. McCaskill, J.: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29(6-7), 1105–1119 (1990)
5. Wuchty, S., Fontana, W., Hofacker, I.L., Schuster, P.: Complete suboptimal folding
of RNA and the stability of secondary structures. Biopolymers 49, 145–165 (1999)
6. Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, L.S., Tacker, M., Schuster,
P.: Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125,
167–188 (1994)
7. Markham, N., Zuker, M.: DINAMelt web server for nucleic acid melting prediction.
Nucleic Acids Research 33, W577–W581 (2005)
8. Liu, W., Schmidt, B., Voss, G., Schroeder, A., Muller-Wittig, W.: Bio-sequence database scanning on GPU. In: Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2006), HICOMB Workshop (2006)
9. Manavski, S., Valle, G.: CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics 9 (2008)
10. NVIDIA CUDA, http://developer.NVidia.com/object/cuda.html
11. Stark, A., Kheradpour, P., Parts, L., Brennecke, J., Hodges, E., Hannon, G.J.,
Kellis, M.: Systematic discovery and characterization of fly microRNAs using 12
Drosophila genomes. Genome Research 17(12), 1865 (2007)
12. Mathuriya, A., Bader, D., Heitsch, C., Harvey, S.: GTfold: A Scalable Multicore
Code for RNA Secondary Structure Prediction. Technical report, Georgia Institute
of Technology (2008)