GTfold: A Scalable Multicore Code for RNA Secondary
Structure Prediction
Amrita Mathuriya
College of Computing
Georgia Institute of Technology
David A. Bader
College of Computing
Georgia Institute of Technology
Christine E. Heitsch
School of Mathematics
Georgia Institute of Technology
Stephen C. Harvey
School of Biology
Georgia Institute of Technology
August 23, 2008
Abstract
The prediction of the correct secondary structures of large RNAs is one of the
unsolved challenges of computational molecular biology. Among the major obstacles
is the fact that accurate calculations scale as O(n^4), so the computational requirements become prohibitive as the length increases. Existing folding programs implement heuristics and approximations to overcome these limitations. We present a new
parallel multicore and scalable program called GTfold, which is one to two orders of
magnitude faster than the de facto standard programs and achieves comparable prediction accuracy. The development of GTfold opens up a new path for algorithmic improvements and for the application of improved thermodynamic models to increase the prediction accuracy.
In this paper we analyze the algorithm’s concurrency and describe the parallelism
for a shared memory environment such as a symmetric multiprocessor or multicore
chip. In a remarkable demonstration, GTfold now optimally folds 11 picornaviral RNA
sequences ranging from 7100 to 8200 nucleotides in 8 minutes, compared with the two
months it took in a previous study. We are seeing a paradigm shift to multicore chips
and parallelism must be explicitly addressed to continue gaining performance with each
new generation of systems. We also show that exact algorithms, such as the internal loop speedup, can be implemented with our method in an affordable amount of time. GTfold
is freely available as open source from our website.
1 Introduction
RNA molecules perform a variety of different biological functions, including the role of “small” RNAs (with tens to a few hundred nucleotides) in gene splicing, editing, and regulation.
At the other end of the size spectrum, the genomes of numerous viruses are lengthy single-stranded RNA sequences with many thousands of nucleotides. These single-stranded RNA sequences base pair to form molecular structures, and the secondary structure of viruses like dengue [3], Ebola [16], and HIV [17] is known to have functional significance. Thus, disrupting functionally significant base pairings in RNA viral genomes is one potential method for treating or preventing many RNA-related diseases.
Figure 1: The optimal secondary structure of an HIV-1 virus with 9,781 nucleotides, predicted by GTfold in 84 seconds using 16 dual-core CPUs. The minimum free energy of the structure is -2,879.20 Kcal/mole.
Viral sequences range in length from about 1,000 to over 1,000,000 nucleotides in the recently discovered virophage. The length of viral sequences poses significant computational challenges for current computer programs. Free energy minimization excluding pseudoknots is a conventional approach for predicting secondary structures. The mfold [20, 11] and
RNAfold [9] programs are the standard programs used by the molecular biology community
for the last several decades. Recently, other folding programs such as simfold [1] have been
developed. These programs predict structures with good accuracy for RNA molecules having fewer than 1,000 nucleotides. However, for longer RNA molecules, the prediction accuracy is very low.
According to the thermodynamic hypothesis, the structure having the minimum free
energy (MFE) is predicted as the secondary structure of the molecule. The free energy of
a secondary structure is the independent sum of the free energies of distinct substructures
called loops. The optimization is performed using the dynamic programming algorithm given
by Zuker and Stiegler in 1981 [21] which is similar to the algorithm for sequence alignment
but far more complex. The algorithm explores all the possibilities when computing the MFE
structure. Heuristics and approximations have been applied in the existing folding programs to satisfy these computational requirements.
One potential approach to improve the accuracy of the predicted secondary structures
is to implement advanced thermodynamic details and exact algorithms. However, while the
incorporation of these improvements can significantly increase the accuracy of the prediction,
it also drastically increases running time and storage needs for the execution. We use shared
memory parallelism to overcome the computational challenges of the problem.
We have designed and implemented a new parallel and scalable program called GTfold
for predicting secondary structures of RNA sequences. Our program runs one to two orders
of magnitude faster than the current sequential programs for viral sequences on an IBM P5 570 symmetric multiprocessor system with 16 dual-core CPUs, and achieves comparable prediction accuracy. We have parallelized the dynamic programming algorithm at a coarse grain and the individual functions which calculate the free energy of various loops at a fine grain. We demonstrate that GTfold executes exact algorithms in an affordable amount
of time for large RNA sequences. Our implementation includes an exact and optimized
algorithm in place of the usually adopted heuristic option for internal loop calculations, the
most significant part of the whole computation. GTfold takes just minutes (instead of 9
hours) to predict the structure of a Homo sapiens 23S ribosomal RNA sequence with 5,184
nucleotides. Development of GTfold opens up the path for applying essential improvements
in the prediction programs to increase the accuracy of the predicted structures.
The algorithm has complicated data dependencies among the elements of its five tables (one 1D and four 2D arrays). The energies of subsequences of equal length can be computed independently of each other without violating the dependency pattern introduced by the dynamic programming over this set of five tables. Our approach calculates the optimal energies of the equal-length subsequences in parallel, starting from the smallest subsequences and proceeding to the largest, and finally obtains the optimal free energy of the full sequence. We also describe the nature of the individual functions for calculating the energy of various loops and strategies for their parallelization.
2 Related Work
Several approaches exist for RNA secondary structure prediction. While this paper focuses on
free energy minimization, another approach [15] predicts secondary structures by computing
the structures of smaller subsequences and using them to rebuild the full structure. Though this latter approach runs fast, it is an approximation and can miss candidate structures that do
not follow the usual behavior. Also, the success of these kinds of approaches is dependent
upon the ability of rebuilding methods to identify motifs correctly by consistently combining
the substructures into a full structure.
Several distributed memory implementations of RNA secondary structure prediction have
been developed; however, they may not be portable to current parallel computers (e.g.,
[12, 5]) or do not store the tables necessary for finding suboptimal structures (e.g., see
[9]). For instance, Hofacker et al. [9] partition the triangular portion of 2D arrays into
equal sectors that are calculated by different processors in order to minimize the space
requirements and data is reorganized after computing each diagonal. Traceback for the
suboptimal secondary structures is not possible because it requires the filled tables. In [8],
the authors observe that to fold the HIV virus, memory of 1 to 2GB is required, dictating
the use of distributed memory supercomputers; yet in our work, we demonstrate that this can
now be solved efficiently on most personal computers. In our work, for the first time, we
give scientists the ability to solve very large folding problems on their desktop by leveraging
multicore computing.
Zhou and Lowenthal [19] also studied a parallel, out-of-core distributed memory algorithm
for the RNA secondary structure prediction problem including pseudoknots. However, their
approach does not implement the full structure prediction but rather studies a synthetic
data transformation that improves just one of the dependencies found in the full dynamic
programming algorithm.
3 RNA Secondary Structure
RNA molecules are made up of the nucleotides A, C, G, and U, which can pair up according to the rules in {(A,U), (U,A), (G,C), (C,G), (G,U), (U,G)}. Nested base pairings result in 2D structures called secondary structures. There are 3D interactions among the elements of the secondary structures which result in 3D structures called tertiary structures. Pairings among bases form various kinds of loops, which can be classified based on the number of branches present in them. The nearest neighbor thermodynamic model (NNTM) provides a set of functions and sequence-dependent parameters to calculate the energy of the various kinds of loops. The free energy of a secondary structure is calculated by adding up the energies of all loops and stacks present in the structure.
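As a concrete illustration of these pairing rules, the following C++ sketch tests whether two bases can form one of the six canonical pairs; the function name canPair is chosen here for illustration and is not taken from GTfold.

    #include <string>

    // Returns true if bases x and y (characters 'A', 'C', 'G', 'U') can form
    // one of the six canonical pairs {AU, UA, GC, CG, GU, UG}.
    bool canPair(char x, char y) {
        static const char* pairs[] = {"AU", "UA", "GC", "CG", "GU", "UG"};
        for (const char* p : pairs)
            if (p[0] == x && p[1] == y)
                return true;
        return false;
    }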
Figure 2 shows an MFE secondary structure predicted by GTfold for a sequence with 79 nucleotides. The various loops annotated in the figure are hairpin loops, internal loops, multiloops, stacks, bulges, and external loops. Loops formed by two consecutive base pairs are called stacks. Loops having one enclosed base pair and one closing base pair are called internal or interior loops. Internal loops in which one side has length zero are called bulges. Loops with two or more enclosed base pairs and one closing base pair are called multiloops or multibranched loops. The open loop which is not closed by any base pair is called an external or exterior loop.
[Figure 2 diagram: a 79-nucleotide structure with its hairpin loops, internal loop, multiloop, stack, and external loop annotated.]
Figure 2: A sample RNA secondary structure with 79 nucleotides.
4 Thermodynamic Prediction Algorithm
Prediction of secondary structures by free energy minimization is an optimization problem like the Smith-Waterman local alignment algorithm. There is a well-defined scoring function which can be optimized via dynamic programming, and structures achieving the optimum can be found through traceback. However, while sequence alignment can be performed with one table and a relatively simple processing order, RNA secondary structure prediction requires five tables with complex dependencies. Each class of loop has a different energy function which depends upon the sequence and the parameters. For internal loops and multiloops with one or more branches, all enclosed base pairs need to be searched to find the combination that makes the loop optimal for the closing base pair.
The algorithm can be defined with recursive minimization formulas. Simplified recursion
formulas are reproduced here from [10] for convenience. Pseudocode of our algorithm that
implements thermodynamically equipped recursion formulas is presented in Appendix A.
Consider an RNA sequence of length N, free energy W(N), and index values i and j which vary over the sequence such that 1 ≤ i < j ≤ N. The optimal free energy of a subsequence from 1 to j is given by the following formula:
W(j) = min{ W(j−1), min_{1≤i<j} { V(i, j) + W(i−1) } }    (1)
In Eq. (1), V(i, j) is the optimal energy of the subsequence from i to j, if it forms a base pair (i, j). It is defined by the following equation:

V(i, j) = min{ eH(i, j), eS(i, j) + V(i+1, j−1), VBI(i, j), VM(i, j) }    (2)
Eq. (2) considers loops that a base pair (i, j) can close. The eH(i, j) function returns the
energy of a hairpin loop closed by base pair (i, j). Function eS(i, j) returns the energy of a
stack formed by base pairs (i, j) and (i + 1, j − 1). VBI(i, j) and VM(i, j) are the optimal
free energies of the subsequence from i to j in the case when the (i, j) base pair closes an
internal loop or a multiloop, respectively.
VBI(i, j) = min_{i < i′ < j′ < j} { eL(i, j, i′, j′) + V(i′, j′) }    (3)

where i′ − i + j − j′ − 2 > 0.
The formulation of the multiloop energy function has a linear dependence upon the number of single-stranded bases present in the multiloop. The standard approach is to introduce a 2D array WM to facilitate the calculation of the VM array. Eqs. (4) and (5) show the calculations of WM(i, j) and VM(i, j), respectively.
WM(i, j) = min{ V(i, j) + b, WM(i, j−1) + c, WM(i+1, j) + c, min_{i<k≤j} { WM(i, k−1) + WM(k, j) } }    (4)
VM(i, j) = min_{i+1<h≤j−1} { WM(i+1, h−1) + WM(h, j−1) + a }    (5)
These minimization formulas can be implemented recursively as well as iteratively. We implement an iterative formulation of the algorithm in GTfold, described later in this paper. The implementation uses 1D and 2D arrays holding the W(j), V(i, j), VBI(i, j), VM(i, j), and WM(i, j) values, and the functions calcW(j), calcV(i, j), calcVBI(i, j), calcVM(i, j), and calcWM(i, j) compute the corresponding array elements.
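As a minimal illustration of how Eq. (1) can be coded iteratively, the following C++ sketch computes W(j) from an already filled V table; the 1-based indexing, array layout, and the INF sentinel are assumptions made for this example and do not necessarily match GTfold's internal data structures.

    #include <algorithm>
    #include <climits>
    #include <vector>

    const int INF = INT_MAX / 2;  // sentinel for "no admissible structure"

    // Eq. (1): W(j) = min{ W(j-1), min_{1<=i<j} { V(i,j) + W(i-1) } }.
    // W and V use 1-based indexing to match the notation in the text,
    // with W[0] = 0 by convention.
    int calcW(int j,
              const std::vector<int>& W,
              const std::vector<std::vector<int>>& V) {
        int best = W[j - 1];                        // base j is left unpaired
        for (int i = 1; i < j; ++i)                 // base j pairs with some i
            best = std::min(best, V[i][j] + W[i - 1]);
        return best;
    }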
4.1 Parallelism
The dynamic programming algorithm is computationally intensive both in terms of running time and storage. Its space requirement is O(n^2), as it uses four 2D arrays, V(i, j), VBI(i, j), VM(i, j), and WM(i, j), that are filled up during the algorithm's execution. The main issue is running time rather than memory requirements. For instance, GTfold has a memory footprint of less than 2GB (common in most desktop PCs) even for sequences with 10,000 nucleotides.
The filled arrays are traced backwards to determine the secondary structures. The traceback for a single structure takes far less time than filling up these arrays. The time complexity of the dynamic programming algorithm is O(n^3) with the currently adopted thermodynamic model. The two indices i and j are varied over the entire sequence, and every type of loop for every possible base pair (i, j) is calculated. This results in an asymptotic time complexity of O(n^2) times the maximum time complexity of any type of loop for a base pair (i, j).
Computations of internal loops and multiloops are the most expensive parts of the algorithm. We can see from Eq. (3) that, in the calculation of VBI(i, j), all possible internal loops with the closing base pair (i, j) are considered by varying the indices i′ and j′ over the subsequence from i+1 to j−1 such that i′ < j′. This results in an overall time complexity of O(n^4). To avoid large running times, a commonly used heuristic is to limit the size of internal loops to a threshold k, usually set to 30. This significantly reduces the running time from O(n^4) to O(k^2 n^2). The heuristic is adopted in most of the standard RNA folding programs.
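The heuristic can be sketched as the following C++ fragment, written for illustration only: the helper eL and the array names are placeholders standing for the NNTM internal loop energy function and the tables described above, not GTfold's actual interfaces.

    #include <algorithm>
    #include <climits>
    #include <vector>

    const int INF = INT_MAX / 2;

    // Assumed NNTM helper: energy of the internal loop closed by (i, j)
    // with enclosed pair (ip, jp), provided by the thermodynamic tables.
    int eL(int i, int j, int ip, int jp);

    // Heuristic VBI(i, j): consider only internal loops whose two unpaired
    // sides together contain at most k bases, giving O(k^2) work per (i, j).
    int calcVBIHeuristic(int i, int j, int k,
                         const std::vector<std::vector<int>>& V) {
        int best = INF;
        for (int ip = i + 1; ip < j; ++ip) {
            int left = ip - i - 1;                 // unpaired bases on the left side
            if (left > k) break;                   // left side alone already exceeds k
            for (int jp = j - 1; jp > ip; --jp) {
                int right = j - jp - 1;            // unpaired bases on the right side
                if (left + right > k) break;       // loop size exceeds the threshold
                if (left + right == 0) continue;   // that case is a stack, scored by eS
                best = std::min(best, eL(i, j, ip, jp) + V[ip][jp]);
            }
        }
        return best;
    }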
Lyngsø et al. [10] suggest that this limit may be too small for predictions at higher temperatures and give an optimized and exact algorithm for internal loop calculations which has a time complexity of O(n^3) with the same O(n^2) space. The algorithm searches for all possible internal loops closed by a base pair (i, j). In practice, this algorithm is far slower than the heuristic. Choosing one of the options is a tradeoff of running time versus accuracy. In GTfold we provide an option for the user to select either the heuristic or the internal loop speedup algorithm. Also, our parallelization scheme is valid for both options.
The thermodynamics of multiloops is still not fully understood, but improvements continue to be made. Searching for an optimal multiloop closed by a base pair (i, j) requires searching for the set of enclosed base pairs which makes the loop optimal. To make the multiloop energy function feasible to compute, it may be approximated in O(n^3) time. This function has a linear dependence upon the number of single-stranded bases in the multiloop. The time complexity of an algorithm implementing a more realistic multiloop energy function with a logarithmic dependence upon the number of single-stranded bases in the loop is exponential. Also, many other advanced thermodynamic details, such as coaxial dangling energies, are not implemented in the multiloop energy calculations during the optimization, as they significantly increase the running time.
Both the running time and the space needs are expected to increase with the use of better thermodynamic models. While memory requirements can be satisfied with today's high-end servers with 256GB or more memory, running time will continue to be a major prohibitive factor in solving these grand challenge problems. Our parallelization approach in GTfold is designed with all these factors in consideration.
5 GTfold
5.1 Dependencies and Access Patterns
Figure 3 shows a general ij plane. A valid base pair is defined as (i, j) where j > i. Thus, only
the upper right triangle is valid. Secondary structures can have only nested base pairings,
meaning that if there are two base pairs (i, j) and (i′, j′) such that i < i′ < j, then the constraint i < i′ < j′ < j is also satisfied. This assumption of nested base pairings results in the general
dependency of an element (i, j) on the elements in the Δ ABC as shown in Figure 3. To find
the optimal loop formed by a base pair (i, j), we need to search for all enclosed base pairs
over the subsequence from i + 1 to j − 1. In the case of internal loops we need to search for
one enclosed base pair while for a multiloop we need to search for more than one base pair.
In this fashion, the computation of all types of loops for an element (i, j) follows the above
dependency pattern.
The speedup algorithm for internal loop calculations follows the same general technique, but its access pattern differs. It updates elements outside the dependency triangle ABC shown in Figure 3 for the element A. It is an optimized algorithm that reduces the time complexity while keeping the same space requirements. We define a small internal loop as a loop in which at least one of the sides is shorter than a constant c. The algorithm scores these small internal loops as special cases by applying a function derived from the general internal loop energy function. If an enclosed base pair (i_p, j_p) is better than another candidate enclosed base pair (i_p1, j_p1) for the closing base pair (i, j), then the enclosed base pair (i_p, j_p) will also be better than (i_p1, j_p1) for the closing base pairs of the form (i − b, j + b), where b is a positive integer such that 1 ≤ b ≤ min{i − 1, N − j}. This algorithm uses the best enclosed base pair (i_p, j_p) for the closing base pair (i, j) to evaluate all internal loops closed by base pairs of the form (i − b, j + b) at the same time. This way, at element (i, j), the elements of the form (i − b, j + b) are also accessed. The access pattern of this algorithm is shown in Figure 4, excluding the calculation of the special cases.
Figure 3: The implicit dependency of point A on the elements present in the triangle ABC
Figure 4: The access pattern of VBI(i, j) for the internal loop speedup algorithm
5.2 Approach
In the region of the general 2D ij plane corresponding to j > i, a point (i, j) corresponds to
the computation of energy of the subsequence from i to j. The dependency pattern shown
in Figure 3 allows the calculation of all the elements existing on a line j − i = k to be
independent of each other, where k is any integer from the set {0, 1, 2, . . . , N − 1}. This way
the computation of the line j − i = k can be performed in parallel, and the whole space can
be computed by considering subsequent lines from k = 0 to k = N − 1. Note that the points
on one such line correspond to subsequences of equal length.
Figure 5: Showing the pattern of computation implemented in GTfold
Algorithm 1 arranges the nested for loops to compute in the manner described above.
The first for loop runs for different lines starting from j = i to j = i + N − 1 and the second
for loop calculates all the points on one line in parallel. Figure 5 shows the sequence of
these computations. This parallelization strategy is suitable for future improvements to the
thermodynamic model or to optimizations for computing the various energy functions. This
coarser level of parallelism enables us to exploit more concurrency while offering compatibility
for possible future improvements.
There are other orderings of the computation that cover the whole space without violating
the dependency pattern. One way is to compute the elements column-wise, starting from
j = 1 to j = N. On one column the computation is done for the increasing values of j − i,
i.e. from row i to row 1. A second way is to compute the elements row-wise, starting from
i = N to i = 1. On one row the computation is done for the increasing values of j − i, i.e.
from column j = i to column j = N. These two ways achieve a higher degree of spatial
locality but they are inherently sequential.
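For illustration, the column-wise ordering just described might be coded as the following C++ sketch; the calcCell and calcW callbacks stand for the per-cell computations of Algorithm 1, and the minimum hairpin size enforced by the actual fill is omitted.

    #include <functional>

    // Sequential column-wise fill (contrast with the diagonal ordering of
    // Algorithm 1). For each column j, cells are computed in order of
    // increasing j - i, so every cell in the dependency triangle of (i, j)
    // is already available when (i, j) is processed.
    void fillColumnWise(int N,
                        const std::function<void(int, int)>& calcCell,
                        const std::function<void(int)>& calcW) {
        for (int j = 1; j <= N; ++j) {
            for (int i = j; i >= 1; --i)   // increasing j - i within column j
                calcCell(i, j);            // calcVBI, calcVM, calcV, calcWM for (i, j)
            calcW(j);
        }
    }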
input : Sequence of Length N
output: Optimal Energy of the sequence
begin
    for b ← 0 to N do
        #pragma omp parallel for schedule (guided)
        for i ← 1 to N − b do
            j ← i + b;
            calcVBI(i, j);
            calcVM(i, j);
            calcV(i, j);
            calcWM(i, j);
        end
        calcW(b + 1);
    end
    return W(N);
end
Algorithm 1: Main function to compute the secondary structure of an RNA sequence
Parallelism at individual functions
Parallelism can also be exploited at the finer level of individual functions which compute
the energies for the various kinds of loops for a closing base pair (i, j). The general pattern
of different functions for calculating the energy of these kinds of loops is the same except
for the function that computes internal loop energies using the speedup algorithm. The
general pattern is to consider various possible options of the corresponding type of loop
and select the option that gives the minimum energy. In simplified terms, this pattern of
calculation performs minimization over several possible values. These types of calculations
are easily done in parallel by assigning equal-sized chunks of minimization work to all threads,
collecting the results, and taking the global minimum over all values.
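A minimal OpenMP sketch of this fine-grained pattern is shown below; candidateEnergy is a placeholder for whichever per-option energy evaluation a given loop type requires, and the chunking follows the equal-sized-chunks scheme just described.

    #include <algorithm>
    #include <climits>
    #include <vector>
    #include <omp.h>

    // Fine-grain parallel minimization over nCandidates options for one
    // closing pair (i, j): each thread scans a static chunk, keeps a local
    // minimum, and the global minimum is taken at the end.
    int parallelMin(int i, int j, int nCandidates,
                    int (*candidateEnergy)(int, int, int)) {
        const int INF = INT_MAX / 2;
        std::vector<int> localBest(omp_get_max_threads(), INF);
    #pragma omp parallel
        {
            int myBest = INF;
    #pragma omp for schedule(static)
            for (int c = 0; c < nCandidates; ++c)
                myBest = std::min(myBest, candidateEnergy(i, j, c));
            localBest[omp_get_thread_num()] = myBest;
        }
        return *std::min_element(localBest.begin(), localBest.end());
    }

A per-thread local minimum avoids contention on shared data; with OpenMP 3.1 or later, the same effect could be obtained with a min reduction clause.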
The speedup algorithm for internal loop calculations can be parallelized only for special
cases. The computations of general internal loops using the extension principle are inherently
sequential. However, the generally adopted heuristic option for internal loops has the general minimization pattern, is easily parallelizable, and uses two nested for loops, which results in a complexity of O(k^2) for a particular (i, j). Multiloop calculations also follow the general minimization pattern for the VM and WM arrays, have O(n) time complexity for an element
(i, j), and are amenable to parallelization as well.
5.3 Implementation Details
We use OpenMP [13] to implement shared memory parallelism. The subsequent diagonals are covered by the outer for loop, and parallelism is implemented by applying an OpenMP for loop pragma over the inner for loop to parallelize the computation on the diagonal under consideration, as shown in Algorithm 1. The guided scheduling strategy works best for this parallelization. This is because there may not be equal amounts of work for every point on the diagonal. If the bases i and j are not able to make a pair, then it is not necessary to perform the whole calculation. In this case, for the heuristic option of internal loop calculations, only WM(i, j) needs to be calculated, and for the speedup algorithm, VBI(i, j) also needs to be calculated along with WM(i, j).
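This scheme can be illustrated with the following C++/OpenMP sketch; the helper names (basesCanPair and the calc* routines) are assumptions standing for the checks and per-cell computations described above, and this is not GTfold's actual source.

    #include <omp.h>

    // Assumed per-cell routines and pairing test (see Sections 3-4).
    bool basesCanPair(int i, int j);
    void calcVBI(int i, int j);
    void calcVM(int i, int j);
    void calcV(int i, int j);
    void calcWM(int i, int j);
    void calcW(int j);

    // One pass over diagonal b (all cells with j - i = b). Guided scheduling
    // balances the unequal work: when i and j cannot pair, only WM(i, j)
    // (and, with the speedup option, VBI(i, j)) needs to be computed.
    void fillDiagonal(int N, int b, bool useSpeedup) {
    #pragma omp parallel for schedule(guided)
        for (int i = 1; i <= N - b; ++i) {
            int j = i + b;
            if (basesCanPair(i, j)) {
                calcVBI(i, j);
                calcVM(i, j);
                calcV(i, j);
            } else if (useSpeedup) {
                calcVBI(i, j);   // speedup algorithm still updates VBI here
            }
            calcWM(i, j);
        }
        calcW(b + 1);
    }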
We also explored function-level parallelism for the last few diagonals by choosing a threshold A such that the higher-level parallelism is used for the diagonals up to j − i = A and only function-level parallelism is used from diagonal j − i = A + 1 to j − i = N − 1. This facilitates the use of more threads to exploit more parallelism when there are not enough points on the diagonals. However, this technique did not give us a performance advantage.
Cache locality
For this algorithm, the ratio of computation to memory accesses is low. The energy of a secondary structure is calculated by adding up the energies of the various loops present in the structure. The energy of a sequence is the sum of various energy terms, of which some are read directly from the energy tables and others are calculated by the program. Therefore, large cache sizes and locality of reference for accessing the various data elements play an important role in reducing the running time of GTfold. Computing the elements row-wise or column-wise as described in Section 5.2 provides better cache locality than computing the elements
on the subsequent diagonals. However, these two ways are inherently sequential.
6 Experimental Results
We have performed several experiments to establish that GTfold runs faster than competing
folding programs such as mfold and RNAfold and achieves accuracy comparable with them.
For the running time and accuracy comparisons, we are using RNAfold distributed with
Vienna RNA Package version 1.7.2 and UNAFold version 3.6, which supersedes mfold.
6.1 Energy and Structure Comparison
To establish the accuracy of GTfold, we compare the structures obtained from GTfold,
mfold, and RNAfold, with the correct structures determined with the more reliable method
of comparative sequence analysis [6, 7]. Please note that the comparative sequence analysis
method requires large data sets for the prediction of secondary structures, and therefore, its
application is limited by the availability of the required datasets.
Doshi et al. [4] take a phylogenetically diverse dataset of ribosomal RNA sequences and
compare the optimal secondary structures predicted using mfold 2.3 and mfold 3.1 with
the correct structures. Here we use the 16S and 23S ribosomal RNA sequences from Figure 1 and Table 4 of their study [4] for accuracy comparisons; the sequences are taken from the
Gutell database [2]. We provide a GenBank accession number for each of these sequences.
For predicting structures with UNAFold and RNAfold, their default command line options are used. Accuracy of the structures is calculated in the same manner as in [4], with one difference: we include non-canonical base pairs in the comparison instead of excluding them. This affects the accuracy of all three programs in the same manner, because none of them is able to predict non-canonical base pairs due to the lack of energy parameters for them. The accuracy is calculated as the percentage of predicted base pairs that are also present in the correct secondary structure.
Table 1: Free energy (in Kcal/mole) comparison of GTfold, UNAFold and RNAfold for 16S rRNA sequences

Sequence        Length    GTfold   UNAFold   RNAfold
X00794            1962   -741.90   -722.70   -746.60
X54253             701   -149.00   -141.30   -149.03
X54252             697   -142.50   -137.50   -142.52
Z17224            1550   -564.80   -549.10   -565.12
X65063            1432   -582.00   -570.80   -581.94
X52949            1452   -802.70   -794.50   -804.40
X98467            1295   -487.00   -460.00   -489.31
Y00266/M24612     1244   -325.60   -317.30   -328.80
K00421            1474   -687.00   -682.10   -687.01
Table 2: Accuracy comparison (in percent) of GTfold, UNAFold and RNAfold for 16S rRNA sequences of Table 1

Sequence        GTfold   UNAFold   RNAfold
X00794           30.33     31.65     27.91
X54253           25.67     20.32     25.13
X54252           21.16     21.64     21.16
Z17224           26.03     24.57     24.57
X65063           24.09     22.02     23.83
X52949           15.07     16.08     15.07
X98467           15.56     10.97     16.33
Y00266/M24612    19.19     17.30     18.11
K00421           76.42     75.76     76.42
Table 1 shows the optimal free energy of various 16S ribosomal RNA sequences predicted
with GTfold, UNAFold and RNAfold. Table 2 shows the accuracy comparison for the three
programs for the sequences of Table 1. Similarly Table 3 shows the optimal free energy
obtained using the programs for 23S ribosomal RNA sequences, and Table 4 shows the
accuracy comparison for the sequences of Table 3.
Table 3: Free energy (in Kcal/mole) comparison of GTfold, UNAFold and RNAfold for 23S rRNA sequences

Sequence   Length     GTfold    UNAFold    RNAfold
X14386       3105    -791.30    -775.10    -792.65
X54252        953    -180.20    -173.50    -179.71
X52392       1621    -395.80    -389.70    -397.85
J01527       3273    -700.60    -684.60    -702.87
K01868       3514   -1328.50   -1294.30   -1333.95
X53361       4052   -1692.90   -1665.60   -1696.49
X52949       2850   -1707.90   -1689.20   -1709.80
M67497       3029   -1661.10   -1647.10   -1668.12
Table 4: Accuracy comparison (in percent) of GTfold, UNAFold and RNAfold for 23S rRNA sequences of Table 3

Sequence   GTfold   UNAFold   RNAfold
X14386      21.77     18.79     18.32
X54252      23.74     21.92     23.29
X52392      24.44     25.56     24.16
J01527      24.72     30.11     25.14
K01868      20.67     22.01     17.85
X53361      21.54     16.59     15.44
X52949      34.44     31.24     26.69
M67497      64.05     63.6      63.94
The small differences in the optimal free energy scores predicted by the three programs are due to algorithmic issues and policies related to thermodynamic aspects. The energy comparisons shown in Tables 1 and 3 demonstrate that our energy scores for the various sequences lie within a very small range of those from these standard programs. The accuracy percentages shown in Tables 2 and 4 establish that GTfold achieves accuracy comparable with UNAFold and RNAfold for the diverse dataset chosen. Accuracy comparisons for the various ribosomal sequences show that, in general, the accuracy of the prediction programs is very low. The
prediction accuracy is expected to improve with the inclusion of advanced thermodynamic
details that are not presently incorporated due to high computational cost. The development
of GTfold facilitates the implementation of these improvements.
6.2 Running Time Comparison
GTfold implements parallelism for shared memory multiprocessor and multicore systems.
Running time experiments are performed on an IBM P5-570 server with 16 dual core 1.9
GHz CPUs and 256 GB of main memory with L2 cache of 1.9 MB per CPU. GTfold is
compiled with the IBM xlC compiler Enterprise Edition 7.0, with the -q64 option for 64-bit compilation, the -O3 level of optimization, and the -qsmp=omp option for OpenMP support. RNAfold
and UNAFold are compiled with their default compiler and compilation options. An additional flag -maix64 is set while compiling UNAFold and RNAfold due to the runtime memory
limitations on the system for 32-bit compilations.
Figure 6: Comparison of running times for predicting the RNA secondary structures of 11
picornaviral sequences. The sequences are arranged in increasing order of length from 7124
to 8214 nucleotides.
Palmenberg and Sgro in 1997 [14] investigated the optimal and suboptimal secondary
structures of 11 picornaviral RNA sequences using mfold version 2.2. The length of the
sequences varies from 7124 to 8214 nucleotides. They report that each sequence required
5-7 days of CPU time on a workstation of that era, so that all 11 sequences took 2 to 3 months in total. In stark comparison, GTfold finishes the execution of this set of sequences
in approximately 8 minutes using 32 threads.
In Figure 6 we compare the running times of GTfold with 32 threads, UNAFold, and RNAfold for all the picornaviral sequences on the same machine. We can see that GTfold runs one to two orders of magnitude faster than the standard sequential programs UNAFold and RNAfold.
Figure 7: Comparison of running times for predicting the RNA secondary structure of the HIV-1 virus. The dashed horizontal lines represent the sequential running time of UNAFold and RNAfold.
Figure 7 compares the running time of GTfold, UNAFold, and RNAfold, for an HIV-1
sequence (accession number Z11530) with 9,781 nucleotides. The secondary structure of the viral sequence predicted with GTfold is shown in Figure 1. All three programs implement the heuristic option and limit the internal loop size to 30. It is clear from the graph that GTfold with one thread performs much better than UNAFold and is comparable with RNAfold.
The running time of GTfold decreases with the increasing number of processors. Even with
two threads GTfold runs 2.06 times faster than RNAfold. GTfold folds the entire HIV viral
sequence in 84 seconds with 32 threads in comparison to UNAFold and RNAfold which take
approximately 2.4 hours and 27 minutes, respectively.
Figure 8: GTfold running time statistics for a Homo sapiens 23S ribosomal RNA sequence
with accession number J01866/M11167 using the Internal Loop Speedup Algorithm
GTfold also implements an exact algorithm for finding the optimal internal loops, called the Internal Loop Speedup Algorithm (ILSA). Internal loops with sizes larger than 30 are observed in nature, but they are rare; ILSA can catch these exceptional cases. It is far more expensive to run this algorithm than the commonly used
rarity in nature. It is far more expensive to run this algorithm than the commonly used
heuristic. Figure 8 shows the running time of GTfold with the varying number of processors
for a 5,184-nucleotide 23S ribosomal RNA sequence of Homo sapiens with accession number
J01866/M11167. GTfold is able to reduce the running time from 512 minutes (approximately
9 hours) to 21.5 minutes by using 32 threads. In this way, we show that optimized algorithms such as the internal loop speedup algorithm can be executed with GTfold in an affordable time.
Please note that UNAFold and RNAfold do not implement the ILSA algorithm and can miss
the rare possibilities.
Figure 9 shows the speedup achieved using 2, 4, 8, 16, and 32 threads for GTfold using the internal loop speedup algorithm (ILSA) option for the sequence with accession number J01866/M11167 and the heuristic option with an HIV viral sequence. The maximum speedup achieved in the first and second cases is approximately 23.8 and 19.8, respectively. We have achieved slightly superlinear speedups for 2, 4, and 8 threads in the case of the heuristic option, due to the better cache locality when the number of processors is more than one.
7 Conclusions
We have developed GTfold, a parallel and multicore code for predicting RNA secondary
structures that achieves 19.8-fold speedups over the current best sequential program for
large, important RNA sequences and has accuracy comparable to the existing standard
folding programs. We have analyzed the computational requirements of the problem and implemented shared memory parallelism to address them. The development of GTfold removes the prohibitive running time for large RNA sequences and will help the
molecular biology community to work towards the problem of predicting more accurate
secondary structures. We have reduced the running time from approximately 2 months to 8
minutes for folding 11 picornaviral sequences.
Figure 9: Speedups obtained by GTfold with the heuristic algorithm for the HIV-1 virus and
with the internal loop speedup algorithm for the Homo sapiens ribosomal RNA sequence.
Speedup is with respect to GTfold running on one processor for each series.
8 Acknowledgments
This work was supported in part by National Institutes of Health grant NIH NIGMS R01
GM083621, and by NSF Grants CNS-0614915 and DBI-04-20513. Additionally, the research
of Christine E. Heitsch, Ph.D., is supported in part by a Career Award at the Scientific Interface (CASI) from the Burroughs Wellcome Fund (BWF). We acknowledge visiting students
Gregory Nou, Sonny Hernández, and Manoj Soni, who helped with the implementation of
GTfold.
References
[1] M. Andronescu, R. Aguirre-Hernandez, A. Condon, and H.H. Hoos. RNAsoft: a suite of
RNA secondary structure prediction and design software tools. Nucleic Acids Research,
31(13):3416–3422, 2003.
[2] J.J. Cannone, S. Subramanian, M.N. Schnare, J.R. Collett, L.M. D’Souza, Y. Du,
B. Feng, N. Lin, L.V. Madabusi, K.M. Müller, N. Pande, Z. Shang, N. Yu, and R.R.
Gutell. The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC
Bioinformatics, 3(1), 2002.
[3] K. Clyde and E. Harris. RNA secondary structure in the coding region of dengue virus
type 2 directs translation start codon selection and is required for viral replication. J.
Virol., 80(5):2170–2082, 2006.
[4] K.J. Doshi, J.J. Cannone, C.W. Cobaugh, and R.R. Gutell. Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA
secondary structure prediction. BMC Bioinformatics, 2004.
[5] M. Fekete, I.L. Hofacker, and P.F. Stadler. Prediction of RNA base pairing probabilities
on massively parallel computers. J. Computational Biology, 7(1-2):171–182, 2000.
[6] P.P. Gardner and R. Giegerich. A comprehensive comparison of comparative RNA
structure prediction approaches. BMC Bioinformatics, 5(140), 2004.
[7] R.R. Gutell, J.C. Lee, and J.J. Cannone. The accuracy of ribosomal RNA comparative
structure models. Curr. Opin. Struct. Biol., 12(3):301–310, 2002.
[8] I.L. Hofacker, M.A. Huynen, P.F. Stadler, and P.E. Stolorz. Knowledge discovery in
RNA sequence families of HIV using scalable computers. In Proc. of the 2nd Int’l Conf.
on Knowledge Discovery and Data Mining, Portland, OR, 1996.
[9] I.L. Hofacker, P.F. Stadler, L.S. Bonhoeffer, M. Tacker, and P. Schuster. Fast folding
and comparison of RNA secondary structures. Monatshefte für Chemie, 125:167–188,
1994.
[10] R.B. Lyngsø, M. Zuker, and C.N.S. Pedersen. Internal loops in RNA secondary structure
prediction. In Proc. of the 3rd Ann. Int’l Conf. on Computational Molecular Biology
(RECOMB), pages 260–267, 1999.
[11] D.H. Mathews, J. Sabina, M. Zuker, and D.H. Turner. Expanded sequence dependence
of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol.
Biol., 288:911–940, 1999.
[12] A. Nakaya, K. Yamamoto, and A. Yonezawa. RNA secondary structure prediction using
highly parallel computers. Bioinformatics, 11(6):685–692, 1995.
[13] OpenMP Architecture Review Board. OpenMP Application Program Interface, 3.0 edition, May 2008.
[14] A.C. Palmenberg and J.-Y. Sgro. Topological organization of picornaviral genomes:
Statistical prediction of RNA structural signals. Sem. Virol., 8:231–241, 1997.
[15] M. Taufer, T. Solorio, A. Licon, D. Mireles, and M.-Y. Leung. On the effectiveness of
rebuilding RNA secondary structures from sequence chunks. In 7th IEEE Int’l Workshop
on High Performance Computational Biology (HiCOMB), pages 1–8, 2008.
[16] M. Weik, J. Modrof, H.-D. Klenk, S. Becker, and E. Mühlberger. Ebola virus VP30-mediated transcription is regulated by RNA secondary structure formation. J. Virol.,
76(17):8532–8539, 2002.
[17] E.M. Westerhout, M. Ooms, M. Vink, A.T. Das, and B. Berkhout. HIV-1 can escape
from RNA interference by evolving an alternative structure in its RNA genome. Nucleic
Acids Research, 33(2):796–804, 2005.
[18] S. Wuchty, W. Fontana, I.L. Hofacker, and P. Schuster. Complete suboptimal folding
of RNA and the stability of secondary structures. Biopolymers, 49(2):145–165, 1999.
[19] W. Zhou and D.K. Lowenthal. A parallel, out-of-core algorithm for RNA secondary
structure prediction. In Int’l Conf. on Parallel Processing (ICPP), pages 74–81, 2006.
[20] M. Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic
Acids Research, 31(13):3406–3415, 2003.
[21] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using
thermodynamic and auxiliary information. Nucleic Acids Research, 9(1):133–148, 1981.
A Pseudocode
The algorithm for predicting MFE structures can be divided into two steps. The first step finds the optimal free energy score achievable by any possible secondary structure of the sequence. The second step, called traceback, works backward through the filled tables to recover a structure corresponding to the MFE score. Note that, depending on the sequence, one or more secondary structures may correspond to this score. In this section we provide the pseudocode for the fill step. The algorithm for optimal and complete suboptimal traceback has been described by Wuchty et al. [18]. The fill step is the dynamic programming algorithm for predicting nested RNA secondary structures given by Zuker and Stiegler [21].
The minimization formulas defining the algorithm are recursive, so the order in which the various functions are computed for a given element (i, j) matters. For example, WM(i, j) uses the value of V(i, j), so calcWM(i, j) should be called after calcV(i, j). Similarly, calcVBI(i, j) and calcVM(i, j) should be called before calcV(i, j). The function call calculate(N), where N is the sequence length, computes the MFE score for the whole sequence (Algorithm 2). The functions calcVBI(i, j), calcWM(i, j), calcVM(i, j), calcV(i, j), and calcW(j) are computed as shown in Algorithms 3, 4, 5, 6, and 7, respectively. Pseudocode for the internal loop calculation using the heuristic option is given in Algorithm 8, and for the internal loop speedup algorithm in Algorithms 9, 10, and 11.
input : Sequence of length N
output: Optimal energy of the sequence
begin
    readEnergyTables();
    readSequence();
    for j ← 1 to N do
        for i ← 1 to N do
            VBI(i, j) ← INF;
            VM(i, j) ← INF;
            V(i, j) ← INF;
            WM(i, j) ← INF;
        end
        W(j) ← INF;
    end
    for b ← 4 to N − 1 do
        for i ← 1 to N − b do
            j ← i + b;
            calcVBI(i, j);
            calcVM(i, j);
            calcV(i, j);
            calcWM(i, j);
        end
        calcW(b + 1);
    end
    return W(N);
end
Algorithm 2: calculate(N)
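For readers who prefer compilable code, the following is a minimal C sketch of the fill loop in Algorithm 2, with an OpenMP pragma illustrating that all cells on a single anti-diagonal b are independent and may be computed concurrently. The helper functions are assumed to follow Algorithms 3 through 7 and to be provided elsewhere; this is an illustrative sketch under those assumptions, not necessarily GTfold's exact scheduling or data layout.

    #include <omp.h>

    /* Assumed to implement Algorithms 3-7; provided elsewhere. */
    extern void calcVBI(int i, int j);
    extern void calcVM(int i, int j);
    extern void calcV(int i, int j);
    extern void calcWM(int i, int j);
    extern void calcW(int j);
    extern int  W(int j);   /* read access to the W values */

    /* Fill step: compute the tables in order of increasing span b = j - i. */
    int fill(int N)
    {
        for (int b = 4; b <= N - 1; b++) {
            /* Cells (i, i + b) on this anti-diagonal depend only on shorter
               spans, so the loop over i can run in parallel. */
            #pragma omp parallel for schedule(dynamic)
            for (int i = 1; i <= N - b; i++) {
                int j = i + b;
                calcVBI(i, j);   /* internal loop contribution   */
                calcVM(i, j);    /* multiloop contribution       */
                calcV(i, j);     /* uses VBI(i, j) and VM(i, j)  */
                calcWM(i, j);    /* uses V(i, j)                 */
            }
            calcW(b + 1);        /* external contribution, one cell per b */
        }
        return W(N);             /* optimal energy of the sequence */
    }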
input : Base indices i and j
output: VBI(i, j)
begin
    for ip ← i + 1 to j − 2 do
        for jp ← ip + 1 to j − 1 do
            VBI(i, j) ← MIN(VBI(i, j), eL(i, j, ip, jp) + V(ip, jp));
        end
    end
    return VBI(i, j);
end
Algorithm 3: Naive calcVBI(i, j)
input : Base indices i and j
output: WM(i, j)
begin
    WMij ← V(i, j) + auPen(i, j) + b;
    WMidj ← V(i + 1, j) + dangle-3'(j, i + 1, i) + auPen(i + 1, j) + b + c;
    WMijd ← V(i, j − 1) + dangle-5'(j − 1, i, j) + auPen(i, j − 1) + b + c;
    WMidjd ← V(i + 1, j − 1) + dangle-3'(j − 1, i + 1, i)
             + dangle-5'(j − 1, i + 1, j) + auPen(i + 1, j − 1) + b + 2c;
    WM(i, j) ← MIN(WMij, WMidj, WMijd, WMidjd);
    for h ← i to j − 1 do
        WM(i, j) ← MIN(WM(i, j), WM(i, h) + WM(h + 1, j));
    end
    WM(i, j) ← MIN(WM(i + 1, j) + c, WM(i, j − 1) + c, WM(i, j));
    return WM(i, j);
end
Algorithm 4: calcWM(i, j)
input : Base indices i and j
output: VM(i, j)
begin
    // a, b, c are the multiloop offset, helix penalty, and free base penalty
    VMij ← VMidj ← VMijd ← VMidjd ← INF;
    for h ← i + 2 to j − 1 do
        VMij ← MIN(VMij, WM(i + 1, h − 1) + WM(h, j − 1));
    end
    for h ← i + 3 to j − 1 do
        VMidj ← MIN(VMidj, WM(i + 2, h − 1) + WM(h, j − 1));
    end
    VMidj ← VMidj + dangle-5'(i, j, i + 1) + c;
    for h ← i + 2 to j − 2 do
        VMijd ← MIN(VMijd, WM(i + 1, h − 1) + WM(h, j − 2));
    end
    VMijd ← VMijd + dangle-3'(i, j, j − 1) + c;
    for h ← i + 3 to j − 2 do
        VMidjd ← MIN(VMidjd, WM(i + 2, h − 1) + WM(h, j − 2));
    end
    VMidjd ← VMidjd + dangle-5'(i, j, i + 1) + dangle-3'(i, j, j − 1) + 2c;
    VM(i, j) ← MIN(VMij, VMidj, VMijd, VMidjd);
    VM(i, j) ← VM(i, j) + a + b + auPen(i, j);
    return VM(i, j);
end
Algorithm 5: calcVM(i, j)
input : Base indices i and j
output: V(i, j)
begin
    V(i, j) ← MIN(eH(i, j), eS(i, j) + V(i + 1, j − 1), VBI(i, j), VM(i, j));
    return V(i, j);
end
Algorithm 6: calcV(i, j)
input : Base index j
output: Optimal energy of the subsequence from 1 to j, W(j)
begin
    for i ← 1 to j − 1 do
        wim1 ← MIN(0, W(i − 1));
        Wij ← V(i, j) + auPen(i, j) + wim1;
        Widj ← V(i + 1, j) + dangle-3'(j, i + 1, i) + auPen(i + 1, j) + wim1;
        Wijd ← V(i, j − 1) + dangle-5'(j − 1, i, j) + auPen(i, j − 1) + wim1;
        Widjd ← V(i + 1, j − 1) + dangle-3'(j − 1, i + 1, i)
                + dangle-5'(j − 1, i + 1, j) + auPen(i + 1, j − 1) + wim1;
        W(j) ← MIN(W(j), Wij, Widj, Wijd, Widjd);
    end
    W(j) ← MIN(W(j), W(j − 1));
    return W(j);
end
Algorithm 7: calcW(j)
input : Base indices i and j
output: VBI(i, j)
begin
    maxloop ← 30;
    for ip ← i + 1 to i + maxloop + 1 do
        for jp ← j − maxloop − 1 + (ip − i − 1) to j − 1 & jp > ip do
            VBI(i, j) ← MIN(VBI(i, j), eL(i, j, ip, jp) + V(ip, jp));
        end
    end
    return VBI(i, j);
end
Algorithm 8: calcVBI(i, j) using the heuristic
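As a companion to Algorithm 8, the following is a small, self-contained C sketch of the heuristic internal loop calculation. It shows how capping the number of unpaired bases at maxloop = 30 bounds the work per cell (i, j) by a constant (roughly maxloop squared candidate interior pairs), which is what reduces the overall internal loop cost from O(n^4) to O(n^2). The names eL, V, MAXLOOP, and INF are placeholders for illustration, not GTfold's actual identifiers.

    #define MAXLOOP 30      /* maximum number of unpaired bases in an internal loop */
    #define INF     999999

    /* Assumed to be provided elsewhere: internal loop energy and the V table. */
    extern int eL(int i, int j, int ip, int jp);
    extern int V(int ip, int jp);

    /* Heuristic internal loop calculation for the cell (i, j), as in Algorithm 8. */
    int calcVBI_heuristic(int i, int j)
    {
        int best = INF;
        for (int ip = i + 1; ip <= i + MAXLOOP + 1 && ip < j - 1; ip++) {
            /* Keep the total unpaired length (ip - i - 1) + (j - jp - 1) <= MAXLOOP. */
            int start = j - MAXLOOP - 1 + (ip - i - 1);
            if (start <= ip)
                start = ip + 1;
            for (int jp = start; jp <= j - 1; jp++) {
                int e = eL(i, j, ip, jp) + V(ip, jp);
                if (e < best)
                    best = e;
            }
        }
        return best;
    }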
input : Base indices i and j
output: VBI(i, j)
begin
    c ← 3;
    // bases ip and jp form the interior base pair
    // Case 1: first side < c
    for ip ← i + 1 to i + c do
        for jp ← ip + 1 to j − 1 do
            VBI(i, j) ← MIN(VBI(i, j), eL(i, j, ip, jp) + V(ip, jp));
        end
    end
    // Case 2: first side >= c and second side < c
    for ip ← i + c + 1 to j − 2 do
        for jp ← j − c to j − 1 & jp > ip do
            VBI(i, j) ← MIN(VBI(i, j), eL(i, j, ip, jp) + V(ip, jp));
        end
    end
    // Case 3: general case, with three subcases
    // Case 3.1: both sides = c
    ip ← i + c + 1;
    jp ← j − c − 1;
    VBI(i, j) ← MIN(VBI(i, j), eL(i, j, ip, jp) + V(ip, jp));
    extend1(i, j, ip, jp);
    // Case 3.2: first side = c + 1, second side = c
    ip ← i + c + 2;
    jp ← j − c − 1;
    E1 ← eL(i, j, ip, jp) + V(ip, jp);
    // Case 3.3: first side = c, second side = c + 1
    ip1 ← i + c + 1;
    jp1 ← j − c − 2;
    E2 ← eL(i, j, ip1, jp1) + V(ip1, jp1);
    VBI(i, j) ← MIN(VBI(i, j), E1, E2);
    if E1 > E2 then
        ip ← i + c + 1;
        jp ← j − c − 2;
    end
    extend2(i, j, ip, jp);
    return VBI(i, j);
end
Algorithm 9: calcVBI(i, j) using the internal loop speedup algorithm
input : Indices i, j, ip, and jp
begin
    iv ← i + c + 1;
    jv ← j − c − 1;
    for b ← 1 to MIN(i − 1, N − j) do
        VBI(i − b, j + b) ← MIN(VBI(i − b, j + b), eL(i, j, ip, jp) + V(ip, jp));
        // Two more options
        if VBI(i − b, j + b) > eL(i, j, iv + b, jv + b) + V(iv + b, jv + b) then
            ip ← iv + b;
            jp ← jv + b;
            VBI(i − b, j + b) ← eL(i, j, ip, jp) + V(ip, jp);
        end
        if VBI(i − b, j + b) > eL(i, j, iv − b, jv − b) + V(iv − b, jv − b) then
            ip ← iv − b;
            jp ← jv − b;
            VBI(i − b, j + b) ← eL(i, j, ip, jp) + V(ip, jp);
        end
    end
end
Algorithm 10: extend1(i, j, ip, jp)
input : Indices i, j, ip, and jp
begin
    iv ← i + c + 1;
    jv ← j − c − 1;
    for b ← 1 to MIN(i − 1, N − j) do
        VBI(i − b, j + b) ← MIN(VBI(i − b, j + b), eL(i, j, ip, jp) + V(ip, jp));
        // Two more options
        if VBI(i − b, j + b) > eL(i, j, iv + b + 1, jv + b) + V(iv + b + 1, jv + b) then
            ip ← iv + b + 1;
            jp ← jv + b;
            VBI(i − b, j + b) ← eL(i, j, ip, jp) + V(ip, jp);
        end
        if VBI(i − b, j + b) > eL(i, j, iv − b, jv − b − 1) + V(iv − b, jv − b − 1) then
            ip ← iv − b;
            jp ← jv − b − 1;
            VBI(i − b, j + b) ← eL(i, j, ip, jp) + V(ip, jp);
        end
    end
end
Algorithm 11: extend2(i, j, ip, jp)