Fast Folding and Comparison of RNA Secondary Structures

Ivo L. Hofacker
Walter Fontana
Peter F. Stadler
L. Sebastian Bonhoeffer
Manfred Tacker
SFI WORKING PAPER: 1993-07-044
SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the
views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or
proceedings volumes, but not papers that have already appeared in print. Except for papers by our external
faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or
funded by an SFI grant.
©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure
timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights
therein are maintained by the author(s). It is understood that all persons copying this information will
adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only
with the explicit permission of the copyright holder.
www.santafe.edu
SANTA FE INSTITUTE
Fast Folding and Comparison of RNA
Secondary Structures
(The Vienna RNA Package)
By
IVO L. HOFACKER a,*, WALTER FONTANA c, PETER F. STADLER a,c,
L. SEBASTIAN BONHOEFFER d, MANFRED TACKER a AND
PETER SCHUSTER a,b,c
a Institut für Theoretische Chemie der Universität Wien
A-1090 Wien, Austria
b Institut für Molekulare Biotechnologie
D-07745 Jena, Germany
c Santa Fe Institute
Santa Fe, NM 87501, U.S.A.
d Department of Zoology, University of Oxford
South Parks Road, Oxford OX1 3PS, UK
* Mailing Address:
Institut für Theoretische Chemie der Universität Wien
Währingerstraße 17, A-1090 Vienna, Austria
Email: ivo@otter.itc.univie.ac.at
Phone: **43 1 40 480 678, Fax: **43 1 40 28 525
HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF
RNAs
Abstract
We present the Vienna RNA package, a set of computer codes for the computation
and comparison of RNA secondary structures. The codes are based on dynamic
programming algorithms and aim at predicting structures with minimum free
energies as well as at computing equilibrium partition functions and base
pairing probabilities.
An efficient heuristic for the inverse folding problem of RNA is introduced.
In addition we present compact and efficient programs for the comparison of
RNA secondary structures based on tree editing and alignment.
All computer codes are written in ANSI C. They include implementations of
modified algorithms on parallel computers with distributed memory. Performance
analysis carried out on an Intel Hypercube shows that parallel computing becomes
increasingly efficient as the sequences get longer.
1. Introduction
Recent interest in RNA structures and functions was caused by their
catalytic capacities (Cech & Bass 1986, Symons 1992) as well as by the
success of selection methods in producing RNA molecules with perfectly
tailored properties (Beaudry & Joyce 1992, Ellington & Szostak 1990).
Conventional structure analysis concentrates on natural molecules and closely
related variants which are accessible by site-directed mutagenesis. Several
current projects are much more ambitious (particularly encouraged by the ready
availability of random RNA sequences) and aim at the exploration of
sequence-structure relations in full generality (Fontana et al. 1991, 1993a,b).
The new approach turned out to be successful on the level of RNA secondary
structures. In order to do proper statistics, millions of structures
derived from arbitrary sequences have to be analyzed. In addition, folding of
long sequences is becoming more and more important. Both tasks call
for fast and efficient folding algorithms available on conventional sequential
computers as well as on parallel machines. The need arises to compare the
performance of sequential and parallel implementations in order to provide
information for the conception of optimal strategies for given tasks.
The inverse folding problem is one of several new issues brought up by
recent developments in the rational design of RNA molecules: given an RNA
secondary structure, which RNA sequences form this structure as
their minimum free energy structure? The information about many such
"structurally neutral" sequences is the basis for tailoring RNA molecules which
are suitable candidates for multi-functional molecules. The growing amount of
available sequence data calls for efficient comparisons, either
directly or on the level of the minimum free energy structures. Conventional
alignment techniques are supplemented by new approaches like statistical
geometry (Eigen et al. 1988) and split decomposition (Bandelt & Dress 1992).
In this paper we introduce a package for computation, comparison and
analysis of RNA secondary structures and properties derived from them, the
Vienna RNA Package. The core of the package consists of compact codes to
compute either minimum free energy structures (Zuker & Stiegler 1981, Zuker
& Sankoff 1984) or partition functions of RNA molecules (McCaskill 1990).
Both use the idea of dynamic programming originally applied by Waterman
(Waterman 1978, Waterman & Smith 1978, Nussinov & Jacobson 1980).
Non-thermodynamic criteria of structure formation like maximum matching
(the maximal number of base pairs; Nussinov et al. 1978) or various versions
of kinetic folding (Martinez 1984) can be applied as alternative options. An
inverse folding heuristic is implemented to determine sets of structurally
neutral sequences. A statistics package is included which contains routines for
cluster analysis, statistical geometry, and split decomposition. This core is
now available as a library as well as a set of stand-alone routines.
In a forthcoming version the package will include routines for secondary
structure statistics (Fontana et al. 1993b), statistical analysis of RNA folding
landscapes as well as sequence-structure maps (Schuster et al. 1993). Further
options will be available for RNA melting kinetics, in particular for the
computation of melting curves and their first derivatives (Tacker et al. 1993).
Extensions of the package provide access to computer codes for optimization
of RNA secondary structures according to predefined criteria, as well as
simulations of molecular evolution experiments in flow reactors (Fontana &
Schuster 1987, Fontana et al. 1989).
In section 2 we present the core codes for folding as well as some I/O
routines that can be used for stand-alone applications of the folding programs.
Section 3 introduces a variant of the folding program which is suitable
for implementation on a parallel computer with hypercube architecture.
Section 4 deals with the inverse folding problem. Section 5 describes codes
for comparing RNA secondary structures as well as base-pairing matrices
derived from partition functions. Basic to our routines is a tree representation
of RNA secondary structures introduced previously (Fontana et al. 1993b).
Some examples of selected applications of the Vienna RNA package are given
in section 6.
2. RNA Folding Programs
A secondary structure on a sequence is a list of base pairs (i, j) with i < j
such that for any two base pairs (i, j) and (k, l) with i ≤ k the following holds:
\[ i = k \iff j = l, \qquad k < j \implies i < k < l < j \tag{1} \]
The first condition implies that each nucleotide can take part in not more
than one base pair, the second condition forbids knots and pseudoknots. The
latter restriction is necessary for dynamic programming algorithms. A base
pair (k, l) is interior to the base pair (i, j) if i < k < l < j. It is immediately
interior if there is no base pair (p, q) such that i < p < k < l < q < j. For each
base pair (i, j) the corresponding loop is defined as consisting of (i, j) itself, the
base pairs immediately interior to (i, j) and all unpaired regions connecting
these base pairs. The energy of the secondary structure is assumed to be the
sum of the energy contributions of all loops. (Note that a stacked base pair
constitutes a loop of zero size.) As a consequence of the additivity of the
energy contributions, the minimum free energy can be calculated recursively
by dynamic programming (Waterman 1978, Waterman & Smith 1978, Zuker
& Stiegler 1981, Zuker & Sankoff 1984).
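The two conditions of equation (1) and the notion of an immediately interior base pair can be made concrete in a few lines. The following is a minimal Python sketch, not part of the package (which is written in C); the function names and the representation of a structure as a list of 1-based (i, j) tuples are assumptions made for illustration.

```python
from typing import List, Tuple

Pair = Tuple[int, int]

def is_secondary_structure(pairs: List[Pair]) -> bool:
    """Check condition (1): every base pairs at most once, and no two pairs cross."""
    used = set()
    for i, j in pairs:
        if not i < j or i in used or j in used:
            return False              # malformed pair or base used twice
        used.update((i, j))
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:         # (k, l) opens inside (i, j) but closes outside
                return False
    return True

def immediately_interior(pairs: List[Pair], closing: Pair) -> List[Pair]:
    """Base pairs immediately interior to `closing`; together with `closing`
    they delimit the loop closed by that pair."""
    i, j = closing
    inside = [(k, l) for k, l in pairs if i < k and l < j]
    return sorted((k, l) for k, l in inside
                  if not any(p < k and l < q for p, q in inside))
```

For a stacked pair the returned list contains exactly one pair directly adjacent to the closing pair, which corresponds to the "loop of zero size" mentioned above.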
Experimental energy parameters are available for the contribution of an
individual loop as functions of its size, of the type of its delimiting base pairs,
and partly of the sequence of the unpaired strands. These are usually measured
for T = 37°C and 1 M sodium chloride solutions (Freier et al. 1986,
Jaeger et al. 1989). For base pair stacking the enthalpic and entropic
contributions are known separately. Contributions from all other loop types
are assumed to be purely entropic. This allows the temperature
dependence of the free energy contributions to be computed:
\[ \Delta G_{\mathrm{stack}} = \Delta H_{37,\mathrm{stack}} - T\,\Delta S_{37,\mathrm{stack}}, \qquad \Delta G_{\mathrm{loop}} = -T\,\Delta S_{37,\mathrm{loop}} \tag{2} \]
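As a small worked illustration of equation (2), the sketch below rescales a stacking energy and a loop energy to another temperature. It is a hedged Python example, not the package's code: the function names are hypothetical, the numerical values in the usage note are made up, and the entropy is derived from a tabulated 37°C free energy under the assumption that enthalpy and entropy are temperature independent.

```python
KELVIN_OFFSET = 273.15        # converts degrees Celsius to Kelvin
T37 = 37.0 + KELVIN_OFFSET    # reference temperature of the parameter set

def entropy_from_dG37(dG37: float, dH37: float) -> float:
    """Entropy implied by a tabulated free energy at 37 C and its enthalpy."""
    return (dH37 - dG37) / T37

def stack_free_energy(dH37: float, dS37: float, temp_celsius: float) -> float:
    """Equation (2), stacking term: dG = dH - T * dS."""
    return dH37 - (temp_celsius + KELVIN_OFFSET) * dS37

def loop_free_energy(dS37: float, temp_celsius: float) -> float:
    """Equation (2), loop term: purely entropic, dG = -T * dS."""
    return -(temp_celsius + KELVIN_OFFSET) * dS37
```

With the illustrative values dG37 = -2.0 and dH37 = -10.0 kcal/mol, `stack_free_energy` reproduces -2.0 at 37°C and gives a less favourable (less negative) value at higher temperatures.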
We use a recent version of the parameter set published by Freier et al. (1986),
which was supplied in an updated version by Danielle Konings. In the current
implementation we do not consider dangling ends. The essential part of the
energy minimization algorithm is shown in table 1.
The structure (list of base pairs) leading to the minimum energy is usually retrieved later on by "backtracking" through the energy arrays.
Table 1: Pseudocode of the minimum free energy folding algorithm.
for(d=1...n)
  for(i=1...n-d)
    j=i+d
    C[i,j] = MIN(
      Hairpin(i,j),
      MIN( i<p<q<j : Interior(i,j;p,q)+C[p,q] ),
      MIN( i<k<j : FM[i+1,k]+FM[k+1,j-1]+cc )
    )
    F[i,j] = MIN( C[i,j], MIN( i<k<j : F[i,k]+F[k+1,j] ) )
    FM[i,j] = MIN( C[i,j]+ci, FM[i+1,j]+cu, FM[i,j-1]+cu,
      MIN( i<k<j : FM[i,k]+FM[k+1,j] )
    )
free_energy = F[1,n]
Remark. F[i,j] denotes the minimum energy for the subsequence consisting of bases i
through j. C[i,j] is the energy given that i and j pair. The array FM is introduced for
handling multiloops. The energy parameters for all loop types except for multiloops are
formally subsumed in the function Interior(i,j;p,q) denoting the energy contribution
of a loop closed by the two base pairs i-j and p-q. We have assumed that multiloops
have energy contribution F=cc+ci*I+cu*U, where I is the number of interior base pairs
and U is the number of unpaired digits of the loop. The time complexity here is O(n^4). It
is reduced to O(n^3) by restricting the size of interior loops to some constant, say 30.
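The fill-then-backtrack pattern of table 1 is easiest to see on a simpler criterion. The Python sketch below deliberately swaps the thermodynamic energy model for the maximum matching criterion mentioned in the introduction (maximize the number of base pairs); it is therefore not the algorithm of table 1, only an illustration of the same dynamic programming scheme over subsequences of increasing length. The function name, the set of allowed pairs, and the minimum hairpin size of 3 are assumptions.

```python
ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_matching(seq: str, min_loop: int = 3):
    """Return (number of pairs, dot-bracket string) of one maximum matching."""
    n = len(seq)
    if n == 0:
        return 0, ""
    N = [[0] * n for _ in range(n)]           # N[i][j]: best count on seq[i..j]
    for d in range(min_loop + 1, n):          # fill by increasing span d = j - i
        for i in range(n - d):
            j = i + d
            best = N[i][j - 1]                # case: j stays unpaired
            for k in range(i, j - min_loop):  # case: j pairs with k
                if (seq[k], seq[j]) in ALLOWED:
                    left = N[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + N[k + 1][j - 1])
            N[i][j] = best
    structure = ["."] * n                     # backtracking through the array
    todo = [(0, n - 1)]
    while todo:
        i, j = todo.pop()
        if j - i <= min_loop:
            continue
        if N[i][j] == N[i][j - 1]:
            todo.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in ALLOWED:
                left = N[i][k - 1] if k > i else 0
                if left + 1 + N[k + 1][j - 1] == N[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        todo.append((i, k - 1))
                    todo.append((k + 1, j - 1))
                    break
    return N[0][n - 1], "".join(structure)
```

For example, `max_matching("GGGAAACCC")` finds three nested G-C pairs around the AAA hairpin. The energy-based recursion of table 1 replaces the pair count by loop energies and needs the extra arrays C and FM, but the order of evaluation and the backtracking idea are the same.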
The partition function for the ensemble of all possible secondary structures
can be calculated analogously (McCaskill 1990):
\[ Q = \sum_{\text{all structures } S} e^{-\Delta G(S)/RT} \tag{3} \]
A pseudocode is given in table 2.
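Clearly the algorithm of table 2 does not predict a secondary structure,
but one can calculate the probability $P_{kl}$ for the formation of a base pair (k, l):
\[
\begin{aligned}
P_{kl} ={}& \frac{Q_{1,k-1}\,Q^{b}_{kl}\,Q_{l+1,n}}{Q_{1,n}}
 + \sum_{i,j:\; i<k<l<j} P_{ij}\,\frac{Q^{b}_{kl}}{Q^{b}_{ij}}\,E_{\mathrm{Interior}(i,j;k,l)} \\
 &+ \sum_{i,j:\; i<k<l<j} P_{ij}\,\frac{Q^{b}_{kl}}{Q^{b}_{ij}}\,E_{cc}\,E_{ci}
 \left[ E_{cu}^{\,k-i-1}\,Q^{m}_{l+1,j-1} + Q^{m}_{i+1,k-1}\,E_{cu}^{\,j-l-1} + Q^{m}_{i+1,k-1}\,Q^{m}_{l+1,j-1} \right]
\end{aligned}
\tag{4}
\]
where the symbols are defined in the caption of table 2. In this case
the backtracking has time complexity O(n^3), just as the calculation of the
partition function itself.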
Table 2: Pseudocode for the calculation of the partition function.
for(d=1...n)
  for(i=1...n-d)
    j=i+d
    QB[i,j] = EHairpin(i,j) +
      SUM( i<p<q<j : EInterior(i,j;p,q)*QB[p,q] ) +
      SUM( i<k<j : QM[i+1,k-1]*QM1[k,j-1]*Ecc )
    QM[i,j] =
      SUM( i<k<j : (Ecu^(k-i)+QM[i,k-1])*QM1[k,j] )
    QM1[i,j] = SUM( i<k<=j : QB[i,k]*Ecu^(j-k)*Eci )
    Q[i,j] = 1 + QB[i,j] +
      SUM( i<p<q<j : Q[i,p-1]*QB[p,q] )
partition_function = Q[1,n]
Remark. Here Ex := exp(-x/RT) denotes the Boltzmann weight corresponding to the
energy contribution x. Q[i,j] denotes the partition function Q_ij of the subsequence i
through j. The array QB contains the partition function Q^b_ij of the subsequence subject
to the fact that i and j form a base pair. QM and QM1 are used for handling the multiloop
contributions. x^y means x raised to the power y.
For details see McCaskill (1990).
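To show how a sum over all structures can be evaluated without enumerating them, the following Python sketch computes a partition function for a strongly simplified energy model in which every base pair contributes the same energy and loops contribute nothing. This is an assumption made purely for illustration; it is not the loop-based model of table 2 and needs neither QB, QM nor QM1. The function name, the default pair energy, and the minimum hairpin size are likewise hypothetical.

```python
import math

GAS_CONSTANT = 1.98717e-3          # kcal / (mol K)
RT_37 = GAS_CONSTANT * 310.15      # RT at 37 C in kcal/mol

ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def partition_function(seq: str, pair_energy: float = -1.0,
                       min_loop: int = 3, rt: float = RT_37) -> float:
    """Sum of Boltzmann weights over all structures, toy model: E = pairs * pair_energy."""
    n = len(seq)
    if n == 0:
        return 1.0
    weight = math.exp(-pair_energy / rt)      # Boltzmann weight of a single pair
    Q = [[1.0] * n for _ in range(n)]         # Q[i][j] for the subsequence i..j

    def q(i: int, j: int) -> float:
        return Q[i][j] if i <= j else 1.0     # empty interval: only the open chain

    for d in range(1, n):
        for i in range(n - d):
            j = i + d
            total = q(i, j - 1)               # j unpaired
            for k in range(i, j - min_loop):  # j paired with k
                if (seq[k], seq[j]) in ALLOWED:
                    total += q(i, k - 1) * weight * q(k + 1, j - 1)
            Q[i][j] = total
    return Q[0][n - 1]
```

Setting `pair_energy=0.0` makes every weight equal to one, so the function then simply counts the structures compatible with the sequence, which is a convenient sanity check for the recursion.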
Both folding algorithms have been integrated into a single interactive
program including postscript output of the minimum energy structure and
the base pairing probability matrix.
The program requires 6n^2 bytes of memory for the minimum energy
fold and 10n^2 bytes for the calculation of the partition function on machines
with 32 bit integers and single precision floating point numbers. In order to avoid
overflows for longer sequences we rescale the partition function of a
subsequence of length $\ell$ by a factor $\tilde{Q}^{\ell/n}$, where
$\tilde{Q}$ is a rough estimate of the order
of magnitude of the partition functions:
\[ \tilde{Q} = \exp\left( -\frac{-184.3 + 7.27\,(T - 37.)}{RT} \right) \tag{5} \]
The performance of the algorithms reported here is compared in table 3 with
Zuker's (Zuker 1989) more recent program mfold 2.0 (available via anonymous
ftp from nrcbsa.bio.nrc.ca), which computes suboptimal structures
together with the minimum free energy structure. The computation
of the minimum free energy structure including the entire matrix of base
pairing probabilities is considerably faster with the present package (although we
do not provide information on individual suboptimal structures). Secondary
structures are represented by a string of dots and matching parentheses,
where dots symbolize unpaired bases and matching parentheses symbolize
base pairs. An example is seen in the sample session shown in figure 1.
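The dot-and-parenthesis notation maps onto a list of base pairs with a single pass and a stack. The Python sketch below is an illustration only (the package itself is written in C, and the function name is an assumption); positions are 1-based, as in the text.

```python
def parse_dot_bracket(structure: str):
    """Convert a dot-bracket string into a sorted list of 1-based base pairs (i, j)."""
    open_positions, pairs = [], []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((open_positions.pop(), pos))   # innermost open bracket
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)
```

Applied to the 28-nucleotide structure `..(((((..(((...)))..)))))...`, the function returns eight base pairs, the outermost being (3, 25) and the innermost (12, 16). Because brackets are matched innermost first, every string that parses without error automatically satisfies the no-pseudoknot condition.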
Table 3: Performance of implementations of folding algorithms. CPU time
is measured on a SUN SPARC 2 Workstation with 32M RAM.
Data are for random sequences.

           CPU time per folding [s]
    n      RNAfold 1.0    RNAfold 1.0    mfold 2.0
           MFE            MFE+PF
  100          2.0            6.2           24.5
  200         10.9           34.7          129.6
  300         32.2           97.4          354.4
  500         96.6          312.4         1258.3
  690        228.1          743.9         3105.1
Because of the simplifications in the energy model and the uncertainties
in the energy parameters predictions are not always as accurate as one would
like. It is, therefore, desirable to include additional structural information
from phylogenetic or chemical data.
Our minimum free energy algorithm allows a variety of constraints to be
included in the secondary structure prediction by assigning bonus energies
to structures honoring the constraints. One may enforce certain base pairs
or prevent bases from pairing. Additionally, our algorithm can deal with
bases that have to pair with an unknown pairing partner. A sample session
is described in figure 2.
tram> RNAfold -T 42 -pi
Input string (upper or lower case); @ to quit
.........1.........2.........3.........4.........5.........6.........7.........
UUGGAGUACACAACCUGUACACUCUUUC
length = 28
UUGGAGUACACAACCUGUACACUCUUUC
..(((((..(((...)))..)))))...
 minimum free energy = -3.71
..((((([[("...)))..)))))...
 free energy of ensemble = -4.39
 frequency of mfe structure in ensemble 0.337231
[Graphical output: rna.ps (drawing of the secondary structure) and dot.ps (dot plot)]
Figure 1: Interactive example run of RNAfold for a random sequence. When the base
pairing probability matrix is calculated the symbols . , | { } ( ) are used
for bases that are essentially unpaired, weakly paired, strongly paired without
preferred direction, weakly upstream (downstream) paired, and strongly upstream
(downstream) paired, respectively. Apart from the console output the
two postscript files rna.ps and dot.ps are created. The lower left part of dot.ps
shows the minimum energy structure, while the upper right shows the pair
probabilities. The area of the squares is proportional to the binding probability.
3. Parallel Folding Algorithm
We provide an implementation of the folding algorithm for parallel computers. In the following we present a parallelized version of the minimum
energy folding algorithm for message passing systems. Since all subsequences
of equal length can be computed concurrently, it is advisable to compute the
Input string (upper or lower case); @ to quit
.........1.........2.........3.........4.........5.........6.........7.........
CACUACUCCAAGGACCGUAUCUUUCUCAGUGCGACAGUAA
.«•...... « ».. 11
length = 40
CACUACUCCAAGGACCGUAUCUUUCUCAGUGCGACAGUAA
.«««•. ««(. .... »») ... ») ..... ») ..
 minimum free energy = 0.83
a)
CACUACUCCAAGGACCGUAUCUUUCUCAGUGCGACAGUAA
««•.... (((((. »») »» .
 minimum free energy = -1.52
b)
Figure 2: a) Example session of RNAfold -c. The constraints are provided as a string
consisting of dots for bases without constraint, matching pairs of round brackets
for base pairs to be enforced, the symbols '<' and '>' for bases that are paired
upstream and downstream, respectively, and the pipe symbol '|' denoting a paired
base with unknown pairing partner. b) shows the minimum free energy structure
without constraints for comparison.
arrays F, C and FM (defined in table 1) in diagonal order, dividing each
subdiagonal into P pieces. Figure 3 shows an example for 4 processors. Our
algorithm stores the arrays F and FM both as columns and rows, while the C
array is stored in diagonal order. The maximal memory requirement occurs
at d = n/2, where we need n^2/(2P) integers each for F and FM, while the
array C needs only O(n) storage. Since the length of the rows and columns
increases, one needs to reorganize the storage after each diagonal. If one
allocates twice the minimum memory, storage has to be reorganized only once
and the total requirement is the same as for the sequential algorithm. After
completing one subdiagonal each processor has to either send a row to or
receive a column from its right neighbour, and it has to either receive a row
from or send a column to its left neighbour.
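The division of a subdiagonal into P pieces can be sketched as follows. This Python fragment only illustrates one plausible way of assigning the entries (i, i+d) of subdiagonal d to processors in contiguous, nearly equal blocks; the function name and the exact balancing rule are assumptions, and the package's own partitioning may differ in detail.

```python
def split_diagonal(n: int, d: int, num_procs: int):
    """Assign the entries (i, i+d), i = 1..n-d, of subdiagonal d to processors.

    Returns one (first_i, last_i) range per processor, 1-based and inclusive.
    A processor without work gets an empty range with last_i = first_i - 1.
    """
    entries = n - d                          # number of elements on this subdiagonal
    base, extra = divmod(entries, num_procs)
    ranges, start = [], 1
    for proc in range(num_procs):
        size = base + (1 if proc < extra else 0)   # spread the remainder evenly
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

For n = 10 and 4 processors, subdiagonal d = 2 has eight entries and is split into four blocks of two, while d = 3 has seven entries and yields block sizes 2, 2, 2, and 1. Since the subdiagonals shrink as d grows, the block boundaries move from one diagonal to the next, which is why neighbouring processors have to exchange rows and columns after each step.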
Since we do not store the entire matrices, we cannot do the usual backtracking to retrieve the structure corresponding to the minimum energy. Instead, we write for each index pair (i,j) two integers to a file, which identify
Figure 3: Representation of memory usage by the parallel folding algorithm. The triangle
representing the triangular matrices F, C, and FM, respectively, is divided into
sectors with an equal number of diagonal elements, one for each processor. The
computation proceeds from the main diagonal towards the upper right corner.
The information needed by processor 2 in order to calculate the elements of the
dashed diagonal is highlighted. To compute its part of the dashed diagonal
processor 2 needs the horizontally and vertically striped parts of the arrays F
and FM, and the shaded part of the array C. The shaded part does not extend to
the diagonal, because we have restricted the maximal size of interior loops.
the term that actually produced the minimum. The backtracking can then
be done with O(n) readouts. All in all we need O(n) communication and
I/O steps, each transferring O(n) integers, while the computational effort is
O(n^3). The communication overhead therefore becomes negligible for
sufficiently long sequences.
On the Intel iPSC/2 the advantage of storing F and FM in rows and
columns outweighs the I/O overhead for sequences longer than some 200
nucleotides. The efficiency of the parallelization as a function of sequence
length and number of processors can be seen in figure 4.
The partition function algorithm can be parallelized analogously.
[Plot: efficiency versus number of processors (1, 2, 4, 8, 16).]
Figure 4: Performance of the parallel algorithm for random sequences of length 50, 100,
200, 400, 700, and 1000. Efficiency is defined as speedup divided by the
number of processors. The dotted line is 1/n, corresponding to no speedup at
all.
4. Inverse Folding
Inverse folding is relevant for two tasks: (1) to find sequences
that form a predefined structure under the minimum free energy criterion,
and (2) to search for sequences whose Boltzmann ensembles of structures
match a given predefined structure more closely than a given threshold. The
first aspect is directly addressed by the inverse folding algorithm presented
here. The second issue is somewhat more realistic. In the case of compatible
sequences, that is sequences which can form a base pair wherever it is
required in the target structure, we shall often find the desired structure among
suboptimal foldings with free energies close to the minimal value. Matching
base pairing probabilities can be approached by the same inverse folding
heuristic using the base pairing matrices of the partition function algorithm
rather than minimum free energy structures.
Only compatible sequences are considered as candidates in the inverse
folding procedure. Clearly, a compatible sequence can but need not have the
target structure as its minimum free energy structure.
Our basic approach is to modify an initial sequence I_0 so as to minimize
a cost function given by the "structure distance" f(I) = d(S(I), T)
between the structure S(I) of the test sequence I and the target structure
T. A set of possible distance measures is discussed in section 5. The actual
choice of this distance measure is not critical for the performance of the
algorithm.
This procedure requires many evaluations of the cost function and
thereby many executions of the folding algorithm. Instead of running the
optimization directly on the full length sequence, we optimize small
substructures first, proceeding to larger ones as shown in the flow chart (figure 5).
This reduces the probability of getting stuck in a local minimum and, more
importantly, it reduces the number of foldings of full length sequences. This
is possible because substructures contribute additively to the energy. If T is
an optimal structure on the sequence S and contains the base pair (i, j) then
the substructure T_{i,j} must be optimal on the subsequence S_{i,j}. It is likely
then, but by no means necessary, that the converse also holds: a structure
that is optimal for a subsequence will also appear with enhanced probability
as a substructure of the full sequence.
For the actual optimization (denoted by 'FindStructure' in figure 5) we
use the simplest possibility, an adaptive walk. In general, an adaptive walk
will try a random mutation and accept it if and only if the cost function decreases. A
mutation consists in exchanging one base at a position that is unpaired in the
target structure T, or in exchanging two bases, while retaining compatibility,
if their corresponding positions pair in T. If no advantageous mutation can
be found, the procedure stops and restarts with a new initial string
[Flow chart boxes, in order: FIND HAIRPIN; SUBSTRUCTURE := HAIRPIN; ELONGATE
SUBSTRUCTURE BY 1 BP; FIND SUBSEQUENCE; decision (N / Y); ELONGATE SUBSTRUCTURE
TO INCLUDE EXTERIOR BP OF THE MULTILOOP; ADD EXTERNAL BASES TO COMPONENT;
FIND SUBSEQUENCE; decision (N); CONCATENATE COMPONENTS; FIND SEQUENCE TO
FULL STRUCTURE]
Figure 5: Flow chart of the inverse folding algorithm.
I_0. Optionally, we can restrict the search to bases that do not pair correctly.
This slightly increases the probability that no sequence can be found, but
greatly reduces the search space.
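The adaptive walk described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RNAinverse implementation: the folding routine `fold` and the structure distance `distance` are supplied by the caller, the stopping rule (give up after `max_fail` consecutive rejected mutations) is a stand-in for "no advantageous mutation can be found", and the stepwise treatment of substructures shown in figure 5 is omitted.

```python
import random

BASES = "ACGU"
PAIRS = ["AU", "UA", "GC", "CG", "GU", "UG"]   # compatible base pairs

def pair_table(structure: str):
    """table[i] is the 0-based partner of position i, or -1 if i is unpaired."""
    table, open_positions = [-1] * len(structure), []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            partner = open_positions.pop()
            table[pos], table[partner] = partner, pos
    return table

def adaptive_walk(start, target, fold, distance, rng, max_fail=1000):
    """Mutate `start` until fold(sequence) matches `target` or the walk stalls.

    `start` must be compatible with `target`; compatibility is preserved because
    paired positions are always replaced by one of the entries of PAIRS.
    """
    table = pair_table(target)
    seq = list(start)
    cost = distance(fold("".join(seq)), target)
    fails = 0
    while cost > 0 and fails < max_fail:
        trial = seq[:]
        i = rng.randrange(len(seq))
        if table[i] < 0:                       # unpaired in target: change one base
            trial[i] = rng.choice(BASES)
        else:                                  # paired in target: change both bases
            left, right = rng.choice(PAIRS)
            trial[min(i, table[i])], trial[max(i, table[i])] = left, right
        new_cost = distance(fold("".join(trial)), target)
        if new_cost < cost:                    # accept only strict improvements
            seq, cost, fails = trial, new_cost, 0
        else:
            fails += 1
    return "".join(seq), cost
```

A caller would pass a real folding routine for `fold` and one of the distance measures of section 5 for `distance`; a returned cost of zero means that the sequence folds into the target, while a positive cost signals that the walk got stuck and should be restarted from a new initial string.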
Sequences found by the inverse folding algorithm will often allow alternative structures with only slightly higher energies. To find sequences with
[Plot: performance versus chain length (50 to 250) on logarithmic axes; the full
line is T = 40 (n/100)^3.5.]
Figure 6: Performance of inverse folding. Full line: T = 4·10^-6 n^3.5.
clearly defined structures, the partition function algorithm can be used to
maximize the probability
\[ p(T) = \frac{1}{Q} \exp(-\Delta G(T)/RT) \tag{6} \]
of the desired structure in the ensemble. This procedure is much slower, since
the optimization is done with the full length sequence.
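Equation (6) can be checked against the numbers printed in the sample session of figure 1. The Python sketch below assumes that the "free energy of ensemble" reported there equals -RT ln Q, so that 1/Q = exp(F/RT); the function name and the value used for the gas constant are choices made for this illustration.

```python
import math

GAS_CONSTANT = 1.98717e-3   # kcal / (mol K)

def structure_probability(dG_structure: float, ensemble_free_energy: float,
                          temp_celsius: float = 37.0) -> float:
    """Equation (6) with Q = exp(-F/RT): p = exp(-(dG - F) / RT)."""
    rt = GAS_CONSTANT * (temp_celsius + 273.15)
    return math.exp(-(dG_structure - ensemble_free_energy) / rt)

# Values from figure 1: mfe = -3.71, ensemble free energy = -4.39, T = 42 C
p_mfe = structure_probability(-3.71, -4.39, temp_celsius=42.0)
```

The result is close to 0.34, in agreement with the frequency 0.337231 printed in figure 1; the small difference comes from the energies being rounded to two decimals in the program output.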
5. Comparison of Secondary Structures
RNA secondary structures can be represented as trees (Zuker & Sankoff
1984, Shapiro 1988, Shapiro & Zhang 1990). The tree representation was
used, for example, to obtain coarse grained structures revealing the
branching pattern or the relative positions of loops. More recently we proposed a
tree representation at full resolution (Fontana et al. 1993b). A secondary
structure S_k is converted one to one into a tree T_k by assigning an internal
node to each base pair and a leaf node to each unpaired digit (Figure 8). The
tram> RNAinverse -Fm -R -3
Input structure & start string (lower case letters for const positions)
@ to quit, and 0 for random start string
.........1.........2.........3.........4.........5.........6.........7.........
(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
0
length = 76
CUAUACUACGAGGAUAAUCUGCCUUUUGCCAAAGAGGGUGGCAUUUCAUCAGCUCCGAAUGCUGAGGUAUAGCGAA 20
AGCUCUGAUAUCUCUACGAAUAGAUCCUUUAUAUCUCUUAAAGCGUGUCUGGAAGAUAACUCCAGCAGAGCUUGUG 25
UUCUCCUGUAGUCGACUUUAGGACUCGAGGCCGUAUUUGCCUCACGGAAAUGUUUACAAUGCAUUAGGAGGAGUGC 29
A
tram> RNAinverse -Fp -R 3
Input structure & start string (lower case letters for const positions)
@ to quit, and 0 for random start string
.........1.........2.........3.........4.........5.........6.........7.........
(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....
0
length = 76
GCUAGCGUUGGGCUUUUUUUCGCCCUGCCGCAAAACCCGCGGCUUCUCGCUACAUCUCUCGUAGCCGCUAGCAAAA 50
(0.844786)
GCGUUACAAGCGCAAUCCCCCGCGCAGCGUCAAAACCCGACGCCAACAGCUACAAAACCCGUAGCGUAACGCAAAA 55
(0.859519)
GCGCGCCAAGCGCAAAAAAAAGCGCAGCCGCAAAACACGCGGCAAAAAGCGGCAGAAAAAGCCGCGGCGCGCAAAA 49
(0.85046)
B
Figure 7: Sample session of RNAinverse. A: Minimum free energy. B: Partition function.
Data on the natural tRNAphe sequence for comparison: the clover leaf structure
occurs only with a probability of 0.172511, and with even smaller probabilities
for the three sequences found by RNAinverse -Fm (0.0891717, 0.0329602, and
0.0335713, respectively).
conversion starts with a root which does not correspond to a physical unit of
the RNA molecule. It is introduced to prevent the formation of a tree forest
for RNA structures with external elements (for details of the interconversion
of secondary structures and trees see also Fontana et al. 1993b).
As shown in (Fontana et al. 1993b) the trees T_k can be rewritten as
homeomorphically irreducible trees (HITs), see appendix B. The
transformation from the full tree to the HIT retains complete information on the
structure. Secondary structure, full tree and HIT are equivalent.
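The conversion from a dot-bracket string to a weighted tree can be sketched as follows. The Python code below writes the tree in a bracketed string form modeled on the lines labelled h: in the RNAdistance session of figure 9, where U followed by a number stands for a run of unpaired bases, P followed by a number for a stack of base pairs, and R1 for the virtual root. This reading of the notation is an assumption; the authoritative definition of HITs is the one in appendix B, and the function name is hypothetical.

```python
def hit_string(structure: str) -> str:
    """Weighted tree string of a dot-bracket structure, with label written after children."""
    n = len(structure)
    partner, open_positions = [-1] * n, []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            mate = open_positions.pop()
            partner[pos], partner[mate] = mate, pos

    def encode(start: int, end: int) -> str:
        """Encode the elements found at positions start .. end-1."""
        parts, k = [], start
        while k < end:
            if partner[k] < 0:                       # run of unpaired bases
                run = 0
                while k < end and partner[k] < 0:
                    run, k = run + 1, k + 1
                parts.append(f"(U{run})")
            else:                                    # stack of consecutive base pairs
                a, b, size = k, partner[k], 1
                while a + 1 < b - 1 and partner[a + 1] == b - 1:
                    a, b, size = a + 1, b - 1, size + 1
                parts.append("(" + encode(a + 1, b) + f"P{size})")
                k = partner[k] + 1
        return "".join(parts)

    return "(" + encode(0, n) + "R1)"
```

For the structure `((.(((((((.....))))))).))....((..((((.....)))).)).` the function returns `(((U1)((U5)P7)(U1)P2)(U4)((U2)((U5)P4)(U1)P2)(U1)R1)`. The dot-bracket string can be recovered from this form by expanding every weight again, which reflects the statement that no information is lost in the transformation.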
Tree editing induces a metric in the space of trees (see appendix A), and
in the space of RNA secondary structures. A tree is transformed into another
Figure 8: Interconversion of secondary structures and trees. A secondary structure graph
(A) is equivalent to an ordered rooted tree (B). An internal node (black) of the
tree corresponds to a base pair (two nucleotides), a leaf node (white) corresponds
to one unpaired nucleotide, and the root node (black square) is a virtual parent
to the external elements. Contiguous base pair stacks translate into "ropes"
of internal nodes and loops appear as bushes of leaves. Recursively traversing
a tree by first visiting the root, then visiting its subtrees in left to right order,
finally visiting the root again, assigns numbers to the nodes in correspondence to
the 5'-3' positions along the sequence (Internal nodes are assigned two numbers
reflecting the paired positions).
tree by a series of editing operations with predefined costs (Tai 1979, Sankoff
& Kruskal 1983). The distance between two trees d(Tj, Tk) is the smallest
sum of the costs along an editing path. The parameters used in our tree
editing are summarized in appendix A. The editing operations preserve the
relative traversal order of the tree nodes. Tree editing can therefore be viewed
as a generalization of sequence alignment. In fact, for trees that consist solely
of leaves, tree editing becomes the standard sequence alignment. A sample
session computing the tree distance between two arbitrarily chosen structures
is shown in figure 9.
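The limiting case mentioned above, in which tree editing reduces to ordinary sequence alignment, can be written down directly. The Python sketch below computes the edit distance between two strings of leaf labels by the standard dynamic programming scheme; the unit costs for insertion, deletion, and substitution are placeholders chosen for this example, whereas the costs actually used by the package are those listed in appendix A.

```python
def edit_distance(a: str, b: str, indel: float = 1, subst: float = 1) -> float:
    """Minimum total cost of insertions, deletions and substitutions turning a into b."""
    previous = [j * indel for j in range(len(b) + 1)]     # aligning "" with b[:j]
    for i, x in enumerate(a, start=1):
        current = [i * indel]                             # aligning a[:i] with ""
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + indel,                      # delete x
                current[j - 1] + indel,                   # insert y
                previous[j - 1] + (0 if x == y else subst),   # match or substitute
            ))
        previous = current
    return previous[-1]
```

Full tree editing generalizes this scheme by also accounting for internal nodes, so that a base pair is edited as one unit rather than as two independent characters.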
An alternative graphical method for the comparison of RNA secondary
structures (Hogeweg & Hesper 1984, Konings 1989, Konings & Hogeweg
1989) encodes secondary structures as linear strings with balanced parentheses representing the base pairs, and some other symbol coding for unpaired
positions.
Tree representations in full resolution often make it difficult to focus on
the major structural features of RNA molecules since they are overloaded
with irrelevant details. Coarse-grained tree representations were
invented previously to solve this problem (Shapiro 1988).
tram> RNAdistance -DfhvcHWC -B
Input structure; @ to quit
.........1.........2.........3.........4.........5.........6.........7.........
((.(((((((.....))))))).))....((..((((.....)))).)).
.....((((..((((..........)))).)))).....(((....))).
f: 26
(-----(.(((--((((.....-----))))-))).))....((..((((.....)))).)).
-.....(-(((..((((..........)))).)))-)-....--.--(((....-)))-.---
h: 32
(----((U1)((U5-)P7)(U1)P2)(U4)((U2)((U5)P4)(U1)P2)(U1)R1)
((U5)((U2)((U10)P4)(U1)P4)(U5)-----((U4)P3)(U1)-------R1)
v: 34
((((((H5-)S7)I2)S2)((((H5)S4)I3)S2)E5-)R1)
((((((H10)S4)I3)S4)--((H4)S3)------E11)R1)
c: 3
(((((H1)S1)I1)S1)((((H1)S1)I1)S1)R1)
(((((H1)S1)I1)S1)--((H1)S1)------R1)
H: 31
(R1------(P2(U1U1)(P7(U5U5)P7)(U1U1)P2)(U4U4)(P2(U2U2)(P4(U5U5)P4)(U1U1)P2)(U1U1)R1)
(R1(U5U5)(P4(U2U2)(P4(U1U1)P4)(U1U1)P4)(U5U5)---------(P3(U4U4)P3)---------(U1U1)R1)
W: 32
(R1(E5-(S2(I2(S7(H5H5)S7)I2)S2)(S2(I3(S4(H5H5)S4)I3)S2)E5)-R1)
(R1(E11(S4(I3(S4(H1H1)S4)I3)S4)------(S3(H4H4)S3)------E11)R1)
C: 3
(R(S(I(S(HH)S)I)S)(S(I(S(HH)S)I)S)R)
(R(S(I(S(HH)S)I)S)----(S(HH)----S)R)
Figure 9: Interactive sample session of RNAdistance. For this example we have used
two random sequences folded by RNAfold.
Base pairing probability matrices may also be compared by an
alignment-type method (Bonhoeffer et al. 1993). Since a secondary structure
is representable as a string (see figure 9), comparison of structures can
be done by standard string alignment algorithms (e.g. Waterman 1984). This
approach has been generalized to structure ensembles in (Bonhoeffer et al.
1993). We compute for each position i of the sequence the probability to be
upstream paired, downstream paired or unpaired:
\[ p_i^{(} = \sum_{j>i} p_{ij}, \qquad p_i^{)} = \sum_{j<i} p_{ij} \tag{7} \]
The probability that the base at position i is unpaired is
$p_i^{\circ} = 1 - p_i^{(} - p_i^{)}$.
A reasonable definition for the distance of two such vectors, p(S1) and
p(S2), uses again an alignment procedure at the level of the vectors $p^{(}$, $p^{)}$
and $p^{\circ}$. We then define the distance measure for an aligned position (i, j) by
(8)
and δ(i) = 0 for inserted or deleted positions. The distance of two structure
ensembles is given by the minimum total edit costs as in an ordinary string
alignment (the numerical value of this distance is twice the distance measure
defined in Bonhoeffer et al. 1993).
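The three per-position probabilities can be obtained from a base pairing probability matrix by simple row and column sums. The Python sketch below assumes a hypothetical input format, namely an upper triangular matrix P with P[i][j] holding the pairing probability of positions i < j (0-based); it is an illustration, not the routine used in the package.

```python
def pairing_profile(P):
    """Per-position triples (paired with j > i, paired with j < i, unpaired)."""
    n = len(P)
    profile = []
    for i in range(n):
        with_later = sum(P[i][j] for j in range(i + 1, n))   # partners j > i
        with_earlier = sum(P[j][i] for j in range(i))        # partners j < i
        profile.append((with_later, with_earlier, 1.0 - with_later - with_earlier))
    return profile
```

For a four-base example with P[0][3] = 0.8 and P[1][2] = 0.5, the first position gets the triple (0.8, 0, 0.2) and the last position (0, 0.8, 0.2). Two such profiles can then be aligned position by position like ordinary strings, with a cost that depends on how similar the aligned triples are.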
6. Applications
6.1. Long RNA Molecules
16sRNA, yeast.
Chain length: 1542 nucleotides.
Performance: 42 min on an IBM RS6000/550 with 64 Mbyte.
Table 4: Free energies (kcal/mol) of 16sRNA from yeast. The phylogenetic structure
decomposes into six components.

  Bases            MFE         phylogenetic structure
                               "as is"      completed
     1 -  556     -130.63       -77.14       -97.76
   557 -  883      -89.18       -61.31       -70.64
   884 -  920       -6.55        -3.76        -6.55
   921 - 1396     -111.66       -37.68       -68.70
  1397 - 1497      -21.00       -18.15       -18.15
  1498 - 1542      -15.11       -13.91       -14.11
  Σ               -374.13      -211.95      -275.91
     1 - 1542     -379.22      -211.95      -279.48
Rem.ark. The fourth component, bases 921-1396, shows the largest deviations. This
region contains large multiloops. It seems that in general local interactions are predicted
more reliably than long range base pairs.
The minimum free energy structure contains 259 (58%) of the 441 base pairs
predicted by the phylogenetic structure (5 uncommon base pairs and the 5 pairs
forming a pseudo-knot are not counted). The cumulated pairing probability of the bases in the phylogenetic structure is about 54%. The cumulated pairing probability of the bases in the minimum free energy structure is 73%. The phylogenetic structure "as is" amounts to a free energy of
-212 kcal/mol as compared to -379 kcal/mol for the minimum free energy
structure. The difference of more than 300 RT indicates that the phylogenetic structure cannot be complete. We did a minimum free energy folding of
the sequence subject to the base pairs prescribed by the phylogenetic structure using RNAfold -C. One finds -279.5 kcal/mol. This is still far away
from the minimum free energy structure. The omitted interactions due to
the pseudoknot and the uncommon base pairs cannot account for more than
some 20 kcal/mol in the worst case. We are aware of the following possible
explanations for the gap of about 100 RT:
• The energy model for secondary structures is totally wrong.
• 16sRNA is selected for its biochemical function, which is not performed
by the free RNA but by its complex with protein. The phylogenetically determined "structure" may then be completely different from
the minimum free energy structure.
Qβ RNA.
Chain length: 4220 nucleotides.
Performance: 9.5 h on an IBM-RS6000/560 with 256 Mbyte.
Folding of long RNA sequences can be extended up to chain lengths of
about 5000 nucleotides. The entire Qβ genome (Figure 10) was folded as
an example of an entire viral genome. We do not imply that
the real secondary structure of Qβ RNA is identical with its minimum free
energy structure. Kinetic effects are highly important for long sequences.
The secondary structure obtained appears to be partitioned into three
parts: (0-861), (862-3003), and (3004-4220). The middle part contains a large
loop and other gross features which are also revealed by electron microscopy
(Jacobson 1991).
Figure 10: The base pairing probabilities for the entire
Qβ genome (4220 nucleotides).
6.2. Heat Capacity of RNA Secondary Structures
The heat capacity can be obtained from the partition function using
$$C_p = -T \frac{\partial^2 G}{\partial T^2} \qquad \text{and} \qquad G = -RT \ln Q \qquad (9)$$
The numerical differentiation is performed in the following way: We fit
the function F(x) by the least square parabola y = cx^2 + bx + a through the
2m + 1 equally spaced points x0 - mh, x0 - (m-1)h, ..., x0, ..., x0 + mh.
Figure 11: Heat capacity of the tRNA-phe from yeast as a function of temperature [C], calculated using
RNAheat -Tmax 120 -h 0.1 -m 5.
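The second derivative of F is then approximated by F''(x0) = 2c. Explicitly,
we obtain
$$F''(x_0) = \frac{1}{h^2} \sum_{k=-m}^{m} \frac{30\,\bigl(3k^2 - m(m+1)\bigr)}{m\,(4m^2-1)\,(m+1)\,(2m+3)}\; F(x_0 + kh) \qquad (10)$$
where the factor 1/h^2 accounts for the spacing h of the sample points.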
As an example we show the heat capacity of tRNA-phe from yeast (Figure 11).
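The least-squares second derivative can be sketched in a few lines of C. The function name `lsq_second_derivative` and the convention that the sample F(x0 + kh) is stored at index k + m are assumptions made for this illustration; this is not the code used by RNAheat.

```c
#include <assert.h>

/* Second derivative at x0 from a least-squares parabola through 2m+1
 * equally spaced samples. F[k + m] holds F(x0 + k*h) for k = -m..m.
 * The weights 30(3k^2 - m(m+1)) / (m(4m^2-1)(m+1)(2m+3)) give twice the
 * quadratic coefficient per unit step; dividing by h^2 converts the
 * result to a derivative with respect to x.
 */
double lsq_second_derivative(const double *F, int m, double h)
{
    assert(F && m >= 1 && h > 0.0);
    double denom = (double)m * (4.0 * m * m - 1.0) * (m + 1.0) * (2.0 * m + 3.0);
    double sum = 0.0;
    for (int k = -m; k <= m; k++)
        sum += 30.0 * (3.0 * k * k - (double)m * (m + 1)) * F[k + m];
    return sum / (denom * h * h);
}
```

For m = 1 the weights reduce to the familiar three-point stencil (1, -2, 1). To obtain a heat capacity, one would sample G(T) = -RT ln Q at 2m + 1 temperatures around T and multiply the returned value by -T.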
6.3. Landscapes and combinatory maps
Statistical features of the relations between sequences, structures and
other properties of RNA molecules were studied in several previous papers
(Fontana et al. 1991, 1993a,b; Bonhoeffer et al. 1993, Tacker et al. 1993). We
mention here free energy landscapes and structure maps as two representative
examples. Relations between sequences I and derived objects g(I) (which
may be numbers like free energies, or structures, or trees, or anything that
might be relevant in studying RNA molecules) are considered as mappings
from one metric space (the sequence space X) into another metric space (Y),
$$g : X \to Y. \qquad (11)$$
The existence of a distance measure d_g inducing a metric on the set of objects g(I) is assumed. In the trivial case
Y is ℝ. We are then dealing with
a "landscape" which assigns a value to every sequence (a free energy, or an
activation energy, for example). In general both spaces will be of combinatorial complexity and we coined the term "combinatory maps" for these
mappings. In order to facilitate illustration and collection of data for statistical purposes a probability density surface is computed which expresses
the conditional probability that two arbitrarily chosen sequences I_i, I_j of
Hamming distance d_h(i,j) = h form two objects g_i and g_j at a distance
d_g(i,j) = γ:
$$P_g(\gamma \mid h) = \mathrm{Prob}\bigl(d_g(i,j) = \gamma \;\big|\; d_h(i,j) = h\bigr). \qquad (12)$$
In the case of free energies we have γ = |f_j - f_i|; in the case of the structure
density surface (SDS) (Fontana et al. 1993b, Schuster et al. 1993) γ is the tree
distance between the two structures obtained from the two sequences I_i and
I_j. An example of a typical probability density surface of RNA secondary
structures is shown in figure 12. These structure density surfaces may be
used to compute various quantities. Autocorrelation functions, for example,
are obtained by a straightforward calculation (Fontana et al. 1993a,b).
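Estimating such a density surface from sampled sequence pairs amounts to building a histogram and normalizing each Hamming-distance class separately. The following C sketch illustrates this under simplifying assumptions: object distances are taken to be non-negative integers (free energy differences would have to be binned first), and the names `hamming`, `surface_add` and `surface_normalize` are hypothetical helpers, not functions of the package.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hamming distance between two sequences of equal length. */
size_t hamming(const char *a, const char *b)
{
    size_t n = strlen(a), d = 0;
    assert(strlen(b) == n);
    for (size_t i = 0; i < n; i++)
        d += (a[i] != b[i]);
    return d;
}

/* Record one sampled pair with Hamming distance h and object distance g.
 * counts is a row-major table with rows h = 0..hmax, columns g = 0..gmax. */
void surface_add(double *counts, size_t gmax, size_t h, size_t g)
{
    assert(counts && g <= gmax);
    counts[h * (gmax + 1) + g] += 1.0;
}

/* Convert raw counts to conditional frequencies: every row h that holds at
 * least one sample is divided by its row total, so it sums to one. */
void surface_normalize(double *counts, size_t hmax, size_t gmax)
{
    assert(counts);
    for (size_t h = 0; h <= hmax; h++) {
        double *row = counts + h * (gmax + 1);
        double total = 0.0;
        for (size_t g = 0; g <= gmax; g++)
            total += row[g];
        if (total > 0.0)
            for (size_t g = 0; g <= gmax; g++)
                row[g] /= total;
    }
}
```

Normalizing row by row is what makes the table an estimate of the conditional probability given h, independent of how many pairs were sampled in each distance class.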
Figure 12: The structure density surface (SDS) for RNA sequences of length n=100
(upper), plotted over Hamming distance and structure distance. This surface was obtained as follows: (1) choose a random reference
sequence and compute its structure, (2) sample randomly 10 different sequences
in each distance class (Hamming distance 1 to 100) from the reference sequence,
and bin the distances between their structures and the reference structure. This
procedure was repeated for 1000 random reference sequences. Convergence is
remarkably fast; no substantial changes were observed when doubling the number of reference sequences. This procedure conditions the density surface to
sequences with base composition peaked at uniformity, and does, therefore, not
yield information about strongly biased compositions.
Lower part: Contour plot of the SDS.
Figure 13: Schema of the folding of an RNA hybrid.
6.4. RNA Hybridisation
The folding algorithm can be generalized in a straightforward way to
compute the structure resulting from the hybridisation of several RNA
strands. This is done by concatenating the, say, N strands in all possible permutations. Each such permutation is folded almost as usual. The
only difference concerns "loops" or "stacked pairs" which contain end-to-end
junctions of individual strands, and which are treated as the appropriate
external elements. The structure with the least free energy among all optimal structures corresponding to the N! sequence permutations is the optimal
hybrid structure. In Figure 13 we show an example for N = 2.
6.5. Other applications
We mention investigations based on other criteria for RNA folding as, for
example, maximum matching of bases or various kinetic folding algorithms
which can be readily incorporated into the package. RNA melting kinetics
has been studied as well by the statistical techniques mentioned here (Tacker
et al. 1993).
In continuation of previous work (Fontana & Schuster 1987, Fontana et
al. 1989) we started simulation experiments in molecular evolution which
aim at a better understanding of evolutionary optimization processes on a
molecular scale. Studies of this kind are becoming relevant for the design of
efficient experiments in applied molecular evolution (Eigen & Gardiner 1984,
Kauffman 1992).
Acknowledgements
The work reported here was supported financially in part by the Austrian
Fonds zur Förderung der wissenschaftlichen Forschung (Project No. S 5305-PHY and Project No. P 8526-MOB). The Institut für Molekulare Biotechnologie e.V. is financed by the Thüringer Ministerium für Wissenschaft und
Kunst and by the German Bundesministerium für Forschung und Technologie.
The Santa Fe Institute is sponsored by core funding from the
U.S. Department of Energy (ER-FG05-88ER25054) and the National Science Foundation (PHY-8714918). Computer facilities were supplied by the
computer centers of the Technical University and the University of Vienna.
References
Bandelt, H.-J. & A.W.M.Dress. (1992) Canonical Decomposition Theory of
Metrics on Finite Sets. Adv.Math. 92, 47-105.
Beaudry, A.A. & G.F.Joyce. (1992) Directed evolution of an RNA
enzyme. Science 257, 635-641.
Bonhoeffer S., J.S.McCaskill, P.F.Stadler & P. Schuster. (1993)
Temperature dependent RNA landscapes. A study based on partition functions. Eur.Biophys.J. 22, 13-24.
Cech, T.R. & B.L.Bass. (1986) Biological catalysis by RNA.
Ann.Rev.Biochem. 55, 599-629.
Eigen, M. & W.Gardiner. (1984) Evolutionary molecular engineering based
on RNA replication. Pure & Appl.Chem. 56, 967-978.
Eigen, M., R.Winkler-Oswatitsch & A.Dress. (1988) Statistical geometry
in sequence space: A method of quantitative comparative sequence
analysis. Proc.Natl.Acad.Sci. USA 85, 5913-5917.
Ellington, A.D. & J.W.Szostak. (1990) In vitro selection of RNA molecules
that bind to specific ligands. Nature 346, 818-822.
Fontana W. & P.Schuster. (1987) A computer model of evolutionary optimization. Biophys.Chem. 26, 123-147.
Fontana W., W.Schnabl & P.Schuster. (1989) Physical aspects of evolutionary optimization and adaptation. Phys.Rev.A 40, 3301-3321.
Fontana W., T.Griesmacher, W.Schnabl, P.F.Stadler & P.Schuster. (1991)
Statistics of landscapes based on free energies, replication and
degradation rate constants of RNA secondary structures.
Mh.Chem. 122, 795-819.
Fontana W., P.F.Stadler, E.Bornberg-Bauer, T.Griesmacher, I.L.Hofacker,
M.Tacker, P.Tarazona, E.D.Weinberger & P.Schuster. (1993a)
RNA folding and combinatory landscapes. Phys.Rev.E 47,
2083-2099.
Fontana W., D.A.M.Konings, P.F.Stadler & P.Schuster. (1993b)
Statistics of RNA secondary structures. Biopolymers, in press.
Freier, S.M., R.Kierzek, J.A.Jaeger, N.Sugimoto, M.H.Caruthers, T.Neilson
& D.H.Turner. (1986) Improved free-energy parameters for predictions of RNA duplex stability. Proc.Natl.Acad.Sci. USA 83, 9373-9377.
Hofacker I.L., P.Schuster & P.F.Stadler. (1993) Combinatorics of RNA
Secondary Structures. Preprint.
Hogeweg, P. & B.Hesper. (1984) Energy Directed Folding of RNA Sequences. Nucleic Acids Res. 12, 67-74.
Jacobson, A.B. (1991) Secondary structure of Coliphage Qβ RNA. Analysis by electron microscopy. J.Mol.Biol. 221, 557-570.
Jaeger J.A., D.H.Turner & M.Zuker. (1989) Improved predictions of secondary structures for RNA. Proc.Natl.Acad.Sci. USA 86, 7706-7710.
Kauffman, S.A. (1992) Applied molecular evolution. J. Theor.Biol. 157,
1-7.
Konings, D.A.M. (1989) Pattern analysis of RNA secondary structure.
Proefschrift, Rijksuniversiteit te Utrecht.
Konings, D.A.M. & P.Hogeweg. (1989) Pattern analysis of RNA secondary structure. Similarity and consensus of minimal-energy folding. J.Mol.Biol. 207, 597-614.
Martinez, H.M. (1984) An RNA Folding Rule. Nucleic Acids Res. 12, 323-334.
McCaskill, J.S. (1990) The equilibrium partition function and base pair
binding probabilities for RNA secondary structure.
Biopolymers 29, 1105-1119.
Nussinov, R., G.Pieczenik, J.R.Griggs & D.J.Kleitman. (1978) Algorithms
for loop matching. SIAM J.Appl.Math. 35, 68-82.
Nussinov, R. & A.B.Jacobson. (1980) Fast algorithm for predicting the
secondary structure of single-stranded RNA.
Proc.Natl.Acad.Sci. USA 77, 6309-6313.
Sankoff, D. & J.B.Kruskal, eds. (1983) Time warps, string edits and
macromolecules: The theory and practice of sequence comparison. Addison Wesley, London.
Schuster, P., W.Fontana, P.F.Stadler & I. L. Hofacker. (1993) The distribution of RNA structures over sequences. Preprint.
Shapiro, B. (1988) An algorithm for comparing multiple RNA secondary
structures. CABIOS 4, 387-393.
Shapiro, B. & K.Zhang. (1990) Comparing multiple RNA secondary structures using tree comparisons. CABIOS 6, 309-318.
Symons, R.H. (1992) Small catalytic RNAs. Ann.Rev.Biochem. 61,
641-671.
Tacker, M., W.Fontana, P.F.Stadler & P.Schuster. (1993) Statistics of
RNA melting kinetics. Preprint.
Tai, K. (1979) The Tree-to-tree Correction Problem. J.ACM 26, 422-433.
Waterman, M.S. (1978) Secondary structure of single-stranded nucleic
acids. In: Studies in Foundations and Combinatorics. Advances in
Mathematics Suppl. Studies. Vol.1., pp.167-212. Academic Press,
New York.
Waterman, M.S. (1984) General Methods of Sequence Comparison. Bull.Math.Biol. 46, 473-500.
Waterman, M.S. & T.F.Smith. (1978) RNA secondary structure: A complete mathematical analysis. Math.Biosc. 42, 257-266.
Zuker, M. & P.Stiegler. (1981) Optimal computer folding of large RNA
sequences using thermodynamic and auxiliary information. Nucleic
Acids Res. 9, 133-148.
Zuker, M. & D.Sankoff. (1984) RNA secondary structures and their prediction. Bull.Math.Biol. 46, 591-621.
Zuker, M. (1989) On finding all suboptimal foldings of an RNA molecule.
Science 244, 48-52.
Appendix A: Edit Distances and Edit Costs
Let x, y, z be rooted ordered plane trees with labelled and weighted
nodes. The labels are taken from some alphabet A, the weights are non-negative real numbers. Denote the set of all such trees by T_A. (For the
relation of trees and secondary structures see figure 5 and appendix B.)
We define elementary edit costs σ_a for inserting or deleting a vertex with
label a ∈ A and σ_ab = σ_ba for replacing a vertex with label a by a vertex
with label b. We require
$$\begin{aligned}
\sigma_a &> 0 \quad\text{and}\quad \sigma_{aa} = 0 && \forall\, a \in A \\
\sigma_{ab} &> 0 && \forall\, a \neq b \\
\sigma_{ab} &\le \sigma_{ac} + \sigma_{cb} && \forall\, a, b, c \in A
\end{aligned} \qquad (A.1)$$
The last line implies that, if substitution of a by b is allowed, then its cost
is at most the cost of first deleting a and then inserting b.
Denote a vertex with label a ∈ A and weight v ≥ 0 by [a, v]. We then
define the weighted edit costs:
$$\Delta([a,v]) = v\,\sigma_a, \qquad
\Sigma([a,v],[b,w]) = \min(v,w)\,\sigma_{ab} + |v-w| \begin{cases} \sigma_a & \text{if } v \ge w \\ \sigma_b & \text{if } v \le w \end{cases} \qquad (A.2)$$
for insertion/deletion and substitution, respectively. The tree edit distance
d_T(x, y) of any two trees x, y ∈ T_A is defined as the minimum total edit
cost needed to transform x into y, such that the partial order of the tree
remains untouched. For a description of tree editing in general, and of its
implementation as a dynamic programming algorithm see e.g. Tai (1979) and
Shapiro & Zhang (1990). It is easy to see that d_T is a metric on T_A for any alphabet
A.
Any tree x ∈ T_A can be transformed into a linear parenthesized representation by appropriately traversing the tree. (A leaf is represented by both
an open and a closed parenthesis.) This representation is string-like in the
sense that one may interpret each parenthesis as a generalized character ξ of
a string x̃. Each character is defined by its label, its weight, and its "sign",
that is, whether it is an open or a closed parenthesis: ξ = [a, v, p] with a ∈ A,
v ≥ 0, and p = '(' or ')'. It is easy to see that such a generalized string can
be represented as a tree, provided it contains only matching parentheses and
weights and labels coincide for pairs of matching parentheses.
We define the generalized string edit distance d_s(x̃, ỹ) of two generalized
strings x̃ and ỹ as the minimum total edit cost needed to transform x̃ into
ỹ, where the edit costs for each step are given by
$$\Delta([a,v,p]) = \Delta([a,v]), \qquad
\Sigma([a,v,p],[b,w,q]) = \begin{cases} \Sigma([a,v],[b,w]) & \text{if } p = q \\ \infty & \text{if } p \neq q \end{cases} \qquad (A.3)$$
for insertion/deletion and substitution, respectively. That is, open
parentheses may be matched with open parentheses only, and closed parentheses with closed parentheses only.
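The generalized string edit distance can be computed with the usual alignment recursion, using the weighted costs for each step. The C sketch below is only an illustration: the elementary costs (1 for every insertion or deletion, 2 for replacing a label by a different one) are hypothetical values chosen for the example, `GS_INF` is a large finite number standing in for the infinite cost, and the names `GChar` and `gstring_distance` are not part of the package.

```c
#include <assert.h>
#include <stdlib.h>

#define GS_INF 1.0e9  /* stands in for the infinite cost of a sign mismatch */

typedef struct {
    char   label;   /* node label, e.g. 'U' or 'P' */
    double weight;  /* non-negative weight v */
    char   sign;    /* '(' or ')' */
} GChar;

/* Hypothetical elementary costs, for illustration only. */
static double sigma_indel(char a) { (void)a; return 1.0; }
static double sigma_subst(char a, char b) { return a == b ? 0.0 : 2.0; }

/* Weighted insertion/deletion cost: v * sigma_a. */
static double cost_indel(GChar c) { return c.weight * sigma_indel(c.label); }

/* Weighted substitution cost: the common weight is substituted, the excess
 * weight of the heavier character is inserted or deleted. Characters of
 * different sign can never be matched. */
static double cost_subst(GChar x, GChar y)
{
    if (x.sign != y.sign)
        return GS_INF;
    double lo = x.weight < y.weight ? x.weight : y.weight;
    double hi = x.weight < y.weight ? y.weight : x.weight;
    char heavier = x.weight >= y.weight ? x.label : y.label;
    return lo * sigma_subst(x.label, y.label) + (hi - lo) * sigma_indel(heavier);
}

/* Minimum total edit cost between generalized strings x (length n) and
 * y (length m), computed row by row with two rows of the alignment table. */
double gstring_distance(const GChar *x, size_t n, const GChar *y, size_t m)
{
    double *prev = malloc((m + 1) * sizeof *prev);
    double *cur  = malloc((m + 1) * sizeof *cur);
    assert(prev && cur);
    prev[0] = 0.0;
    for (size_t j = 1; j <= m; j++)
        prev[j] = prev[j - 1] + cost_indel(y[j - 1]);
    for (size_t i = 1; i <= n; i++) {
        cur[0] = prev[0] + cost_indel(x[i - 1]);
        for (size_t j = 1; j <= m; j++) {
            double best = prev[j - 1] + cost_subst(x[i - 1], y[j - 1]);
            double del  = prev[j] + cost_indel(x[i - 1]);
            double ins  = cur[j - 1] + cost_indel(y[j - 1]);
            if (del < best) best = del;
            if (ins < best) best = ins;
            cur[j] = best;
        }
        double *tmp = prev; prev = cur; cur = tmp;
    }
    double d = prev[m];
    free(prev);
    free(cur);
    return d;
}
```

With these example costs, two hairpins that differ only in stem length, encoded as (P2 (U3 U3) P2) and (P3 (U3 U3) P3), are at distance 2: each of the two stem characters pays for one unit of excess weight.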
Corresponding to the above two distance measures we have two types of
alignments. In the case of the edit distance for generalized strings the alignment is
straightforward. For the tree edit distance we have to transform the aligned trees
into their linear representation in order to print them. This implies that each
inserted or deleted node is complemented by gap characters in two positions,
corresponding to the left and right bracket, respectively, and that matched
nodes give rise to two matches in the linear representation. (See figure 9 for
examples.) As a consequence, each alignment of two trees x and y can be
written as a valid alignment of the corresponding generalized strings x̃ and ỹ
exhibiting the same edit cost. Hence
$$d_s(\tilde{x}, \tilde{y}) \le d_T(x, y) \quad \text{for all } x, y \in T_A \qquad (A.4)$$
In particular, let x be a subtree of y. Then d_s(x̃, ỹ) = d_T(x, y). As
a consequence there are arbitrarily large trees x and y such that the two
distance measures coincide. The cost matrix we use is:
      0    U    P    H    B    I    M    S    E    R
 0    0    1    2    2    2    2    2    1    1    ∞
 U    1    0    1    ∞    ∞    ∞    ∞    ∞    ∞    ∞
 P    2    1    0    ∞    ∞    ∞    ∞    ∞    ∞    ∞
 H    2    ∞    ∞    0    2    2    2    ∞    ∞    ∞
 B    2    ∞    ∞    2    0    1    2    ∞    ∞    ∞
 I    2    ∞    ∞    2    1    0    2    ∞    ∞    ∞
 M    2    ∞    ∞    2    2    2    0    ∞    ∞    ∞
 S    1    ∞    ∞    ∞    ∞    ∞    ∞    0    ∞    ∞
 E    1    ∞    ∞    ∞    ∞    ∞    ∞    ∞    0    ∞
 R    ∞    ∞    ∞    ∞    ∞    ∞    ∞    ∞    ∞    0
Appendix B: Representation of Secondary Structures
The simplest way of representing a secondary structure is the "parenthesis format", where matching parentheses symbolize base pairs and unpaired
bases are shown as dots. Alternatively, one may use two types of node labels, 'P' for paired and 'U' for unpaired; a dot is then replaced by '(U)',
and each closed bracket is assigned an additional identifier 'P'. In (Fontana
et al. 1993b) a condensed representation of the secondary structure is proposed, the so-called homeomorphically irreducible tree (HIT) representation.
Here a stack is represented as a single pair of matching brackets labelled 'P'
and weighted by the number of base pairs. Correspondingly, a contiguous
stretch of unpaired bases is shown as a pair of matching brackets labelled 'U'
and weighted by its length.
Bruce Shapiro (1988) proposed another representation, which, however,
does not retain the full information of the secondary structure. He represents
the different structure elements by single matching brackets and labels them
as 'H' (hairpin loop), 'I' (interior loop), 'B' (bulge), 'M' (multi-loop), and 'S'
(stack). We extend his alphabet by an extra letter for external elements 'E'.
Again the corresponding trees may be weighted by the number of unpaired
bases or base pairs constituting them. An even more coarse grained representation considers the branching structure only. It is obtained by considering
the hairpins and multiloops only (Hofacker et al. 1993). All tree representations (except for the condensed dot-bracket form) can be encapsulated into
a virtual root (labeled 'R'), see also the example session in figure 6.
In order to discriminate the alignments produced by tree-editing from
the alignment obtained by generalized string editing we put the label on both
sides in the latter case.

a) .((((..(((...)))..((..)))).)).
   (U)(((((U)(U)((((U)(U)(U)P)P)P)(U)(U)(((U)(U)P)P)P)P)(U)P)P)(U)
b) (U)(((U2)((U3)P3)(U2)((U2)P2)P2)(U)P2)(U)
c) (((H)(H)M)B)
   ((((((H)S)((H)S)M)S)B)S)
   (((((((H)S)((H)S)M)S)B)S)E)
d) (((((((H3)S3)((H2)S2)M4)S2)B1)S2)E2)
e) ((H)(H)M)

a) (UU)(P(P(P(P(UU)(UU)(P(P(P(UU)(UU)(UU)P)P)P)(UU)(UU)(P(P(UU)(U ...
b) (UU)(P2(P2(U2U2)(P3(U3U3)P3)(U2U2)(P2(U2U2)P2)P2)(UU)P2)(UU)
c) (B(M(HH)(HH)M)B)
   (S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)
   (E(S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)E)
d) (E2(S2(B1(S2(M4(S3(H3)S3)((H2)S2)M4)S2)B1)S2)E2)
e) (M(HH)(HH)M)

Figure B.1: Linear representations of secondary structures used by the Vienna RNA
package.
Above: Tree representations of secondary structures.
a) Full structure: the
first line shows the more convenient condensed notation which is used by our
programs; the second line shows the rather clumsy expanded notation for completeness, b) HIT structure, c) different versions of coarse grained structures:
the second line is exactly Shapiro's representation, the first line is obtained by
neglecting the stems. Since each loop is closed by a unique stem, these two lines
are equivalent. The third line is an extension taking into account also the external digits.
d) weighted coarse structure, e) branching structure.
Below: The corresponding trees in the notation used for the output of the string-type alignments.
Virtual roots are not shown here. The program RNAdistance accepts any of
the above tree representations; the string-type representations occur on output
only.
Appendix C: List of Executables
RNAfold [-p[0]] [-T temp] [-4] [-noGU] [-e e_set] [-C]
RNAdistance [-D[fhwcFHWC]] [-X[p|m|f|c]] [-S] [-B [file]]
RNApdist [-Xpmfc] [-B [file]] [-T temp] [-4] [-noGU] [-e e_set]
RNAheat [-Tmin t1] [-Tmax t2] [-h stepsize] [-m ipoints] [-4] [-noGU] [-e e_set]
RNAinverse [-F[mp]] [-a ALPHABET] [-R [repeats]] [-T temp] [-4] [-noGU] [-e e_set]
RNAeval [-T temp] [-4] [-e e_set]
Appendix D: Availability of Vienna RNA Package
The Vienna RNA package is public domain software. It can be obtained by
anonymous ftp from the server ftp.itc.univie.ac.at, directory /pub/RNA
as ViennaRNA-1.03.tar.Z. The package is written in ANSI C and is known
to compile on SUN SPARC Stations, IBM-RS/6000, HP-720, and Silicon
Graphics Indigo. For comments and bug reports please send e-mail messages
to ivo@itc.univie.ac.at.
The parallelized version of the minimum free energy folding is not part
of the package. It can be obtained upon request from the authors.
Vienna RNA Package
A Library for folding and comparing RNA secondary structures
Copyright ©1993 Ivo Hofacker and Peter Stadler
1. Folding Routines
1.1 Minimum free Energy Folding
The library provides a fast dynamic programming minimum free energy folding algorithm as
described by M. Zuker, P. Stiegler (1981) (see Chapter 5 [References], page 13). The energy
parameters are taken from Turner et al. (1988) and Jaeger et al. (1989). If you want to try other
energy parameters look at energy_par.h in the source files. Associated functions are:
• float fold(char *sequence, char *structure)
folds the sequence and returns the minimum free energy in kcal/mol; structure contains
the mfe structure in bracket notation (see Section 2.1 [notations], page 5). The structure
must be at least strlen(sequence) chars long. If fold_constrained (see Section 1.4 [Fold
Params], page 3) is 1, the structure string is interpreted on input as a list of constraints for
the folding. The characters "| x < >" mark bases that are paired, unpaired, paired upstream,
or downstream, respectively; matching brackets "( )" denote base pairs, dots "." are used for
unconstrained bases.
• float energy_of_struct(char *sequence, char *structure)
calculates the energy of structure on the sequence
• void initialize_fold(int length)
allocates memory for folding sequences not longer than length; sets up pairing matrix and
energy parameters. Has to be called before the first call to fold ().
• void free_arrays (void)
frees the memory allocated by initialize_fold.
• void update_fold_params(void)
call this to recalculate the pair matrix and energy parameters after a change in folding parameters like temperature (see Section 1.4 [Fold Params], page 3).
Prototypes for these functions are declared in fold.h.
1.2 Partition Function Folding
Instead of the minimum free energy structure the partition function of all possible structures and
from that the pairing probability for every possible pair can be calculated, using again a dynamic
programming algorithm described by J.S. McCaskill (1990) (see Chapter 5 [References], page 13).
The following functions are provided.
• float pf_fold(char *sequence, char *structure)
calculates the partition function of sequence and returns the free energy of the ensemble
in kcal/mol. If structure is not a NULL pointer on input, it contains on return a string
consisting of the letters ". , | { } ( )" denoting bases that are essentially unpaired, weakly
paired, strongly paired without preference, weakly upstream (downstream) paired, or strongly
up- (down-)stream paired bases, respectively. Unless do_backtrack has been set to 0 (see
Section 1.4 [Fold Params], page 3), pr[iindx[i]-j] contains the probability that bases i and
j pair.
• void init_pf_fold(int length)
allocates memory for folding sequences not longer than length; sets up pairing matrix and
energy parameters. Has to be called before the first call to pf_fold().
• void free_pf_arrays(void)
frees the memory allocated by init_pf_fold().
• void update_pf_params(int length)
Call this function to recalculate the pair matrix and energy parameters after a change in folding
parameters like temperature (see Section 1.4 [Fold Params], page 3).
Constrained folding has not been implemented for the partition function fold. Prototypes for
these functions are declared in part_func.h.
1.3 Inverse Folding
We provide two functions that search for sequences with a given structure, thereby inverting
the folding routines.
• float inverse_fold(char *start, char *target)
searches for a sequence with minimum free energy structure target, starting with sequence
start. It returns 0 if the search was successful, otherwise a structure distance to target is
returned. The found sequence is returned in start. If give_up is set to 1, the function will
return as soon as it is clear that the search will be unsuccessful; this speeds up the algorithm
if you are only interested in exact solutions.
• float inverse_pf_fold(char *start, char *target)
searches for a sequence with maximum probability to fold into structure target using the
partition function algorithm. It returns -kT log(p) where p is the frequency of target in the
ensemble of possible structures. This is usually much slower than inverse_fold().
The global variable char symbolset[] contains the allowed bases.
Prototypes for these functions are declared in inverse.h.
1.4 Global Variables for the Folding Routines
The following global variables change the behaviour of the folding algorithms or contain additional
information after folding.
• int noGU
do not allow GU pairs if equal to 1; default is 0.
• int no_closingGU
if 1 allow GU pairs only inside stacks, not as closing pairs; default is 0.
• int tetra_loop
include special stabilising energies for some tetra loops; default is 1.
• int energy_set
if 1 or 2: fold sequences from an artificial alphabet ABCD..., where A pairs B, C pairs D, etc.
using either GC (1) or AU parameters (2); default is 0.
• float temperature
rescale energy parameters to a temperature of temperature C. Default is 37C. You have to
call the corresponding update function (update_fold_params() or update_pf_params()) after changing this parameter.
• struct bond { int i, j; } *base_pair
Contains a list of base pairs after a call to fold(). base_pair[0].i contains the total number
of pairs.
• float *pr
contains the base pair probability matrix after a call to pf_fold().
• int *iindx
index array to move through pr. The probability for bases i and j to form a pair is in pr[iindx[i]-j].
• float pf_scale
a scaling factor used by pf_fold() to avoid overflows. Should be set to exp((F/kT)/length)
where F is an estimate for the ensemble free energy, for example the minimum free energy.
You must call update_pf_params() or init_pf_fold() after changing this parameter. If pf_scale
is -1 (the default), an estimate will be provided automatically when calling init_pf_fold()
or update_pf_params().
• int fold_constrained
calculate constrained minimum free energy structures. See Section 1.1 [mfe Fold], page 1, for
more information. Default is 0.
• int do_backtrack
if 0, do not calculate pair probabilities in pf_fold(); this is about twice as fast. Default is 1.
• char backtrack_type
only for use by inverse_fold(); 'C': force (1,N) to be paired, 'M': fold as if the sequence were
inside a multi-loop. Otherwise the usual mfe structure is computed.
Include fold_vars.h if you want to change any of these variables from their defaults.
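The access pattern pr[iindx[i]-j] implies that the upper triangle of the pair probability matrix is stored in a linear array. The manual does not spell out the layout, so the following C sketch shows only one triangular layout that is consistent with this access pattern; the formula and the names `make_iindx` and `pr_size` are assumptions made for illustration. In a program linked against the library, iindx is provided by the library and need not be constructed by hand.

```c
#include <assert.h>
#include <stdlib.h>

/* Builds an index array such that iindx[i] - j is a distinct position in a
 * linear array for every pair 1 <= i <= j <= length. Positions are 1-based.
 * This is an assumed layout, not necessarily the one used by the package. */
int *make_iindx(int length)
{
    assert(length >= 1);
    int *iindx = malloc((size_t)(length + 1) * sizeof *iindx);
    assert(iindx);
    iindx[0] = 0;  /* unused */
    for (int i = 1; i <= length; i++)
        iindx[i] = ((length + 1 - i) * (length - i)) / 2 + length + 1;
    return iindx;
}

/* Number of array elements needed under this layout (element 0 is unused). */
size_t pr_size(int length)
{
    assert(length >= 1);
    return (size_t)length * (size_t)(length + 1) / 2 + 1;
}
```

Under this layout row i occupies a contiguous block, and the block of row i + 1 ends directly below the block of row i, so no two pairs share an element.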
2. Parsing and Comparing of Structures
2.1 Representations of Secondary Structures
The simplest way to represent a secondary structure is the "bracket notation", where matching brackets symbolise base pairs and unpaired bases are shown as dots. Alternatively, one may
use two types of node labels, 'P' for paired and 'U' for unpaired; a dot is then replaced by '(U)',
and each closed bracket is assigned an additional identifier 'P'. We call this the expanded notation.
In (Fontana et al. 1993) a condensed representation of the secondary structure is proposed, the
so-called homeomorphically irreducible tree (HIT) representation. Here a stack is represented as
a single pair of matching brackets labelled 'P' and weighted by the number of base pairs. Correspondingly, a contiguous stretch of unpaired bases is shown as one pair of matching brackets labelled
'U' and weighted by its length. Generally any string consisting of matching brackets and identifiers
is equivalent to a plane tree with as many different types of nodes as there are identifiers.
Bruce Shapiro (1988) proposed another representation, which, however, does not retain the full
information of the secondary structure. He represents the different structure elements by single
matching brackets and labels them as 'H' (hairpin loop), 'I' (interior loop), 'B' (bulge), 'M' (multi-loop), and 'S' (stack). We extend his alphabet by an extra letter for external elements 'E'. Again
these identifiers may be followed by a weight corresponding to the number of unpaired bases or
base pairs in the structure element. All tree representations (except for the dot-bracket form) can
be encapsulated into a virtual root (labeled 'R'), see the example below.
The following example illustrates the different linear tree representations used by the package.
All lines show the same secondary structure.
a) .((((..(((...)))..((..)))).)).
   (U)(((((U)(U)((((U)(U)(U)P)P)P)(U)(U)(((U)(U)P)P)P)P)(U)P)P)(U)
b) (U)(((U2)((U3)P3)(U2)((U2)P2)P2)(U)P2)(U)
c) (((H)(H)M)B)
   ((((((H)S)((H)S)M)S)B)S)
   (((((((H)S)((H)S)M)S)B)S)E)
d) ((((((((H3)S3)((H2)S2)M4)S2)B1)S2)E2)R)
Above: Tree representations of secondary structures. a) Full structure: the first line shows the
more convenient condensed notation which is used by our programs; the second line shows the
rather clumsy expanded notation for completeness, b) HIT structure, c) different versions of coarse
grained structures: the second line is exactly Shapiro's representation, the first line is obtained by
neglecting the stems. Since each loop is closed by a unique stem, these two lines are equivalent. The
third line is an extension taking into account also the external digits. d) weighted coarse structure,
this time including the virtual root.
For the output of aligned structures from string editing, different representations are needed,
where we put the label on both sides. The above examples for tree representations would then look
like:
a) (UU)(P(P(P(P(UU)(UU)(P(P(P(UU)(UU)(UU)P)P)P)(UU)(UU)(P(P(UU)(U ...
b) (UU)(P2(P2(U2U2)(P3(U3U3)P3)(U2U2)(P2(U2U2)P2)P2)(UU)P2)(UU)
c) (B(M(HH)(HH)M)B)
   (S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)
   (E(S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)E)
d) (R(E2(S2(B1(S2(M4(S3(H3)S3)((H2)S2)M4)S2)B1)S2)E2)R)
Aligned structures additionally contain the gap character '_'.
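The step from bracket notation to the expanded notation is a character-by-character substitution, which the following C sketch illustrates. It is not the library's expand_Full(): the name `expand_sketch` is hypothetical, and unlike the library routine described in the next section it does not add a virtual root.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Converts a dot-bracket string to the expanded notation:
 *   '.' becomes "(U)", '(' stays "(", ')' becomes "P)".
 * Returns a newly allocated string that the caller must free, or NULL if
 * the input contains unbalanced brackets or other characters. */
char *expand_sketch(const char *db)
{
    size_t n = strlen(db);
    char *out = malloc(3 * n + 1);  /* a dot grows to three characters */
    size_t k = 0;
    long depth = 0;
    assert(out);
    for (size_t i = 0; i < n; i++) {
        switch (db[i]) {
        case '.':
            memcpy(out + k, "(U)", 3);
            k += 3;
            break;
        case '(':
            out[k++] = '(';
            depth++;
            break;
        case ')':
            if (depth == 0) { free(out); return NULL; }
            out[k++] = 'P';
            out[k++] = ')';
            depth--;
            break;
        default:
            free(out);
            return NULL;
        }
    }
    if (depth != 0) { free(out); return NULL; }
    out[k] = '\0';
    return out;
}
```

The HIT and coarse grained notations cannot be produced by such a local substitution, since they require grouping adjacent pairs into stacks and classifying the loops they enclose.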
2.2 Parsing and Coarse Graining of Structures
Several functions are provided for parsing structures and converting to different representations.
• char *expand_Full(char *full)
converts the full structure from bracket notation to the expanded notation including root.
• char *b2HIT(char *full)
converts the full structure from bracket notation to the HIT notation including root.
• char *b2C(char *full)
converts the full structure from bracket notation to a coarse grained notation using the
'H' 'B' 'I' 'M' and 'R' identifiers.
• char *b2Shapiro(char *full)
converts the full structure from bracket notation to the weighted coarse grained notation
using the 'H' 'B' 'I' 'M' 'S' 'E' and 'R' identifiers.
• char *expand_Shapiro(char *coarse)
inserts missing 'S' identifiers in unweighted coarse grained structures as obtained from b2C().
• char *add_root(char *any)
adds a root to an un-rooted tree in any except bracket notation.
• char *unexpand_Full(char *expanded)
restores the bracket notation from an expanded full or HIT tree, that is any tree using only
identifiers 'U' 'P' and 'R'.
• char *unweight(char *expanded)
strips weights from any weighted tree.
All the above functions allocate memory for the strings they return.
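To make the notion of a weighted tree concrete, the following sketch removes the numeric weights from a tree string, which is the effect described for unweight(). The function sketch_unweight is a hypothetical stand-in written for this illustration, not the library routine. It assumes that weights are the only digits that occur in a tree string, as in the examples of Section 2.1, and like the functions above it returns newly allocated memory.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not the library's unweight): copies a tree string
   while dropping every decimal digit, i.e. every weight.
   Returns a newly allocated string that the caller must free. */
char *sketch_unweight(const char *tree)
{
    assert(tree != NULL);
    char *out = malloc(strlen(tree) + 1);   /* result is never longer than input */
    if (out == NULL) return NULL;
    size_t k = 0;
    for (const char *p = tree; *p != '\0'; p++)
        if (!isdigit((unsigned char)*p))
            out[k++] = *p;
    out[k] = '\0';
    return out;
}
```

For the weighted coarse grained structure d) of Section 2.1 this yields the same tree with plain 'H' 'S' 'M' 'B' 'E' and 'R' labels.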
• void unexpand_aligned_F(char *align[2])
converts two aligned structures in expanded notation, as produced by the tree_edit_distance()
function, back to bracket notation with '_' as the gap character. The result overwrites the
input.
• void parse_structure(char *full)
collects statistics on the structure elements of the full structure in bracket notation, writing to
the following global variables:
int loop_size[]
contains a list of all loop sizes. loop_size[0] contains the number of external bases.
int loop_degree[]
contains the corresponding list of loop degrees.
int helix_size[]
contains a list of all stack sizes.
int loops
contains the number of loops (and therefore of stacks).
int pairs
contains the number of base pairs in the last parsed structure.
int unpaired
contains the number of unpaired bases.
Prototypes for the above functions can be found in RNAstruct.h.
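The kind of statistics gathered by parse_structure() can be illustrated with a small self-contained sketch. It is not the library code: the struct StructStats and the function sketch_stats are hypothetical, results are returned through a struct rather than through global variables, and only the number of pairs, unpaired bases and stacks is computed. A stack is counted here as a maximal run of directly nested base pairs, which is an assumption about the intended definition.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int pairs;     /* number of base pairs                      */
    int unpaired;  /* number of unpaired bases                  */
    int stacks;    /* number of maximal runs of nested pairs    */
} StructStats;

/* Hypothetical helper: fills *st from a bracket-notation string.
   Returns 0 on success, -1 for unbalanced or invalid input. */
int sketch_stats(const char *s, StructStats *st)
{
    assert(s != NULL && st != NULL);
    int n = (int)strlen(s);
    int *partner = malloc((size_t)(n + 1) * sizeof *partner);
    int *open = malloc((size_t)(n + 1) * sizeof *open);
    int top = 0;
    int ok = (partner != NULL && open != NULL);

    st->pairs = st->unpaired = st->stacks = 0;
    /* first pass: match brackets with a stack of open positions */
    for (int i = 0; ok && i < n; i++) {
        partner[i] = -1;
        if (s[i] == '(') {
            open[top++] = i;
        } else if (s[i] == ')') {
            if (top == 0) { ok = 0; break; }
            int j = open[--top];
            partner[i] = j;
            partner[j] = i;
            st->pairs++;
        } else if (s[i] == '.') {
            st->unpaired++;
        } else {
            ok = 0;
        }
    }
    if (top != 0) ok = 0;   /* unmatched opening bracket */

    /* second pass: pair (i,j) continues a stack if (i-1,j+1) is also a pair */
    for (int i = 0; ok && i < n; i++) {
        if (s[i] != '(') continue;
        int j = partner[i];
        int stacked = (i > 0 && s[i - 1] == '(' && partner[i - 1] == j + 1);
        if (!stacked) st->stacks++;
    }
    free(partner);
    free(open);
    return ok ? 0 : -1;
}
```

For the 30 base example structure of Section 2.1 the sketch finds 9 pairs, 12 unpaired bases and 4 stacks, in agreement with the four 'S' nodes of the weighted coarse grained tree.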
2.3 Distance Measures
We can now define distances between structures as edit distances between trees or their string
representations. Given a set of edit operations and edit costs, the edit distance is given by the
minimum sum of the costs along an edit path converting one object into the other. The edit
operations used by us are insertion, deletion and replacement of nodes. String editing does not pay
attention to the matching of brackets, while in tree editing matching brackets represent a single
node of the tree. Tree editing is therefore usually preferable. String edit distances are always
smaller than or equal to tree edit distances.
For full structures we use a cost of 1 for deletion or insertion of an unpaired base and 2 for a
base pair. Replacing an unpaired base with a pair incurs a cost of 1.
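The definition of an edit distance as the minimum total cost of an edit path translates directly into a dynamic programming recursion. The sketch below shows this for plain strings with unit costs for insertion, deletion and replacement (the classical Levenshtein distance). It is meant only to illustrate the principle; it is not the implementation behind string_edit_distance(), it does not use the cost scheme described above, and the name sketch_string_edit_distance is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Hypothetical helper: unit-cost string edit distance between a and b.
   Uses two rows of the dynamic programming table.
   Returns -1 if memory cannot be allocated. */
int sketch_string_edit_distance(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t n = strlen(a), m = strlen(b);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *cur = malloc((m + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) { free(prev); free(cur); return -1; }

    for (size_t j = 0; j <= m; j++)
        prev[j] = (int)j;                       /* insert j characters */
    for (size_t i = 1; i <= n; i++) {
        cur[0] = (int)i;                        /* delete i characters */
        for (size_t j = 1; j <= m; j++) {
            int replace = prev[j - 1] + (a[i - 1] != b[j - 1]);
            cur[j] = min3(prev[j] + 1,          /* deletion    */
                          cur[j - 1] + 1,       /* insertion   */
                          replace);             /* replacement */
        }
        int *tmp = prev; prev = cur; cur = tmp; /* reuse the two rows */
    }
    int d = prev[m];
    free(prev);
    free(cur);
    return d;
}
```

Because this recursion treats the two brackets of a pair as independent characters, it shows why string editing ignores the matching of brackets, whereas tree editing handles each pair as a single node.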
Two cost matrices are provided for coarse grained structures:
/*  Null,   H,   B,   I,   M,   S,   E    */
   {   0,   2,   2,   2,   2,   1,   1},   /* Null replaced */
   {   2,   0,   2,   2,   2, INF, INF},   /* H    replaced */
   {   2,   2,   0,   1,   2, INF, INF},   /* B    replaced */
   {   2,   2,   1,   0,   2, INF, INF},   /* I    replaced */
   {   2,   2,   2,   2,   0, INF, INF},   /* M    replaced */
   {   1, INF, INF, INF, INF,   0, INF},   /* S    replaced */
   {   1, INF, INF, INF, INF, INF,   0},   /* E    replaced */

/*  Null,   H,   B,   I,   M,   S,   E    */
   {   0, 100,   5,   5,  75,   5,   5},   /* Null replaced */
   { 100,   0,   8,   8,   8, INF, INF},   /* H    replaced */
   {   5,   8,   0,   3,   8, INF, INF},   /* B    replaced */
   {   5,   8,   3,   0,   8, INF, INF},   /* I    replaced */
   {  75,   8,   8,   8,   0, INF, INF},   /* M    replaced */
   {   5, INF, INF, INF, INF,   0, INF},   /* S    replaced */
   {   5, INF, INF, INF, INF, INF,   0},   /* E    replaced */
The lower matrix uses the costs given in Shapiro (1990). All distance functions use the following
global variables:
• int cost_matrix
if 0, use the default cost matrix (upper matrix in example); otherwise use Shapiro's costs (lower
matrix).
• int edit_backtrack
if set to 1, produce an alignment of the two structures being compared by tracing the editing path
giving the minimum distance.
• char *aligned_line[2]
contains the two aligned structures after a call to one of the distance functions with
edit_backtrack set to 1. See Section 2.1 [notations], page 5, for details on the representation
of structures.
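How a cost matrix for coarse grained structures might be consulted is sketched below. The table values are those of the two matrices given above; everything else is an assumption made for illustration: the names NODE_*, SKETCH_INF and sketch_edit_cost are hypothetical, the numeric value chosen for INF is arbitrary, and the reading "row = node type being replaced, column = its replacement, Null = gap" is inferred from the row comments.

```c
#include <assert.h>

/* Node types in the order used by the rows and columns of the matrices. */
enum { NODE_NULL, NODE_H, NODE_B, NODE_I, NODE_M, NODE_S, NODE_E, NODE_TYPES };

#define SKETCH_INF 10000   /* assumption: any value larger than all real costs */
#define X SKETCH_INF

static const int sketch_costs[2][NODE_TYPES][NODE_TYPES] = {
    {   /* default costs (upper matrix) */
        {   0,   2,   2,   2,   2,   1,   1 },   /* Null replaced */
        {   2,   0,   2,   2,   2,   X,   X },   /* H    replaced */
        {   2,   2,   0,   1,   2,   X,   X },   /* B    replaced */
        {   2,   2,   1,   0,   2,   X,   X },   /* I    replaced */
        {   2,   2,   2,   2,   0,   X,   X },   /* M    replaced */
        {   1,   X,   X,   X,   X,   0,   X },   /* S    replaced */
        {   1,   X,   X,   X,   X,   X,   0 }    /* E    replaced */
    },
    {   /* Shapiro's costs (lower matrix) */
        {   0, 100,   5,   5,  75,   5,   5 },   /* Null replaced */
        { 100,   0,   8,   8,   8,   X,   X },   /* H    replaced */
        {   5,   8,   0,   3,   8,   X,   X },   /* B    replaced */
        {   5,   8,   3,   0,   8,   X,   X },   /* I    replaced */
        {  75,   8,   8,   8,   0,   X,   X },   /* M    replaced */
        {   5,   X,   X,   X,   X,   0,   X },   /* S    replaced */
        {   5,   X,   X,   X,   X,   X,   0 }    /* E    replaced */
    }
};
#undef X

/* Hypothetical lookup: which_matrix == 0 selects the default costs, any
   other value selects Shapiro's costs, mirroring the cost_matrix variable. */
int sketch_edit_cost(int which_matrix, int from, int to)
{
    assert(from >= 0 && from < NODE_TYPES);
    assert(to >= 0 && to < NODE_TYPES);
    return sketch_costs[which_matrix != 0][from][to];
}
```

With this reading, deleting a hairpin costs sketch_edit_cost(0, NODE_H, NODE_NULL), and the INF entries forbid replacing a stem or an exterior node by a loop.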
2.3.1 Functions for Tree Edit Distances
• Tree *make_tree(char *xstruc)
constructs a Tree (essentially the postorder list) of the structure xstruc, for use in
tree_edit_distance(). xstruc may be any rooted structure representation.
• float tree_edit_distance(Tree *T1, Tree *T2)
calculates the edit distance of the two trees T1 and T2.
• void free_tree (Tree *t)
frees the memory allocated for t.
Prototypes for the above functions can be found in treedist.h.
2.3.2 Functions for String Alignment
• swString *Make_swString(char *xstruc)
converts the structure xstruc into a format suitable for string_edit_distance().
• float string_edit_distance(swString *T1, swString *T2)
calculates the string edit distance of T1 and T2.
Prototypes for the above functions can be found in stringdist.h.
2.3.3 Functions for Comparison of Base Pair Probabilities
• float **Make_bp_profile(int length)
reads the base pair probability matrix pr (see Section 1.4 [Fold Params], page 3) and calculates
a profile, i.e. the probabilities of being unpaired, upstream, or downstream paired, respectively.
The returned array is suitable for profile_edit_distance().
• float profile_edit_distance(float **T1, float **T2)
calculates an alignment distance of the two profiles T1 and T2.
• void free_profile(float **T)
frees the memory allocated for the profile T.
Prototypes for the above functions can be found in profiledist.h.
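The idea of a base pair profile can be made concrete with a short sketch that reduces a matrix of pair probabilities to three numbers per position. This is an illustration under stated assumptions, not the code of Make_bp_profile(): the type ProfileColumn and the function sketch_profile are hypothetical, and the probabilities are passed as a flat row-major n x n array of doubles in which only entries with i < j are used, whereas the library reads them from the global array pr. The field names describe on which side the pairing partner lies and do not claim to match the library's use of "upstream" and "downstream".

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    double unpaired;             /* probability that the base is unpaired      */
    double with_3prime_partner;  /* paired with a base at a higher index       */
    double with_5prime_partner;  /* paired with a base at a lower index        */
} ProfileColumn;

/* Hypothetical helper: p[i*n + j] (i < j) is the probability of pair (i,j).
   Returns a newly allocated array of n columns, or NULL on failure. */
ProfileColumn *sketch_profile(const double *p, int n)
{
    assert(p != NULL && n > 0);
    ProfileColumn *prof = calloc((size_t)n, sizeof *prof);
    if (prof == NULL) return NULL;

    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double pij = p[(size_t)i * (size_t)n + (size_t)j];
            prof[i].with_3prime_partner += pij;  /* i opens the pair  */
            prof[j].with_5prime_partner += pij;  /* j closes the pair */
        }
    }
    /* whatever probability is left over belongs to the unpaired state */
    for (int i = 0; i < n; i++)
        prof[i].unpaired = 1.0 - prof[i].with_3prime_partner
                               - prof[i].with_5prime_partner;
    return prof;
}
```

Two such profiles can then be aligned position by position, which is the task described for profile_edit_distance().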
3. Utilities
The following utilities are used and therefore provided by the library:
• void PS_dot_plot(char *sequence, char *filename)
reads base pair probabilities produced by pf_fold() from the global array pr and the pair
list base_pair produced by fold() and produces a postscript "dot plot" that is written to
filename. The "dot plot" represents each base pairing probability by a square of corresponding
area in an upper triangle matrix. The lower part of the matrix contains the minimum free energy
structure.
• void PS_rna_plot(char *sequence, char *filename)
reads the pair list base_pair produced by fold() and produces a conventional secondary
structure graph in postscript, and writes it to filename.
• char *random_string(int l, char *symbols)
generates a "random" string of characters from symbols with length l.
• int hamming(char *s1, char *s2)
returns the number of positions in which s1 and s2 differ, the so called "Hamming" distance.
• char *time_stamp()
returns a string containing the current date in the format "Fri Mar 19 21:10:57 1993".
• void nrerror(char *message)
writes message to stderr and aborts the program.
• double urn()
returns a pseudo random number in the range [0..1[, using a 48-bit linear congruential method.
• unsigned short xsubi[3]
is used by urn(). These should be set to some random number seeds before the first call to
urn().
• int int_urn(int from, int to)
generates a pseudo random integer in the range [from, to].
• void *space(unsigned int size)
returns a pointer to size bytes of allocated and 0 initialised memory; aborts with an error if
memory is not available.
• char *get_line(FILE *fp)
reads a line of arbitrary length from the stream *fp, and returns a pointer to the resulting
string. The necessary memory is allocated and should be released using free() when the
string is no longer needed.
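As a small illustration of one of these utilities, the Hamming distance can be written as follows. The function sketch_hamming is a hypothetical re-implementation for demonstration and not the library's hamming(); in particular, it assumes that both strings have the same length and checks this with an assertion, while the behaviour of the library function for strings of unequal length is not specified here.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: number of positions at which s1 and s2 differ.
   Assumption: both strings have the same length. */
int sketch_hamming(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    assert(strlen(s1) == strlen(s2));
    int h = 0;
    for (size_t i = 0; s1[i] != '\0'; i++)
        if (s1[i] != s2[i])
            h++;
    return h;
}
```

For example, sketch_hamming("GCGC", "GAGA") counts the two mismatches at the second and fourth position.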
4. A Small Example Program
This program folds two sequences using the mfe and partition function algorithms and calculates
the tree edit and profile distance of the resulting structures and structure ensembles.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "utils.h"
#include "fold_vars.h"
#include "fold.h"
#include "part_func.h"
#include "inverse.h"
#include "RNAstruct.h"
#include "dist_vars.h"
#include "treedist.h"
#include "stringdist.h"
#include "profiledist.h"

int main(void)
{
   char *seq1="CGCAGGGAUACCCGCG", *seq2="GCGCCCAUAGGGACGC",
        *struct1, *struct2, *xstruc;
   float e1, e2, tree_dist, string_dist, profile_dist;
   Tree *T1, *T2;
   swString *S1, *S2;
   float **pf1, **pf2;

   /* fold both sequences with the mfe algorithm */
   temperature = 30.;      /* set temperature *before* initialising */
   initialize_fold(strlen(seq1));
   struct1 = (char *) space(sizeof(char)*(strlen(seq1)+1));
   e1 = fold(seq1, struct1);
   struct2 = (char *) space(sizeof(char)*(strlen(seq2)+1));
   e2 = fold(seq2, struct2);
   free_arrays();

   /* convert both structures to trees and to strings for comparison */
   xstruc = expand_Full(struct1);
   T1 = make_tree(xstruc);
   S1 = Make_swString(xstruc);
   free(xstruc);
   xstruc = expand_Full(struct2);
   T2 = make_tree(xstruc);
   S2 = Make_swString(xstruc);
   free(xstruc);

   edit_backtrack = 1;     /* also produce the alignments */

   /* tree edit distance */
   tree_dist = tree_edit_distance(T1, T2);
   free_tree(T1); free_tree(T2);
   unexpand_aligned_F(aligned_line);
   printf("%s\n%s  %3.2f\n", aligned_line[0], aligned_line[1], tree_dist);

   /* string edit distance */
   string_dist = string_edit_distance(S1, S2);
   free(S1); free(S2);
   printf("%s\n%s  %3.2f\n", aligned_line[0], aligned_line[1], string_dist);

   /* profile distance of the two structure ensembles */
   init_pf_fold(strlen(seq1));
   e1 = pf_fold(seq1, struct1);
   pf1 = Make_bp_profile(strlen(seq1));
   e2 = pf_fold(seq2, struct2);
   pf2 = Make_bp_profile(strlen(seq2));
   profile_dist = profile_edit_distance(pf1, pf2);
   printf("%s\n%s  %3.2f\n", aligned_line[0], aligned_line[1], profile_dist);
   free_profile(pf1); free_profile(pf2);
   return 0;
}
5. References
M. Zuker, P. Stiegler (1981)
Optimal computer folding of large RNA sequences using thermodynamic and auxiliary information, Nucl Acids Res 9: 133-148
J.S. McCaskill (1990)
The equilibrium partition function and base pair binding probabilities for RNA secondary
structures, Biopolymers 29: 1105-1119
D.H. Turner, N. Sugimoto and S.M. Freier (1988)
RNA structure prediction, Ann Rev Biophys Biophys Chem 17: 167-192
J.A. Jaeger, D.H. Turner and M. Zuker (1989)
Improved predictions of secondary structures for RNA, Proc. Natl. Acad. Sci. USA 86: 7706-7710
B.A. Shapiro (1988)
An algorithm for comparing multiple RNA secondary structures, CABIOS 4, 381-393
B.A. Shapiro, K. Zhang (1990)
Comparing multiple RNA secondary structures using tree comparison, CABIOS 6, 309-318
W. Fontana, D.A.M. Konings, P.F. Stadler, P. Schuster (1993)
Statistics of RNA secondary structures, Biopolymers, in press
W. Fontana, P.F. Stadler, E.G. Bornberg-Bauer, T. Griesmacher, I.L. Hofacker, M. Tacker, P.
Tarazona, E.D. Weinberger, P. Schuster (1993)
RNA folding and combinatory landscapes, Phys. Rev. E 47: 2083-2099
D. Adams (1979)
The hitchhiker's guide to the galaxy, Pan Books, London