Fast Folding and Comparison of RNA Secondary Structures Ivo L. Hofacker Walter Fontana Peter F. Stadler L. Sebastian Bonhoeffer Manfred Tacker SFI WORKING PAPER: 1993-07-044 SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant. ©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder. www.santafe.edu SANTA FE INSTITUTE Fast Folding and Comparison of RNA Secondary Structures (The Vienna RNA Package) By Ivo L. L. HOFACKER a ,*, WALTER FONTANA c , PETER F. STADLERa,c, SEBASTIAN BONHOEFFERd , MANFRED TACKER a AND PETER SCHUSTER a,b,c aInstitut fur Theoretische Chemie der Universitiit Wien A-lOgO Wien, Austria bInstitut fur Molekulare Biotechnologie D-07745 Jena, Germany CSanta Fe Institute Santa Fe, NM 87501, U.S.A. dDepartment of Zoology, University of Oxford South Parks Road, Oxford OX1 3PS, UK *Mailing Address: Institut fur Theoretische Chemie der Universitiit Wien WiihringerstraBe 17, A-lOgO Vienna, Austria Email: ivo~otter.itc.univie.ac.at Phone: **43 1 40 480 678, Fax: **43 1 40 28 525 HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs Abstract Computer codes for computation and comparison of RNA secondary structures, the Vienna RNA package, are presented, that are based on dynamic programming algorithms and aim at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and base pairing probabilities. An efficient heuristic for the inverse folding problem of RNA is introduced. In addition we present compact and efficient programs for the comparison of RNA secondary structures based on tree editing and alignment. All computer codes are written in ANSI C. They include implementations of modified algorithms on parallel computers with distributed memory. Performance analysis carried out on an Intel Hypercube shows that parallel computing becomes gradually more and more efficient the longer the sequences are. 1. Introduction Recent interest in RNA structures and functions was caused by their catalytic capacities (Cech & Bass 1986, Symons 1992) as well as by the success of selection methods in producing RNA molecules with perfectly taylored properties (Beaudry & Joyce 1992, Ellington & Szostak 1990). Conventional structure analysis concentrates on natural molecules and closely related variants which are accessible by site directed mutagenesis. Several current projects are much more ambitious (particularly encouraged by ready availability of random RNA sequences) and aim at the exploration of sequencestructure relations in full generality (Fontana et ai. 1991, 1993a,b). The new approach turned out to be successful on the level of RNA secondary structures. In order to be able to do proper statistics millions of structures derived from arbitrary sequences have to be analyzed. In addition folding of long sequences becomes more and more important as well. Both tasks call for fast and efficient folding algorithms available on conventional sequential -1- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs computers as well as on parallel machines. The need arises to compare the performance of sequential and parallel implementations in order to provide information for the conception of optimal strategies for given tasks. The inverse folding problem is one of several new issues brought up by recent developments in rational design of RNA molecules: given an RNA secondary structure, which are the RNA sequences that form this structure as a minimum free energy structure. The information about many such "structurally neutral" sequences is the basis for tayloring RNA molecules which are suitable candidates for multi-functional molecules. More and more sequence data becoming currently available call for efficient comparisons either directly or on the level of their minimum free energy structures. Conventional alignment techniques are supplemented by new approaches like statistical geometry (Eigen et al. 1988) and split decomposition (Bandelt & Dress 1992). In this paper we introduce a package for computation, comparison and analysis of RNA secondary structures and properties derived from them, the Vienna RNA Package. The core of the package consists of compact codes to compute either minimum free energy structures (Zuker & Stiegler 1981, Zuker & Sankoff 1984) or partition functions of RNA molecules (McCaskill 1990). Both use the idea of dynamic programming originally applied by Waterman (Waterman 1978, Waterman & Smith 1978, Nussinov & Jacobson 1980). Non-thermodynamic criteria of structure formation like maximum matching (the maximal number of base pairs; Nussinov et al. 1978) or various versions of kinetic folding (Martinez 1984) can be applied as alternative options. An inverse folding heuristic is implemented to determine sets of structurally neutral sequences. A statistics package is included which contains routines for cluster analysis, statistical geometry, and split decomposition. This core is now avaliable as library as well as a set of stand alone routines. -2- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs In a forthcoming version the package will include routines for secondary structure statistics (Fontana et al. 1993b), statistical analysis of RNA folding landscapes as well as sequence-structure maps (Schuster et al. 1993). Further options will be available for RNA melting kinetics, in particular for the computation of melting curves and their first derivatives (Tacker et al. 1993). Extensions of the package provide access to computer codes for optimization of RNA secondary structrues according to predefined criteria, as well as simulations of molecular evolution experiments in flow reactors (Fontana & Schuster 1987, Fontana et al. 1989). In section 2 we present the core codes for folding as well as some 1/0routines that can be used for stand-alone applications of the folding programs. Section 3 introduces a variant of the folding program which is suitable for implementation on a parallel computer with hypercube architecture. Section 4 is dealing with the inverse folding problem. Section 5 decribes codes for comparing RNA secondary structures as well as base-pairing matrices derived from partition functions. Basic to our routines is a tree representation of RNA secondary structures introduced previously (Fontana et al. 1993b). Some examples of selected applications of the Vienna RNA package are given in section 6. 2. RNA Folding Programs A secondary structure on a sequence is a list of base pairs i, j with i < j such that for any two base pairs i,j and k, I with i ::; k holds: i=k {=} j=l k<j ===> i<k<l<j (1) The first condition implies that each nucleotide can take part in not more that one base pair, the second condition forbids knots and pseudoknots. The latter restriction is necessary for dynamic programming algorithms. A base -3- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs pair k,l is interior to the base pair i,j, if i < k < I < j. It is immediately interior if there is no base pair p, q such that i < p < k < I < q < j. For each base pair i, j the corresponding loop is defined as consisting of i, j itself, the base pairs immediately interior to i, j and all unpaired regions connecting these base pairs. The energy of the secondary structure is assumed to be the sum of the energy contributions of all loops. (Note that a stacked basepair constitutes a loop of zero size.) As a consequence of the additivity of the energy contributions, the minimum free energy can be calculated recursively by dynamic programming (Waterman 1978, Waterman & Smith 1978, Zuker & Stiegler 1981, Zuker & Sankoff 1984). Experimental energy parameters are available for the contribution of an individual loop as functions of its size, of the type of its delimiting basepairs, and partly of the sequence of the unpaired strains. These are usually measured for T = 37°C and 1M sodium chloride solutions (Freier et al. 1986, Jaeger et al. 1989). For the base pair stacking the enthalpic and entropic contributions are known separately. Contributions from all other loop types are assumed to be purely entropic. This allows to compute the temperature dependence of the free energy contributions: .6.Gstac k = .6.H37 ,stack - T .6.S37 ,stack .6.G\oop = - T .6.S37 ,loop (2) We use a recent version of the parameter set published by Freier et al. (1986), which was supplied in an updated version by Danielle Konings. In the current implementation we do not consider dangling ends. The essential part of the energy minimization algorithm is shown in table 1. The structure (list of base pairs) leading to the minimum energy is usually retrieved later on by "backtracking" through the energy arrays. -4- uu_~ HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs Table 1: Pseudo Code of the minimum free energy folding algorithm. for(d=1. .. n) for(i=1. .. d) j=i+d C[i ,j] = MIN( Hairpin(i,j), MIN( i<p<q<j : Interior(i,j;p,q)+C[p,q] ), MIN( i<k<j : FM[i+1,k]+FM[k+1,j-1]+cc ) ) F[i,j] = MIN( C[i,j], MIN(i<k<j : F[i,k]+F[k+1,j])) FM[i,j]= MIN( C[i,j]+ci, FM[i+1,j]+cu, FM[i,j-1]+cu, MIN( i<k<j : FM[i,k]+FM[k+1,j] ) ) free_energy = F[1,n] Remark. F [i ,j] denotes the minimum energy for the subsequence consisting of bases i through j. C[i,j] is the energy given that i and j. pair. The array FM is introduced for handling multiloops. The energy parameters for all loop types except for multiloops are formally subsumed in the function Int~rior(i,j jp,q) denoting the energy contribution of a loop closed by the two base pairs i-j and p-q. We have assumed that multi-loops have energy contribution F=cc+ci*I+cu*U, where I is the number of interior base pairs and U is the number of unpaired digits of the loop. The time complexity here is O(n 4 ). It is reduced to O(n 3 ) by restricting the size of interior loops to some constant, say 30. The partition function for the ensemble of all possible secondary structures can be calculated analogously 6.G(S) Q= e RT (3) all structures S (McCaskill 1990). A pseudocode is given in table 2. Clearly the algorithm of table 2 does not predict a secondary structure, but one can calculate the probability P k1 for the formation of a base pair (4) p .. 'J Ecc· Eci i,i i<k<l<i b Q kl i,i i<k<l<i [E cu k-i-1Qm l+l,j-l + Qmi+l,k-l ECU j-l-l + Qmi+l,k-l Qml+l,j-l ] , where the symbols are defined in the caption of table 2. In this case the backtracking has time complexity O(n 3 ) just as the calculation of the partition function itself. -5- _ HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs Table 2: Pseudocode for the calculation of the partition function. for(d=1 ... n) for(i=1. .. d) j=i+d QB[i,j] = EHairpin(i,j) + SUM( i<p<q<j : Elnterior(i,j;p,q)*QB[p,q] ) + SUM( i<k<j : QM[i+1,k-1]*QM1[k,j-1]*Ecc ) QM[i,j] = SUM( i<k<j : (Ecu·(k-i)+QM[i,k-1])*QM1[k,j] ) QM1[i,j]= SUM( i<k<=j : QB[i,k]*Ecu·(j-k)*Eci ) Q[i,j] = 1 + QB[i,jJ + SUM( i<p<q<j : Q[i,p-1]*QB[p,q] ) partition_function = Q[1,n] Remark. Here Ex:=exp( -x / RT) denotes the Boltzmann weights corresponding to the energy contribution x. Q[i,j] denotes the partition function Qij of the subsequence i through j. The array QM contains the partition function Q~j of the subsequence subject to the fact that i and j form a base pair. QM and QM1 are used for handling the multiloop contributions. x y means x'. For details see McCaskill (1990). Both folding algorithms have been integrated into a single interactive program including postscript output of the minimum energy structure and the base pairing probability matrix. The program requires 6 n 2 bytes of memory for the minimum energy fold and 10 n 2 bytes for the calculation of the partition function on machines with 32 bit integers and single precision floating points. In order to overcome overflows for longer sequences we rescale the partition function of a subsequence of length £ by a factor (lin, where Qis a rough estimate of the order of magnitude of the partition functions: - 37.)) Q- -_ exp (-184.3 + 7.27(T RT The performance of the algorithms reported here (5) IS compared with Zuker's (Zuker 1989) more recent program mfold 2.0 (available via anonymous ftp from nrcbsa.bio.nrc.ca) which computes suboptimal structures together with the minimum free energy structure in table 3. The computation of the minimum free energy structure including the entire matrix of base pairing probabilities is considerably faster with the present package (although we -6- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs do not provide information on individual suboptimal structures). Secondary structures are represented by a string of dots and matching parentheses, where dots symbolize unpaired bases and matching parentheses symbolize base pairs. An example is seen in the sample session shown in in figure 1. Table 3: Performance of implementations of folding algorithms. CPU time is measured on a SUN SPARC 2 Workstation with 32M RAM. Data are for random sequences. n 100 200 300 500 690 CPU time per folding [5] mfold 2.0 RNAfold 1.0 MFE MFE+PF 2.0 6.2 24.5 10.9 34.7 129.6 32.2 97.4 354.4 1258.3 96.6 312.4 228.1 743.9 3105.1 Because of the simplifications in the energy model and the uncertainties in the energy parameters predictions are not always as accurate as one would like. It is, therefore, desirable to include additional structural information from phylogenetic or chemical data. Our minimum free energy algorithm allows to include a variety of constraints into the secondary structure prediction by assigning bonus energies to structures honoring the constraints. One may enforce certain base pairs or prevent bases from pairing. Additionally, our algorithm can deal with bases that have to pair with an unknown pairing partner. A sample session is described in figure 2. -7- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs tram> RIAfold -T 42 -pi Input string (upper or lo~er case); G to quit ......... 1 2 3 4 UUGGAGUACACAACCUGUACACUCUUUC length 28 5 6 7 . = UUGGAGUACACAACCUGUACACUCUUUC .. CCCCC .. CCC ... ») .. »») ... minimum free energy = -3.71 .. CCCCC[[C" ... ») .. »») ... free energy of ensemble = -4.39 frequency of mfe structure in ensemble 0.337231 o u.u.G.G.~.O.UAC.A.C.A.~.C.C.u.G.U.A.c.~.c.u.c.u.u.u.c.c:: g: .. :: ::.::.::.:: : :.:.:: : :';)4'-:':'::.:.:.:~ 0::: .. :::: ::::: :.:::: :.: :.'. ... '.':':':0> o' • • • • • • . . . . • '.' •• '.' '.' . ········0 ::;.' . • . .. . •••••..•••.• '.' ••••• 'c .( rna.ps dot.ps Figure 1: Interactive example run of RNAfold for a random sequence. When the base pairing probability matrix is calculated the symbols. , I { } ( ) are used for bases that are essentially unpaired, weakly paired, strongly paired without preferred direction, weakly upstream (downstream) paired, and strongly upstream (downstream) paired, respectively. Apart from the console output the two postscript files rna. ps and dot. ps are created. The lower left part of dot. ps shows the minimum energy structure, while the upper right shows the pair probabilities. The area of the squares is proportional to the binding probability. 3. Parallel Folding Algorithm. We provide an implementation of the folding algorithm for parallel computers. In the following we present a parallelized version of the minimum energy folding algorithm for message passing systems. Since all subsequences of equal length can be computed concurrently, it is advisable to compute the -8- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF Input string (upper or ......... 1 lo~er 2 RNAs case); G to quit 3 4 5 6 7 . CACUACUCCAAGGACCGUAUCUUUCUCAGUGCGACAGUAA .«•...... « ».. 11 length = 40 CACUACUCCAAGGACCGUAUCUUUCUCAGUGCGACAGUAA .«««•. ««(. .... »») ... ») ..... ») .. minimum free energy = 0.83 a) CACUACUCCAAGGACCGUAUCUUUCUCAGUGCGACAGUAA ««•.... (((((. »») »» . minimum free energy = -1.52 b) Figure 2: a) Example Session of RNAfold -c. The constraints are provided as a string consisting of dots for bases without constraint, matching pairs of round brackets for base pairs to be enforced, the symbols '<' and '>' for bases that are paired upstream and downstream, respectively, and the pipe symbol' I' denoting a paired base with unknown pairing partner. b) shows minimum free energy structure without contraints for comparison. arrays F, C and FM (defined in table 1) in diagonal order, dividing each subdiagonal into P pieces. Figure 3 shows an example for 4 processors. Our algorithm stores the arrays F and FM both as columns and rows, while the C array is stored in diagonal order. The maximal memory requirement occurs at d = n/2, where we need n 2 /(2P) integers each for F and FM, while the array C needs only O(n) storage. Since the length of the rows and columns increases, one needs to reorganize the storage after each diagonal. If one allocates twice the minimum memory, storage has to be reorganized only once and the total requirement is the same as for the sequential algorithm. After completing one subdiagonal each processor has to either send a row to or receive a column from its right neighbour, and it has to either receive a row from or send a column to its left neighbour. Since we do not store the entire matrices, we cannot do the usual backtracking to retrieve the structure corresponding to the minimum energy. Instead, we write for each index pair (i,j) two integers to a file, which identify -9- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs Figure 3: Representation of memory usage by the parallel folding algorithm. The triangle representing the triangular matrices F, C, and FM, respectively, is divided into sectors with an equal number of diagonal elements, one for each processor. The computation proceeds from the main diagonal towards the upper right corner. The information needed by processor 2 in order to calculate the elements of the dashed diagonal are highlighted. To compute its part of the dashed diagonal processor 2 needs the horizontally and vertically striped parts of the arrays F and FM, and the shaded part of the array c. The shaded part does not extend to diagonal, because we have restricted the maximal size of interior loops. the term that actually produced the minimum. The backtracking can then be done with O(n) readouts. All in all we need O(n) communication and I/O steps each transferring O(n) integers, while the computational effort is O(n3 ). The communication overhead therefore becomes negligible for sufficiently long sequences. On the Intel Ipse/2 the advantage of storing F and FM in rows and columns outweighs the I/O overhead for sequences longer than some 200 nucleotides. The efficiency of the parallelization as a function of sequence length and number of processors can be seen in figure 4. The partition function algorithm can be parallelized analogously. -10 - HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs .= '''•......'. ' .... ........... ~ ~ 20 _ . -'-OL-----~----~~---~-----" 1 2 4 Number of Processors 8 16 Figure 4: Performance of parallel algorithm for random sequences of length 50 0, 100 0, 200 <>, 400 6, 700 <l, 1000 '7. Efficiency is defined as speedup divided by the number of processors. The dotted line is lIn corresponding to no speedup at all. 4. Inverse Folding Inverse folding is highly relevant for two reasons: (1) to find sequences that form a predefined structure under the minimum free energy criterium, and (2) to search for sequences whose Boltzmann ensembles of structures match a given predefined structure more closely than a given threshold. The first aspect is directly addressed by the inverse folding algorithm presented here. The second issue is somewhat more realistic. In the case of compatible sequences, that is sequences which can form a base pair wherever it is required in the target structure, we shall often find the desired structure among suboptimal foldings with free energies close to the minimal value. Matching base pairing probabilities can be approached by the same inverse folding heuristic using the base pairing matrices of the partition function algorithm rather than minimum free energy structures. -11- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs Only compatible sequences are considered as candidates in the inverse folding procedure. Clearly, a compatible sequence can but need not have the target structure as its minimum free energy structure. Our basic approach is to modify an initial sequence 10 , such as to minimize a cost function given by the "structure distance" f(l) = d(S(I), T) between the structure S(I) of the test sequence 1 and the target structure T. A set of possible distance measures is discussed in section 5. The actual choice of this distance measure is not critical for the performance of the algorithm. This procedure reqUIres many evaluations of the cost functions and thereby many executions of the folding algorithm. Instead of running the optimization directly on the full length sequence, we optimize small substructures first, proceeding to larger ones as shown in the flow chart (figure 5). This reduces the probability of getting stuck in a local minimum, and more important, it reduces the number of foldings of full length sequences. This is possible because substructures contribute additively to the energy. If T is an optimal structure on the sequence S and contains the base pair (i,j) then the substructure T;,j must be optimal On the subsequence Si,j. It is likely then, but by nO means necessary, that the COnverse also holds: A structure that is optimal for a subsequence will also appear with enhanced probability as a substructure of the full sequence. For the actual optimization (denoted by 'FindStructure' in figure 5) we use the simplest possibility, an adaptive walk. In general, an adaptive walk will try a random mutation, and accept it iff the cost function decreases. A mutation consists in exchanging one base at positions that are unpaired in the target structure T, or in exchanging two bases, while retaining compatibility, if their corresponding positions pair in T. If no advantageous mutation can be found, the procedure stops, and restart again with a new initial string -12 - HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs HND HAIRPIN SUBSTRUCTURE:=HAIRPIN ELONGATESUBSTRUCTUBE BY I BP HND SUBSEQUENCE N Y EWNGATESUBSTRUCTURE TO INCLUDE EXTERIOR BP OF THE MULTILOOP ADD EXTERNAL BASES TO COMPONENT HND SUBSEQUENCE N CONCATENATE COMPONENTS HND SEQUENCE TO FULL STRUCTURE Figure 5: Flow chart of the inverse folding algorithm. 10 • Optionally, we can restrict the search to bases that do not pair correctly. This slightly increases the probability that no sequence can be found, but greatly reduced the search space. Sequences found by the inverse folding algorithm will often allow alternative structures with only slightly higher energies. To find sequences with -13 - HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs 1000 100 10 - T-40"(U100)A3.5 1SO·L----.J70,------c-10LO-----1..cSO,-----2.J0-,-0--2--'SO Length Figure 6: Performance of inverse folding. full line: T=4·10- 6 n 3 . 5 . clearly defined structures, the partition function algorithm can be used to maximize the probability peT) 1 = Q exp( -b..G(T)/RT) (6) of the desired structure in the ensemble. This procedure is much slower, since the optimization is done with the full length sequence. 5. Comparison of Secondary Structures RNA secondary structures can be represented as trees (Zuker & Sankofl:' 1984, Shapiro 1988, Shapiro & Zhang 1990). The tree representation was used, for example, to obtain coarse grained structures revealing the branching pattern or the relative positions of loops. More recently we proposed a tree representation at full resolution (Fontana et ai. 1993b). A secondary structure Sk is converted one to one into a tree Tk by assigning an internal node to each base pair and a leaf node to each unpaired digit (Figure 8). The -14 - HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs tram> RNAinverse -Fm -R -3 Input structure k start string (lover case letters for const positions) G to quit, and 0 for random start string ......... 1. 2 3 4 5 6 7 .....•... «««( .. ((((. o »». ««(. »») « «(. » »»»»» .... length = 76 CUAUACUACGAGGAUAAUCUGCCUUUUGCCAAAGAGGGUGGCAUUUCAUCAGCUCCGAAUGCUGAGGUAUAGCGAA 20 AGCUCUGAUAUCUCUACGAAUAGAUCCUUUAUAUCUCUUAAAGCGUGUCUGGAAGAUAACUCCAGCAGAGCUUGUG 25 UUCUCCUGUAGUCGACUUUAGGACUCGAGGCCGUAUUUGCCUCACGGAAAUGUUUACAAUGCAUUAGGAGGAGUGC 29 A tram> RNAinverse -Fp -R 3 Input structure t start string (lower case letters for const positions) G to quit, and 0 for random start string ......... 1. 2 ••..••••• 3 ••....••. 4 ..••••••• 5 6 7 ....••.•• «««(. .«« o »». ««(. »») « «(. »»»»»» . length = 76 GCUAGCGUUGGGCUUUUUUUCGCCCUGCCGCAAAACCCGCGGCUUCUCGCUACAUCUCUCGUAGCCGCUAGCAAAA 50 (0.844786) GCGUUACAAGCGCAAUCCCCCGCGCAGCGUCAAAACCCGACGCCAACAGCUACAAAACCCGUAGCGUAACGCAAAA 55 (0.859519) GCGCGCCAAGCGCAAAAAAAAGCGCAGCCGCAAAACACGCGGCAAAAAGCGGCAGAAAAAGCCGCGGCGCGCAAAA 49 (0.85046) B Figure 7: Sample session of RNAinverse. A: Minimum free energy_ B: Partition func- tion. Data on the natural tRNAphe sequence for comparison: the clover leaf structure occurs only with a probability of 0.172511 and with even smaller probabilities for the three sequences found by the RNAinverse -Fm (0.08917I7, 0.0329602, and 0.03357I3, respectively). conversion starts with a root which does not correspond to a physical unit of the RNA molecule. It is introduced to prevent the formation of a tree forest for RNA structures with external elements (For details of the interconversion of secondary structures and trees see also Fontana et at. 1993b). As shown in (Fontana et at. 1993b) the trees T k can be rewritten as homeomorphica1ly irreducible trees (HITs), see appendix B. The transformation from the full tree to the HIT retains complete information on the structure. Secondary structure, full tree as well as HIT are equivalent. Tree editing induces a metric in the space of trees (see appendix A), and in the space of RNA secondary structures. A tree is transformed into another -15 - HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF A ,, , '" RNAs B .--.--.. ' Figure 8: Interconversion of secondary structures and trees. A secondary structure graph (A) is equivalent to an ordered rooted tree (B). An internal node (black) of the tree corresponds to a base pair (two nucleotides), a leaf node (white) corresponds to one unpaired nucleotide, and the root node (black square) is a virtual parent to the external elements. Contiguous base pair stacks translate into "ropes" of internal nodes and loops appear as bushes of leaves. Recursively traversing a tree by first visiting the root, then visiting its subtrees in left to right order, finally visiting the root again, assigns numbers to the nodes in correspondence to the 5'-3' positions along the sequence (Internal nodes are assigned two numbers reflecting the paired positions). tree by a series of editing operations with predefined costs (Tai 1979, Sankoff & Kruskal 1983). The distance between two trees d(Tj, Tk) is the smallest sum of the costs along an editing path. The parameters used in our tree editing are summarized in appendix A. The editing operations preserve the relative traversal order of the tree nodes. Tree editing can therefore be viewed as a generalization of sequence alignment. In fact, for trees that consist solely of leaves, tree editing becomes the standard sequence alignment. A sample -16 - HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs session computing the tree distance between two arbitrarily chosen structures is shown in figure 9. An alternative graphical method for the comparison of RNA secondary structures (Hogeweg & Hesper 1984, Konings 1989, Konings & Hogeweg 1989) encodes secondary structures as linear strings with balanced parentheses representing the base pairs, and some other symbol coding for unpaired positions. Tree representations in full resolution make it often difficult to focus on the major structural features of RNA molecules since they are often overloaded with irrelevant details. Coarse-grained tree representations were invented previously to solve this problem (Shapiro 1988). tram> RXAdistance -DfhvcHWC -B Input structure; Q to quit ......... 1 «.«««( .....«« .. «« 26 (-----(. «(--«« f: h: 2 »»»).» (-«( .. «« 3 4 S « .. «« »».» . »».»)-)- --.--«( »».»» «(. »). -----» »-»).» « .. «« 6 7 . »».». -»)-.--- 32 (----«Ul) «US-)P7) (Ul)P2) (U4) «U2)«US)P4) (Ul)P2) (Ul) Rl) «US) «U2) «Ul0)P4)(Ul)P4)(US)-----«U4)P3) (Ul)-------Rl) v: 34 «««HS-)S7)I2)S2)««HS)S4)I3)S2)ES-)Rl) «««Hl0)S4)I3)S4)--«H4)S3)------Ell)Rl) c: 3 ««(Hl)Sl)Il)Sl)««Hl)Sl)Il)Sl)Rl) ««(Hl)Sl)Il)Sl)--«Hl)Sl)------Rl) H: 31 (Rl------(P2(U1Ul) (P7(USUS)P7) (U1Ul)P2) (U4U4) (P2(U2U2) (P4(USUS)P4) (U1Ul)P2) (U1Ul)Rl) (Rl(USUS) (P4(U2U2) (P4(U1Ul)P4) (U1Ul)P4) (USUS)--------- (P3(U4U4)P3)---------(U1Ul)Rl) W: 32 (Rl(ES-(S2(I2(S7(HSHS)S7)I2)S2)(S2(I3(S4(HSHS)S4)I3)S2)ES)-Rl) (Rl(Ell(S4(I3(S4(H1Hl)S4)I3)S4)------(S3(H4H4)S3)------Ell)Rl) C: 3 (R(S(I(S(HH)S)I)S)(S(I(S(HH)S)I)S)R) (R(S(I(S(HH)S)I)S)----(S(HH)----S)R) Figure 9: Interactive sample session of RNAdistance. For this example we have used two random sequences folded by RNAfold. -17- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs Base pairing probability matrices may also be compared by an alignment-type method (Bonhoeffer et ai. 1993). Since a secondary structure is representable as a string (see figure 9), comparison of structures can be done by standard string alignment algorithms (e.g. Waterman 1984). This approach has been generalized to structure ensembles in (Bonhoeffer et ai. 1993). We compute for each position i of the sequence the probability to be upstream paired, downstream paired or unpaired. p~ = LPij j>i (7) pl = LPij j<i The probability that the base at position i is unpaired is pi = 1 - P~ - p}. A reasonable definition for the distance of two such vectors, p(Sl) and p(S2), uses again an alignment procedure at the level of the vectors p<, p) and pO. We then define the distance measure for an aligned position (i, j) by (8) and 5( i) = 0 for inserted or deleted positions. The distance of two structure ensembles is given by the minimum total edit costs as in an ordinary string alignment (The numerical value of this distance is twice the distance measure defined in Bonhoeffer et ai. 1993). -18 - HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs 6. Applications 6.1. Long RNA Molecules 16sRNA, yeast. Chain length: 1542 nucleotides. Performance: 42min on an IBM-RS6000!550 with 64 Mbyte. Table 4: Free energies of 16sRNA from yeast. The phylogenetic structure decomposes into six components. Bases 1 557 884 921 1397 1498 - - MFE 556 883 920 1396 1497 1542 ~ 1 - 1542 -130.63 -89.18 -6.55 -111.66 -21.00 -15.11 -374.13 -379.22 phylogenetic structure "as is" completed -77.14 -97.76 -70.64 -61.31 -6.55 -3.76 -37.68 -68.70 -18.15 -18.15 -14.11 -13.91 -211.95 -275.91 -279.48 -211.95 Rem.ark. The fourth component, bases 921-1396, shows the largest deviations. This region contains large multiloops. It seems that in general local interactions are predicted more reliably than long range base pairs. Minimum free energy structure contains 259 (58%) of the 441 base pairs predicted by a phylogenie structure (5 uncommon base pairs, and the 5 pairs forming a pseudo-knot are not counted). The cummulated pairing probability of the bases in the phylogenetic structure is about 54%. The cummulated pairing probability of the bases in the minimum free energy structure is 73%. The phlogenetic structure "as is" amounts to a free energy of -212kcaljmol as compared to -379kcaljmol for the minimum free energy structure. The difference of more than 300 RT indicates that the phylogenetic structure cannot be complete. We did a minimum free energy folding of -19 - HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs the sequence subject to the base pairs prescribed by the phylogenetic structure using RNAfold -c. One finds -279.5kcal/mol. This is still far away from the minimum free energy structure. The omitted interactions due to the pseudoknot and the uncommon base pairs cannot account for more than some 20 kcal/mol in the worst case. We are aware of the following possible explanations for the gap of about 100 RT: • The energy model for secondary structures is totally wrong. • 16sRNA is selected for its biochemical function which is not preformed by the free RNA, but by its complex with protein. The phylogenetically determined "structure" may then be completely different from the minimum free energy structure. Q(3 RNA. Chain length: 4220 nucleotides. Performance: 9.5h on an IBM-RS6000/S60 with 256 Mbyte. Folding of long RNA sequences can be extended up to chain lengths of about 5000 nucleotides. The entire Q(3-genome (Figure 10) was folded as a test case of an example of an entire viral genome. We do not imply that the real secondary structure of Q(3 RNA is identical with its minimum free energy structure. Kinetic effects are highly important for long sequences. The secondary structure obtained appears to be partitioned into three parts: (0-861), (862-3003), and (3004-4220). The middle part contains a large loop and other gross features which are also revealed by electron microscopy (Jacobson 1991). - 20- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs . ., . , ~ .' . -. ,. / , / .- Figure 10: The base pairing probabilities for the entire Qf3 genome (4420 nucleotides). 6.2. Heat Capacity of RNA Secondary Structures The heat capacity can be obtained from the partition function using EPG Cp = -T 8T2' and t:.G = -RTlnQ (9) The numerical differentiation is performed in the following way: We fit the function F( x) by the least square parabola y = cx 2 2m + 1 equally spaced points Xo - mh, Xo - (m -l)h, - 21- + bx + a through the ..., Xo, ..• , Xo + mho HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs 4 o t::::::::::::.~~~~~~--~~~,---::::="J o 20 40 60 80 100 120 Temperature [C] Figure 11: Heat Capacity of the tRNA-phe from yeast calculated using RNAheat -Tmax 120 -h 0.1 -m5. The second derivative of F is then approximated by F"(xo) = 2c. Explicitly, we obtain F"(xo) = f k=-m 2 30 (3k - m(m + 1)) F(xo 2 m(4m -l)(m + 1)(2m + 3) + kh) (10) As an example we show the heat capacity of tRNAtextphe from yeast (Figure 11). 6.3. Landscapes and combinatory maps Statistical features of the relations between sequences, structures and other properties of RNA molecules were studied in several previous papers (Fontana et al. 1991, 1993a,b; Bonhoeffer et al. 1993, Tacker et al. 1993). We mention here free energy landscapes and structure maps as two representative examples. Relations between sequences I and derived objects 9(1) (which may be numbers like free energies, or structures, or trees, or anything that - 22- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs might be relevant in studying RNA molecules) are considered as mappings from one metric space (the sequence space X) into another metric space (Y), (11) The existence of a distance measure d g inducing a metric on the set of objects !I(I) is assumed. In the trivial case Y is JR. We are then dealing with a "landscape" which assigns a value to every sequence (a free energy, or an activation energy, for example). In general both spaces will be of combinatorial complexity and we coined the term "combinatory maps" for these mappings. In order to facilitate illustration and collection of data for statistical purposes a probability density surface is computed which expresses the conditional probability that two arbitrarily chosen sequences I Hamming distance dh(i,j) = h form two objects i I j of !Ii and !lj at a distance dg(i,j) = 'Y: Pg('Ylh) - Prob (dg(i,j) = 'Y I dh(i,j) = h) . (12) In the case of free energies we have 'Y = Ifj - fil, in case of the structure density surface (SDS) (Fontana et al. 1993b, Schuster et al. 1993) 'Y is the tree distance between the two structures obtained from the two sequences Ii and Ij. An example of a typical probability density surface of RNA secondary structures is shown in figure 12. These structure density surfaces may be used to compute various quantities. Autocorrelation functions, for example, are obtained by a straightforward calculation (Fontana et al. 1993a,b). - 23- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs 0. 10 100 80 "ac: <.> ~ '" '"c: °E 'i5 60 ~ E a :c 20 40 Structure distance 60 Figure 12: The structure density surface (SDS) for RNA sequences of length n=lOO (upper). This surface was obtained as follows: (1) choose a random reference sequence and compute its structure, (2) sample randomly 10 different sequences in each distance class (Hamming distance 1 to 100) from the reference sequence, and bin the distances between their structures and the reference structure. This procedure was repeated for 1000 random reference sequences. Convergence is remarkably fastj no substantial changes were observed when doubling the number of reference sequences. This procedure conditions the density surface to sequences with base composition peaked at uniformity, and does, therefore, not yield information about strongly biased compositions. Lower part: Contour plot of the SDS. - 24- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs II II Figure 13: Schema of the folding of an RNA hybrid. 6.4. RNA Hybridisation The folding algorithm can be generalized in a straightforward way to compute the structure resulting from the hybridisation of several RNA strands. This is done by concatenating the, say, N strands in all possible permutations. Any such permutation is folded almost as usual. The only difference concerns "loops" or a "stacked pair" which contain end-toend junctions of individual strands, and which are treated as appropriate external elements. The structure with the least free energy among all optimal structures corresponding to the N! sequence permutations is the optimal hybrid structure. In Figure 13 we show an example for N = 2. - 25- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs 6.5. Other applications We mention investigations based on other criteria for RNA folding as, for example, maximum matching of bases or various kinetic folding algorithms which can be readily incorporated into the package. RNA melting kinetics has been studied as well by the statistical techniques mentioned here (Tacker et al. 1993). In continuation of previous work (Fontana & Schuster 1987, Fontana et al. 1989) we started simulations experiments in molecular evolution which aim at a better understanding of evolutionary optimization processes on a molecular scale. Studies of this kind are becoming relevant for the design of efficient experiments in applied molecular evolution (Eigen & Gardiner 1984, Kauffman 1992). Acknowledgements The work reported here was supported financially in part by the Austrian Fonds zur Forderung der wissenschaftlichen Forschung (Project No. S 5305PHY and Project No. P 8526-MOB). The Institut fur Molekulare Biotechnologie e.V. is financed by the Thuringer Ministerium fur Wissenschaft und Kunst and by the German Bundesministerium fUr Forschung und Technologie. The Santa Fe Institute is sponsored by core funding from the U.S. Department of Energy (ER-FG05-88ER25054) and the National Science Foundation (PHY-8714918). Computer facilities were supplied by the computer centers of the Technical University and the University of Vienna. - 26- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs References Bandelt, H.-J. & A.W.M.Dress. (1992) Canonical Decomposition Theory of Metrics on Finite Sets. Adv.Math. 92, 47-105. Beaudry, A.A. & G.F.Joyce. (1992) Directed evolution of an RNA enzyme. Science 257, 635-641. Bonhoeffer S., J.S.McCaskill, P.F.Stadler & P. Schuster. (1993) Temperature dependent RNA landscapes. A study based on partition functions. Eur.Biophys.J. 22, 13-24. Cech, T.R. & B.L.Bass. (1986) Biological catalysis by RNA. Ann.Rev.Biocbem. 55, 599-629. Eigen, M. & W.Gardiner. (1984) Evolutionary molecular engineering based on RNA replication. Pure & App1. Cbem. 56, 967-978. Eigen, M., R.Winkler-Oswatitsch & A.Dress. (1988) Statistical geometry in sequence space: A method of quantitative comparative sequence analysis. Proc.Nat1.Acad.Sci. USA 85, 5913-5917. Ellington, A.D. & J.W.Szostak. (1990) In vitro selection of RNA molecules that bind to specific ligands. Nature 346, 818-822. Fontana W. & P.Schuster. (1987) A computer model of evolutionary optimization. Biopbys.Chem. 26, 123-147. Fontana W., W.Schnabl & P.Schuster. (1989) Physical aspects of evolutionary optimization and adaptation. Pbys.Rev.A 40, 3301-3321. Fontana W., T.Griesmacher, W.Schnabl, P.F.Stadler & P.Schuster. (1991) Statistics of landscapes based on free energies, replication and degradation rate constants of RNA secondary structures. Mb.Chem. 122, 795-819. Fontana W., P.F.Stadler, E.Bornberg-Bauer, T.Griesmacher, LL.Hofacker, M.Tacker, P.Tarazona, E.D.Weinberger & P.Schuster. (1993a) RNA folding and combinatory landscapes. Phys.Rev.E 47, 2083-2099. Fontana W., D.A.M.Konings, P.F.Stadler & P.Schuster. (1993b) Statistics of RNA secondary structures. Biopolymers, in press. - 27- - - - --- -- ------------------- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs Freier, S.M., RKierzek, J.A.Jaeger, N.Sugimoto, M.H.Caruthers, T.Neilson & D.H.Turner. (1986) Improved free-energy parameters for predictions of RNA duplex stability. Proc.Natl.Acad.Sci. USA 83, 93739377. Hofacker LL., P.Schuster & P.F.Stadler. (1993) Combinatorics of RNA Secondary Structures. Preprint. Hogeweg, P. & B.Hesper. (1984) Energy Directed Folding of RNA Sequences. Nucleic Acids Res. 12, 67-74. Jacobson, A.B. (1991) Secondary structure of Coliphage QfJ RNA. Analysis by electron microscopy. J.Mol.Biol. 221, 557-570. Jaeger J.A., D.H.Turner & M.Zuker. (1989) Improved predictions of secondary structures for RNA. Proc.Natl.Acad.Sci. USA 86, 77067710. Kauffman, S.A. (1992) Applied molecular evolution. J. Theor.Biol. 157, 1-7. Konings, D.A.M. (1989) Pattern analysis of RNA secondary structure. Proefschrift, Rijksuniversiteit te Utrecht. Konings, D.A.M. & P.Hogeweg. (1989) Pattern analysis of RNA secondary structure. Similarity and consensus of minimal-energy folding. J.Mol.Biol. 207, 597-614. Martinez, H.M. (1984) An RNA Folding Rule. Nucleic Acids Res. 12, 323334. McCaskill, J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29, 1105-1119. Nussinov, R, G.Pieczenik, J.RGriggs & D.J.Kleitman. (1978) Algorithms for loop matching. SIAM J.Appl.Math. 35, 68-82. Nussinov, R & A.B.Jacobson. (1980) Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc.Natl.Acad.Sci. USA 77, 6309-6313. Sankoff, D. & J.B.Kruskal, eds. (1983) Time warps, string edits and macromolecules: The theory and practice of of sequence comparison. Addison Wesley, London. - 28- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs Schuster, P., W.Fontana, P.F.Stadler & I. L. Hofacker. (1993) The distribution of RNA structures over sequences. Preprint. Shapiro, B. (1988) An algorithm for comparing multiple RNA secondary structures. CABIOS 4, 387-393. Shapiro, B. & K,Zhang. (1990) Comparing multiple RNA secondary structures using tree comparisons. CABIOS 6, 309-318. Symons, R.H. (1992) Small catalytic RNAs. Ann.Rev.Biochem. 61, 641-671. Tacker, M., W.Fontana, P.F.Stadler & P.Schuster. (1993) Statistics of RNA melting kinetics. Preprint. Tai, K, (1979) The Tree-to-tree Correction Probelem. J.ACM 26, 422433. Waterman, M.S. (1978) Secondary structure of single-stranded nucleic acids. In: Studies in Foundations and Combinatorics. Advances in Mathematics Suppl. Studies. Vol.1., pp.167-212. Academic Press, New York. Waterman, M.S. (1984) General Methods of Sequence Comparison. Bull.Math.Biol. 46, 473-500. Waterman, M.S. & T.F.Smith. (1978) RNA secondary structure: A complete mathematical analysis. Math.Biosc. 42, 257-266. Zuker, M. & P.Stiegler. (1981) Optimal computer folding of large RNA sequences using thermodynamic and auxilary information. Nucleic Acids Res. 9, 133-148. Zuker, M. & D.Sankoff. (1984) RNA secondary structures and their prediction. Bull.Math.Biol. 46, 591-621. Zuker, M. (1989) On finding all suboptimal foldings of an RNA molecule. Science 244, 48-52. - 29- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs Appendix A: Edit Distances and Edit Costs Let x, y, z be rooted ordered plane trees with labelled and weighted nodes. The labels are taken from some alphabet A, the weights are nonnegative real numbers. Denote the set of all such trees by TA. (For the relation of trees and secondary structures see figure 5 and appendix B.) We define elementary edit costs oa for insering or deleting a vertex with label a E A and O'ba = O'ba for replacing a vertex with label a by a vertex with label b. We require > > < o o (Tac and O'aa = 0 V a E A V a#b V a,b,c E A + Ucb (A.l) The last line implies, that if substitution of a and b is allowed, than its cost is at most the cost for first deleting a and then inserting b. Denote a vertex with label a E A and weight v ::::: 0 by [a, v]. We then define the weighted edit costs: .6.([a, v]) = vOa E([a,v], [b,w]) = min(v,w)O'ab + Iv - wi {~: if if v::::: w v :::; w (A.2) for insertion/deletion and substitution, respectively. The tree edit distance dT(X, y) of any two trees x, y E TA is defined as the minimum total edit cost needed to transform x into y, such that the partial order of the tree remains untouched. For a description of tree editing in general, and of its implementation as a dynamic programming algorithm see e.g. Tai (1979) and Shapiro (1990). It is easy to see that dT is a metric on TA for any alphabet A. Any tree x E TA can be transformed into a linear parenthesized representation by appropriately traversing the tree. (A leaf is represented by both an open and closed parentehsis.) This representation is string-like in the - 30- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs sense that one may interpret each parenthesis as a generalized character j of a string x. Each character is defined by its label, its weight, and its "sign", that is, whether it is an open or a closed parenthesis, e= [a, v, p] with a E A, v ::::: 0, and p ='(' or ')'. Its is easy to see that such a generalized string can be represented as a tree, provided it contains only matching parentheses and weigths and labels coincide for pairs of matching parentheses. We define the generalized string edit distance ds(x, jj) of two generalized strings x and jj as the minimum total edit cost needed to transform x into jj, where the edit costs for each step are given by 6.([a, v,p]) = .6.([a, v]) .f>([ ] [b ,w,q]) -_ {~([a, v], [b, w]) £..J a,v,p, 00 if if p =q (A.3) PI' q for insertion/deletion and substitution, respectively. This is to say that open parentheses may be matched with open parentheses only, and closed parentheses with closed parentheses only. Corresponding to the above two distance measures we have two types of alignments: In case of the edit distance for generalized strings it is straight forward. For the tree edit distance we have to transform the aligned trees into their linear representation in order to print them. This implies that each inserted or deleted node is complemented by gap characters in two positions, corresponding to the left and right bracket, respectively, and that matched nodes give rise to two matches in the linear representation. (See figure 9 for examples). As a consequence, each alignment of two trees x and y can be written as a valid alignment of the corresponding generalized string x and jj exhibiting the same edit cost. Hence ds(x, jj) ::; dT(X, y) - 31- for all x,y E TA (A.4) HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs In particular, let x be a subtree of y. Then d.(x, Ii) = dT(X, y). As a consequence there are arbitrarily large trees x and y such that the two distance measures coincide. The cost matrix we use is: 0 a 1 2 2 2 2 2 1 1 00 U 1 a P 2 1 1 a 00 00 00 00 H 2 B 2 I 2 M 2 S E R 1 1 00 0 00 00 00 00 00 00 00 U 00 00 00 00 00 00 00 2 2 1 2 2 2 00 00 00 P H 00 00 00 B 00 00 00 00 00 00 I M a a 00 00 00 00 2 2 2 00 00 00 00 00 00 a 00 00 00 00 00 00 00 00 a S 00 00 E 00 00 00 00 00 00 00 00 1 2 a 2 a a R Appendix B: Representation of Secondary Structures The simplest way of representing a secondary structure is the "parenthesis format" , where matching parentheses symbolize base pairs and unpaired bases are shown as dots. Alternatively, one may use two types of node labels, 'P' for paired and 'D' for unpaired; a dot is then replaced by '(D)', and each closed bracket is assigned an additional identifier 'P'. In (Fontana et al. 1993b) a condensed representation of the secondary structure is proposed, the so-called homeomorphically irreducible tree (HIT) representation. Here a stack is represented as a single pair of matching brackets labelled 'P' and weighted by the number of base pairs. Correspondingly, a contiguous strain of unpaired bases is shown as pair of matching bracket labelled 'D' and weighted by its length. Bruce Shapiro (1988) proposed another representation, which, however, does not retain the full information of the secondary structure. He represents the different structure elements by single matching brackets and labels them as 'H' (hairpins loop), 'I' (interiorloop), 'B' (bulge), 'M' (multi-loop), and'S' - 32- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs a) .((((..(((. .. ))) .. (( .. )))).)). (U)««(U)(U)««U)(U)(U)p)p)p)(U)(U)«(U)(U)p)p)p)p)(u)p)p)(U) b) (U)«(U2)«U3)P3)(U2)«U2)P2)P2)(U)P2)(U) c) « (H) (H)K)B) «««H)S)«H)S)K)S)B)S) «««(H)S)«H)S)K)S)B)S)E) d) «««(H3)S3)«H2)S2)K4)S2)B1)S2)E2) e) «H)(H)K) a) (UU)(p(p(p(p(UU)(UU)(p(p(p(UU) (UU) (UU)p)p)p) (UU)(UU) (p(p(UU)(U ... b) (UU)(P2(P2(U2U2)(P2(U3U3)P3) (U2U2) (P2(U2U2)P2)P2)(UU)P2 )(UU) c) (B(K(HH)(HH)K)B) (S(B(S(K(S(HH)S)(S(HH)S)K)S)B)S) (E(S(B(S(K(S(HH)S)(S(HH)S)K)S)B)S)E) d) (E2(S2(B1(S2(K4(S3(H3)S3)«H2)S2)K4)S2)B1)S2)E2) e) (K(HH)(HH)K) Figure B.l: Linear representations of secondary structures used by the Vienna RNA package. Above: Tree representations of secondary structures. a) Full structure: the first line shows the more convenient condensed notation which is used by our programs; the second line shows the rather clumsy expanded notation for completeness, b) HIT structure, c) different versions of coarse grained structures: the second line is exactly Shapiro's representation, the first line is obtained by neglecting the stems. Since each loop is closed by a unique stem, these two lines are equivalent. The third line is an extension taking into also the external digits. d) weighted coarse structure, e) branching structure. Below: The corresponding tree in the notation used for the output of the stringtype alignments. Virtual roots are not shown here. The program RNAdistance accepts any of the above tree representations, the string-type representations occur on output only. (stack). We extend his alphabet by an extra letter for external elements 'E'. Again the corresponding trees may be weighted by the number of unpaired bases or base pairs constituting them. An even more coarse grained representation considers the branching structure only. It is obtained by considering the hairpins and multiloops only (Hofacker et al. 1993). All tree representations (except for the condensed dot-bracket form) can be encapsulated into a virtual root (labeled 'R'), see also the example session in figure 6. In order to discriminate the alignments produced by tree-editing from the alignment obtained by generalized string editing we put the label on both - 33- HOFACKER ET AL.: FAST FOLDING AND COMPARISON OF RNAs sides in the latter case. Appendix C: List of Executables RNAfold [-p[oJJ [-T tempJ [-4J [-noGuJ [-0 o_sotJ [-CJ RNAdistanco [-D[fhwcFHlIcJJ [-X[plmlflcJJ [-sJ [-B [filoJJ RNApdist [-XpmfcJ [-B [filoJJ [-T tempJ [-4J [-noGuJ [-0 o_sotJ RNAhoat [-Tmin t1J [-Tmax t2J [-h stepsizoJ [-m ipointsJ [-4J [-noGuJ [-0 o_sotJ RNAinvorso [-F [mpJ J [-a ALPHABETJ [-R [ropoatsJJ [-T tempJ [-4J [-noGuJ RNAoval [-T tempJ [-4J [-0 o_sotJ [-0 o_sotJ Appendix D: Availability of Vienna RNA Package Vienna RNA package is public domain software. It can be obtained by anonymous ftp from the server ftp.itc.univie.ac.at, directory /pub/RNA as ViennaRNA-1. 03. tar .Z. The package is written in ANSI C and is known to compile on SUN SPARC Stations, IBM-RS/6000, HP-720, and Silicon Graphics Indigo. For comments and bug reports please send e-mail messages to ivo0itc.univie.ac.at. The parallelized version of the minimum free energy folding is not part of the package. It can be obtained upon request from the authors. - 34- Vienna RNA Package A Library for folding and comparing RNA secondary structures Copyright ©1993 Ivo Hofacker and Peter Stadler Chapter 1: Folding Routines 1 1. Folding Routines 1.1 Minimum free Energy Folding The library provides a fast dynamic programming minimum free energy folding algorithm as described by M. Zuker, P. Stiegler (1981) (see Chapter 5 [References], page 13). The energy parameters are taken from Turner et al. (1988) and Jaeger et al. (1989). If you want to try other energy parameters look at energy_par.h in the source files. Associated functions are: • float fold(char *sequence, char *structure) folds the sequence and returns the minimum free energy in Kcal/Mol; structure contains the mfe structure in bracket notation (see Section 2.1 [notations], page 5). The structure must be at least strlen(sequence) chars long. If fold_constrained (see Section 1.4 [Fold Params], page 3) is 1, the structure string is interpreted on input as a list of constraints for the folding. The characters" I x < > " mark bases that are paired, unpaired, paired upstream, or downstream, respectively; matching brackets" ( ) " denote base pairs, dots"." are used for unconstrained bases. • float energy_of_struct(char *sequence, char *structure) calculates the energy of structure on the sequence • void initialize_fold(int length) allocates memory for folding sequences not longer than length; sets up pairing matrix and energy parameters. Has to be called before the first call to fold (). • void free_arrays (void) frees the memory allocated by initialize_fold. • void update30ld_params(void) call this to recalculate the pair matrix and energy parameters after a change in folding parameters like temperature (see Section 1.4 [Fold Params], page 3). Prototypes for these functions are declared in fold.h. 1.2 Partition Function Folding Instead of the minimum free energy structure the partition function of all possible structures and from that the pairing probability for every possible pair can be calculated, using again a dynamic programming algorithm described by J.S. McCaskill (1990) (see Chapter 5 [References], page 13). The following functions are provided. Chapter 1: Folding Routines 2 • float pf_fold(char *sequence, char *structure) calculates the partition function of sequence and returns the free energy of the ensemble in Kcal/Mol. If structure is not a NULL pointer on input, it contains on return a string consisting of the letters " . , I { } ( ) " denoting bases that are essentially unpaired, weakly paired, strongly paired without preference, weakly upstream (downstream) paired, or strongly up- (down-)stream paired bases, respectively. Unless do_backtrack has been set to 0 (see Section 1.4 [Fold Params], page 3), pr[iindx[i]-j] contains the probability that bases i and j pair. • void init_pf_fold(int length) allocates memory for folding sequences not longer than length; sets up pairing matrix and energy parameters. Has to be called before the first call to pLfoldO. • void free_pf _arrays (void) frees the memory allocated by init_pf_foldO. • void update_pLparams(int length) Call this function to recalculate the pair matrix and energy parameters after a change in folding parameters like temperature (see Section 1.4 [Fold Params], page 3). Constrained folding has not been implemented for the partition function fold. Prototypes for these functions are declared in part_func .h. 1.3 Inverse Folding We provide two functions that search for sequences with a given structure, thereby inverting the folding routines. • float inverse_fold(char *start, char *target) searches for a sequence with minimum free energy structure target, starting with sequence start. It returns 0 if the search was successful, otherwise a structure distance to target is returned. The found sequence is returned in start. If give_up is set to 1, the function will return as soon as it is clear that the search will be unsuccessful, this speeds up the algorithm if you are ouly interested in exact solutions. • float inverse_pf_fold(char *start, char *target) searches for a sequence with maximum probability to fold into structure target using the partition function algorithm. It returns -kT log(p) where p is the frequency of target in the ensemble of possible structures. This is usually much slower than inverse_foldO. The global variable char symbolset [] contains the allowed bases. Chapter 1: Folding Routines 3 Prototypes for these functions are declared in inverse .h. 1.4 Global Variables for the Folding Routines The following global variables change the behaviour the folding algorithms or contain additional information after folding. • int noGU do not allow GU pairs if equal 1; default is O. • int no_closingGU if 1 allow GU pairs only inside stacks, not as closing pairs; default is O. • int tetra_loop include special stabilising energies for some tetra loops; default is 1. • int energy_set if 1 or 2: fold sequences from an artificial alphabet ABCD ... , where A pairs B, C pairs D, etc. using either GC (1) o~ AU parameters (2); default is O. • float temperature rescale energy parameters to a temperature of temperature C. Default is 37C. You have to call the update{*}paramsO functions after changing this parameter. • struct bond { int i, j ;} *base_pair Contains a list of base pairs after a call to foldO. base_pair [0] . i contains the total number of pairs. • float *pr contains the base pair probability matrix after a call to pLfoldO. • int *iindx index array to move through pro The probability for base i andj to form a pair is in pr[iindx[i]-j]. • float pf _scale a scaling factor used by pLfoldO to avoid overflows. Should be set to exp((FjkT)jlength) where F is an estimate for the ensemble free energy, for example the minimum free energy. You must call update_pLparamsO or init_pLfoldO after changing this parameter. If pLscaie is -1 (the default) , an estimate will be provided automatically when calling init_pLfoldO or update_pLparams O. • int fold_constrained calculate constrained minimum free energy structures. See Section 1.1 [mfe Fold], page 1, for more information. Default is 0; • int do _backtrack if 0, do not calculate pair probabilities in pLfoldO; this is about twice as fast. Default is 1. Chapter 1: Folding Routines 4 • char backtrack_type only for use by inverse3oldO; 'C': force (l,N) to be paired, 'M' fold as if the sequence were inside a multi-loop. Otherwise the usual mfe structure is computed. include fold_vars.h if you want to change any of these variables form their defaults. Chapter 2: Parsing and Comparing of Structures 5 2. Parsing and Comparing of Structures j 2.1 Representations of Secondary Structures The most simple way to represent a secondary structure is the "bracket notation", where matching brackets symbolise base pairs and unpaired bases are shown as dots. Alternatively, one may use two types of node labels, 'P' for paired and 'u' for unpaired; a dot is then replaced by '(U)', and each closed bracket is assigned an additional identifier 'P'. We call this the expanded notation. In (Fontana et al. 1993) a condensed representation of the secondary structure is proposed, the so-called homeomorphically irreducible tree (HIT) representation. Here a stack is represented as a single pair of matching brackets labelled 'P' and weighted by the number of base pairs. Correspondingly, a contiguous strain of unpaired bases is shown as one pair of matching brackets labelled 'u' and weighted by its length. Generally any string consisting of matching brackets and identifiers is equivalent to a plane tree with as many different types of nodes as there are identifiers. Bruce Shapiro (1988) proposed another representation, which, however, does not retain the full information of the secondary structure. He represents the different structure elements by single matching brackets and labels them as 'H' (hairpin loop), 'I' (interior loop), 'B' (bulge), 'M' (multiloop), and'S' (stack). We extend his alphabet by an extra letter for external elements 'E'. Again these identifiers may be followed by a weight corresponding to the number of unpaired bases or base pairs in the structure element. All tree representations (except for the dot-bracket form) can be encapsulated into a virtual root (labeled 'R'), see the example below. The following example illustrates the different linear tree representations used by the package. All lines show the same secondary structure. .«« .. «( ... ))) .. « .. )))).)). (U)««(U)(U)««U)(U)(U)P)P)P)(U)(U)«(U)(U)P)P)P)P)(U)P)P)(U) b) (U)«(U2)«U3)P3)(U2)«U2)P2)P2)(U)P2)(U) c) « (H)(H)M)B) «««H)S)«H)S)M)S)B)S) «««(H)S)«H)S)M)S)B)S)E) d) ««««H3)S3)«H2)S2)M4)S2)Bl)S2)E2)R) a) Above: Tree representations of secondary structures. a) Full structure: the first line shows the more convenient condensed notation which is used by our programs; the second line shows the rather clumsy expanded notation for completeness, b) HIT structure, c) different versions of coarse grained structures: the second line is exactly Shapiro's representation, the first line is obtained by neglecting the stems. Since each loop is closed by a unique stem, these two lines are equivalent. The Chapter 2: Parsing and Comparing of Structures 6 third line is an extension taking into account also the external digits. d) weighted coarse structure, this time including the virtual root. For the output of aligned structures from string editing, different representations are needed, where we put the label on both sides. The above examples for tree representations would then look like: a) (UU) (P(P(P(P(UU) (UU) (P(P(P(UU) (UU) (UU)P)P)P) (UU) (UU) (P(P(UU) (U ... b) (UU)(P2(P2(U2U2)(P2(U3U3)P3)(U2U2)(P2(U2U2)P2)P2)(UU)P2)(UU) c) (B(M(HH) (HH)M)B) (S(B(S(M(S(HH)S) (S(HH)S)M)S)B)S) (E(S(B(S(M(S(HH)S) (S(HH)S)M)S)B)S)E) d) (R(E2(S2(Bl(S2(M4(S3(H3)S3)((H2)S2)M4)S2)Bl)S2)E2)R) Aligned structures additionally contain the gap character' _'. 2.2 Parsing and Coarse Graining of Structures Several functions are provided for parsing structures and converting to different representations. • char *expand_Full(char *full) converts the full structure from bracket notation to the expanded notation including root. • char *b2HIT(char *full) converts the full structure from bracket notation to the HIT notation including root. • char *b2C(char *full) converts the full structure from bracket notation to the a coarse grained notation using the 'H' 'B' 'I' 'M' and 'R' identifiers. • char *b2Shapiro(char *full) converts the full structure from bracket notation to the weighted coarse grained notation using the 'H' 'B' 'I' 'M' 's' 'E' and 'R'identifiers. • char *expand_Shapiro(char *coarse) inserts missing'S' identifiers in unweighted coarse grained structures as obtained from b2C (). • char *add_root(char *any) adds a root to an un-rooted tree in any except bracket notation. • char *unexpand_Full(char *expanded) restores the bracket notation from an expanded full or HIT tree, that is any tree using only identifiers 'u' 'P' and 'R'. Chapter 2: Parsing and Comparing of Structures 7 • char *unweight (char *expanded) strip weights from any weighted tree. All the above functions allocate memory for the strings they return. • void unexpand_aligned_F(char *align [2] ) converts two aligned structures in expanded notation as produced by tree_edit_distanceO function back to bracket notation with ' _' as the gap character. The result overwrites the input. • void parse_structure (char *full) Collects a statistic of structure elements of the full structure in bracket notation, writing to the following global variables. int loop_size [] contains a list of all loop sizes. loop_size [0] contains the number of external bases. int loop_degree [] contains the corresponding list of loop degrees. int helix_size [] contains a list of all stack sizes. int loops contains the number ofloops ( and therefore of stacks ). int pairs contains the number of base pairs in the last parsed structure. int unpaired contains the number of unpaired bases. Prototypes for the above functions can be found in RNAstruct. h. 2.3 Distance Measures We can now define distances between structures as edit distances between trees or their string representations. Given a set of edit operations and edit costs, the edit distance is given by the minimum sum of the costs along an edit path converting one object into the other. The edit operations used by us are insertion, deletion and replacement of nodes. String editing does not pay attention to the matching of brackets, while in tree editing matching brackets represent a single node of the tree. Tree editing is therefore usually preferable. String edit distances are always smaller or equal to tree edit distances. For full structures we use a cost of 1 for deletion or insertion of an unpaired base and 2 for a Chapter 2: Parsing and Comparing of Structures 8 base pair. Replacing an unpaired base for a pair incurs a cost of 1. Two cost matrices are provided for coarse grained structures: /* { { { { { { { /* Null, H, B, I, M, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 1, 2, 2, 2, 1, 0, 2, 2, 2, 2, 2, 0, 1, INF, INF, INF, INF, 1, INF, INF, INF, INF, S, 1, INF, INF, INF, INF, 0, INF, Null, H, B, I, M, S, 0, 100, 5, 5, 75, 5, { 100, 0, 8, 8, 8, INF, { 5, 8, 0, 3, 8, INF, { 5, 8, 3, 0, 8, INF, { 75, 8, 8, 8, 0, INF, { INF, INF, INF, INF, 5, 0, { 5, INF, INF, INF, INF, INF, { E n, INF}, INF}, INF}, INF}, INF}, O}, E 5}, INF}, INF}, INF} , INF}, INF}, O}, */ /* /* /* /* /* /* /* */ /* /* /* /* /* /* /* E replaced replaced replaced replaced replaced replaced replaced */ */ */ */ */ */ */ Null H B I M S E replaced replaced replaced replaced replaced replaced replaced */ */ */ */ */ */ */ Null H B I M s The lower matrix uses the costs given in Shapiro (1990). All distance functions use the following global variables: • int cost_matrix if 0, use the default cost matrix (upper matrix in example); otherwise use Shapiro's costs (lower matrix). • int edit_backtrack produce an alignment of the two structures being compared by tracing the editing path giving the minimum distance. • char *aligned_line [2J contains the two aligned structures after a call to one of the distance functions with edit_backtrack set to 1. See Section 2.1 [notations], page 5, for details on the representation of structures. 2.3.1 Functions for Tree Edit Distances • Tree *make_treeCchar *xstruc) constructs a Tree ( essentially the postorder list) of the structure xstruc, for use in tree_edit_distanceO. xstruc may be any rooted structure representation. Chapter 2: Parsing and Comparing of Structures 9 • float tree_edit_distance(Tree *Ti, Tree *T2) calculates the edit distance of the two trees Ti and T2. • void free_tree (Tree *t) frees the memory allocated for t. Prototypes for the above functions can be found in treedist .h. 2.3.2 Functions for String Alignment • swString *Make_swString(char *xstruc) converts the structure xstruc into a format suitable for string_edit_distanceO. • float string_edit_distance(swString *Ti, swString *T2) calculates the string edit distance of Ti and T2. Prototypes for the above functions can be found in stringdist. h. 2.3.3 Functions for Comparison of Base Pair Probabilities • float **Make_bp_profile(int length) reads the base pair probability matrixpr (see Section 1.4 [Fold Params], page 3) and calculates a profile, I.e. the probabilities of being unpaired, upstream, or downstream paired, respectively. The returned array is suitable for profile_edit_distance. • float profile_edit_distance(float **Ti, float **T2) calculates an alignment distance of the two profiles Ti and T2. • void free_profile(float **T) frees the memory allocated for the profile T. Prototypes for the above functions can be found in profiledist.h. Chapter 3: Utilities 10 3. Utilities The following utilities are used and therefore provided by the library: • void PS_dot_plot(char *sequence, char *filename) reads base pair probabilities produced by pf_foldO from the global array pr and the pair list base_pair produced by foldO and produces a postscript "dot plot" that is written to filename. The "dot plot" represents each base pairing probability by a square of corresponding area in a upper triangle matrix. The lower part of the matrix contains the minimum free energy structure. • void PS_rna_plot(char *sequence, char *filename) reads the pair list base_pair produced by foldO and produces a conventional secondary structure graph in postscript, and writes it to filename. • char *random_string(int 1, char *symbols) generates a "random" string of characters from symbols with length 1. • int hamming(char *sl, char *s2) returns the number of positions in which sl and s2 differ, the so called "Hamming" distance. • char *time_stampO returns a string containing the current date in the format "Fri Mar 19 21:10:57 1993". • void nrerror(char *message) writes message to stderr and aborts the program. • double umO returns a pseudo random number in the range [0 ..1[, using a 48bit linear congruential method. • unsigned short xsubi [3] is used by urnO. These should be set to some random number seeds before the first call to umO. • int int_urn(int from, int to) generates a pseudo random integer in the range [from, to]. • void *space(unsigned int size) returns a pointer to size bytes of allocated and 0 initialised memory; aborts with an error if memory is not available. • char *get_line(FILE *fp) reads a line of arbitrary length from the stream *fp, and returns a pointer to the resulting string. The necessary memory is allocated and should be released using free 0 when the string is no longer needed. Chapter 4: A Small Example Program 11 4. A Small Example Program This program folds two sequences using the mfe and partition function algorithms and calculates the tree edit and profile distance of the resulting structures and structure ensembles. #include #include #include #include #include #include #include #include #include #include #include <stdio.h> lI u tils.h" "fold_vars.hll "fold.hll ll Il part_func.h Ilinverse.h ll lIRNAstruct .h ll lIdist_vars.hll Iltreedist.h ll "stringdist.h" "profiledist .h" mainO { char *seql="CGCAGGGAUACCCGCG", *seq2="GCGCCCAUAGGGACGC", *struct1, *struct2, *xstruc; float el, e2, tree_dist, string_dist, profile_dist; Tree *Tl, *T2; sv8tring *81, *82; float **pfl, **pf2; temperature = 30.; /* set temperature *before* initialising */ initialize_fold(strlen(seql)); structl = (char *) space(sizeof(char)*(strlen(seql)+l)); el = fold(seql, structl); struct2 = (char *) space(sizeof(char)*(strlen(seq2)+1)); e2 = fold(seq2, struct2); free_arrays () ; xstruc = expand_Full(structl); Tl = make_tree(xstruc); 81 = Make_sv8tring(xstruc); free(xstruc); xstruc = expand_Full(struct2); T2 = make_tree(xstruc); 82 = Make_sv8tring(xstruc); free(xstruc); edit_backtrack = 1; Chapter 4: A Small Example Program 12 tree_dist = tree_edit_distance(T1, T2); free_tree(T1); free_tree(T2); unexpand_aligned_F(aligned_line); printf (nY.s\nY.s 1.3. 2f\nn, aligned_line [OJ, aligned_line [1] , tree_dist) ; string_dist = string_edit_distance(51, 52); free(51); free(52); printf(nY.s\nY.s y'3.2f\nn, aligned_line [OJ , aligned_line[1J, string_dist); init_pf_fold(strlen(seq1)); e1 = pf_fold(seq1, struct1); pf1 = Make_bp_profile(strlen(seq1)); e2 = pf_fold(seq2, struct2); pf2 = Make_bp_profile(strlen(seq2)); profile_dist = profile_edit_distance(pf1, printf (nY.s\nY.s 1.3. 2f\nn, aligned_line [OJ, aligned_line [1], profile_dist); free_profile(pf1); free_profile(pf2); } pf2); Chapter 5: References 13 5. References M. Zuker, P. Stiegler (1981) Optimal computer folding of large RNA sequences using thermodynamic and auxiliary information, Nucl Acid Res 9: 133-148 J.S. McCaskill (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structures, Biopolymers 29: 1105-1119 D.H. Turner, N. Sugimoto and S.M. Freier (1988) RNA structure prediction, Ann Rev Biophys Biophys Chern 17: 167-192 J.A. Jaeger, D.H. Turner and M. Zuker (1989) Improved predictions of secondary structures for RNA, Proc. Natl. Sci. USA 86: 7706-7710 (1989) B.A. Shapiro, (1988) An algorithm for comparing multiple RNA secondary structures, CABIOS 4, 381-393 B.A. Shapiro, K. Zhang (1990) Comparing multiple RNA secondary structures using tree comparison, CABIOS 6, 309-318 W. Fontana, D.A.M. Konings, P.F. Stadler, P. Schuster (1993) Statistics of RNA secondary structures, Biopolymers in press - W. Fontana, P.F. Stadler, E.G. Bornberg-Bauer, T. Griesmacher, LL. Hofacker, M. Tacker, P. Tarazona, E.D. Weinberger, P. Schuster (1993) RNAfolding and combinatory landscapes, Phys. Rev. E 47: 2083-2099 D. Adams (1979) The hitchhiker's guide to the galaxy, Pan Books, London