From Sequences to Shapes and Back: A Case Study in RNA Secondary Structures

PETER SCHUSTER (1, 2, 3), WALTER FONTANA (3), PETER F. STADLER (2, 3) AND IVO L. HOFACKER (2)
1 Institut für Molekulare Biotechnologie, Beutenbergstrasse 11, PF 100813, D-07708 Jena, Germany
2 Institut für Theoretische Chemie, Universität Wien, Austria
3 Santa Fe Institute, Santa Fe, U.S.A.

Source: Proceedings: Biological Sciences, Vol. 255, No. 1344 (Mar. 22, 1994), pp. 279-284
Published by: The Royal Society
Stable URL: http://www.jstor.org/stable/49949

SUMMARY

RNA folding is viewed here as a map assigning secondary structures to sequences. At fixed chain length the number of sequences far exceeds the number of structures. Frequencies of structures are highly non-uniform and follow a generalized form of Zipf's law: we find relatively few common and many rare ones. By using an algorithm for inverse folding, we show that sequences sharing the same structure are distributed randomly over sequence space.
All common structures can be accessed from an arbitrary sequence by a number of mutations much smaller than the chain length. The sequence space is percolated by extensive neutral networks connecting nearest neighbours folding into identical structures. Implications for evolutionary adaptation and for applied molecular evolution are evident: finding a particular structure by mutation and selection is much simpler than expected and, even if catalytic activity should turn out to be sparse in the space of RNA structures, it can hardly be missed by evolutionary processes.

1. INTRODUCTION

Folding sequences into structures is a central problem in biopolymer research. Both robustness and accessibility of structures, as functions of mutational change in the underlying sequence, are crucial to both natural and applied molecular evolution. Test-tube evolution experiments are based on properties of RNA molecules: as sequences they are genotypes, and as spatial structures they are phenotypes (Spiegelman 1971; Biebricher 1983). Our concern is the mapping from RNA sequences into structures, being the simplest, and the only tractable, example of a genotype-phenotype mapping.

An RNA sequence is a point in the space of all $4^n$ sequences with fixed length n. This space has a natural metric induced by point mutations interconverting sequences, known as the Hamming distance (Hamming 1950, 1986). The folding process considered here maps an RNA sequence into a secondary structure (figure 1a) minimizing free energy. A secondary structure is tantamount to a list of Watson-Crick type and GU base pairs, and can be represented as a tree graph (figure 1b). This emphasizes the combinatorial nature of secondary structures and allows for a canonical distance measure between structures (Tai 1979).
Assuming elementary edit operations with pre-defined costs, such as deletion, insertion and relabelling of nodes, the distance between two trees is given by the smallest sum of the edit costs along any path that converts one tree into the other (Sankoff & Kruskal 1983).

Proc. R. Soc. Lond. B (1994) 255, 279-284. Printed in Great Britain.

An approximate upper bound on the number of minimum free-energy structures (of fixed chain length n) can be obtained along the lines devised by Stein & Waterman (1978). Counting only those planar secondary structures that contain hairpin loops of size three or more (steric constraint), and that contain no isolated base pairs (stacks of two or more pairs are essentially the only stabilizing elements), one finds: $S_n \approx 1.4848 \times n^{-3/2}\,(1.8488)^n$, which is consistently smaller than the number of sequences. Folding can thus be viewed as a map between two metric spaces of combinatorial complexity, a sequence space and a shape space. (The notion of shape space was originally used in theoretical immunology in a similar context by Perelson & Oster (1979).) 'Shape' refers to a discretized (and hence coarse-grained) structure representation, such as the secondary structures or the tree graphs used here. The notion of secondary structure is but one among a spectrum of possible levels of resolution that can be used to define shape. It discards atomic coordinates, as well as the relative spatial orientation of the structural elements, taking into account only their number, size and relative connectedness. Nevertheless, secondary structure is a major component of whatever turns out to be an adequate shape definition for RNA: it covers the dominant part of the three-dimensional folding energies, very often it can be used successfully in the interpretation of function and reactivity, and it is frequently conserved in evolution (Sankoff et al.
1978; Konings & Hogeweg 1989; Le & Zuker 1990), sometimes together with a few tertiary interactions (Cech 1988).

Figure 1. (a) A secondary structure on a sequence is any pattern of base pairs such that no bases inside a loop pair to bases outside it. Such a structure can be uniquely decomposed into structural elements that are: (i) base pair stacks; (ii) loops differing in size (number of unpaired bases) and branching degree, i.e. hairpin loops (degree one), internal loops (degree two or more); and (iii) bases which are not part of a stack or a loop, termed external (freely rotating joints and unpaired ends). Each stack or loop element contributes additively to the overall free energy of the structure according to empirically determined parameters that depend on the nucleotide sequence. A minimum free-energy structure is constructed according to an algorithm proposed by Zuker & Stiegler (1981) and Zuker & Sankoff (1984). (b) A secondary structure graph (a) is equivalent to an ordered rooted tree. An internal node (black) of the tree corresponds to a base pair (two nucleotides), a leaf node (white) corresponds to one unpaired nucleotide, and the root node (black square) is a virtual parent to the external elements. Contiguous base pair stacks translate into 'ropes' of internal nodes, and loops appear as bushes of leaves. Recursively traversing a tree by first visiting the root, then visiting its subtrees in left to right order, and finally visiting the root again, assigns numbers to the nodes in correspondence to the 5'-3' positions along the sequence. (Internal nodes are assigned two numbers reflecting the paired positions.)
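The comparison between the number of structures and the number of sequences can be checked numerically. The following is a minimal Python sketch, assuming the asymptotic estimate in the text reads S_n ≈ 1.4848 × n^(-3/2) × (1.8488)^n. This reading reproduces the value of about 7 × 10^23 structures for n = 100 quoted in the caption of figure 2. The function names are illustrative and do not come from the paper.

```python
from math import log10


def structures_estimate(n: int) -> float:
    """Assumed asymptotic estimate S_n ~ 1.4848 * n**(-3/2) * 1.8488**n."""
    return 1.4848 * n ** -1.5 * 1.8488 ** n


def sequences(n: int) -> int:
    """Number of RNA sequences of length n over the alphabet {A, U, G, C}."""
    return 4 ** n


for n in (30, 100):
    s, q = structures_estimate(n), sequences(n)
    # log10(q / s) is the order of magnitude of sequences per structure.
    print(f"n={n}: structures ~ {s:.2e}, sequences = {q:.2e}, "
          f"sequences per structure ~ 10^{log10(q / s):.1f}")
```

For n = 100 the estimate gives roughly 7 × 10^23 structures against 4^100 ≈ 1.6 × 10^60 sequences, so on average a very large number of sequences must share each structure.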
This suggests that many of the relevant intermolecular interactions that collectively set a natural scale for shape are indeed strongly influenced by the secondary structure. The observation, then, is that, at least in the present case, the shape space is considerably smaller than the sequence space. (We remark that this is also true for protein models on lattices.)

2. FREQUENCIES OF SHAPES AND INVERSE FOLDING

Frequencies of occurrence for individual shapes in sequence space were obtained from large samples derived by folding random sequences of fixed chain length. Ranking according to decreasing frequencies yields a distribution which obeys a generalized Zipf's law (figure 2). We are thus dealing with relatively few common shapes and many rare ones.

How are the sequences which fold into the same shape distributed in sequence space? This distribution is evaluated with a heuristic inverse folding procedure, aimed at devising sequences that fold into an arbitrary pre-defined target shape (Hofacker et al. 1993). The obvious first step is to construct a compatible test sequence with nucleotide assignments such that the target shape is indeed a possible secondary structure, although typically not a minimum free-energy one. We choose at random among the many compatible sequences. The next step is to decompose the minimum energy structure on the chosen test sequence into substructures, and to mutate by trial and error the corresponding subsequences. When the individual substructures are as in the target, the entire sequence is reassembled. The procedure stops if the reassembled sequence folds into the target shape. This happens in about 50 % of the cases. Several sequences that fold into the same structure are sampled by starting the procedure with different compatible sequences. The average number of mutations that converted a random compatible sequence of chain length n = 100 into one with the desired target shape was 7.2.
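The first step of the inverse folding procedure, drawing a random sequence that is compatible with a target shape, can be sketched as follows. This is an illustration only, not the authors' implementation. It assumes the target is written in dot-bracket notation (matching parentheses for base pairs, dots for unpaired bases), a convention that the text itself does not use, and all function names are hypothetical. The subsequent folding and trial-and-error mutation steps are not shown.

```python
import random

# The six ordered pair classes named in the paper: AU, GC, GU and inversions.
ALLOWED_PAIRS = [("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")]


def pair_table(structure: str) -> list[tuple[int, int]]:
    """Return the (i, j) index pairs of a dot-bracket string (assumed notation)."""
    stack, pairs = [], []
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), j))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def random_compatible_sequence(structure: str, rng: random.Random) -> str:
    """Draw a random sequence on which `structure` is a possible pairing pattern."""
    # Unpaired positions get any base; paired positions are overwritten below.
    seq = [rng.choice("ACGU") for _ in structure]
    for i, j in pair_table(structure):
        seq[i], seq[j] = rng.choice(ALLOWED_PAIRS)
    return "".join(seq)


def is_compatible(seq: str, structure: str) -> bool:
    """Check that every base pair of `structure` can be formed on `seq`."""
    return all((seq[i], seq[j]) in ALLOWED_PAIRS
               for i, j in pair_table(structure))
```

A call such as `random_compatible_sequence("((((...))))..((...))", random.Random(0))` returns one compatible sequence. As the text notes, the target is then a possible structure for that sequence but typically not its minimum free-energy structure.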
The resulting ensemble of compatible sequences that fold into a pre-defined target has been analysed for the target being the secondary structure of tRNA^Phe and for three randomly constructed examples. In each case about 1000 sequences were derived by the inverse folding algorithm. The distribution of pairwise distances is not distinguishable from the one expected for random compatible sequences. The properties of the sequence sample as seen by statistical geometry (Eigen et al. 1988a) and split-decomposition (Bandelt & Dress 1992) yield the same result: sequences folding into the same structure are randomly distributed in the space of compatible sequences.

Figure 2. The frequency distribution of RNA secondary structures. Shapes are ranked by their frequencies. The particular example shown here deals with the loop structures (Shapiro & Zhang 1990) of $10^5$ RNA molecules of chain length 100 which are derived from secondary structures by further coarse graining that eliminates all details concerning stack lengths and loop sizes. The diagram covers 97 % of the total frequency. The frequencies follow a generalized form of Zipf's law: $f(x) = a(b + x)^{-c}$, with x being the rank of a shape and f(x) its frequency. Parameter values of the best fit (thin curve) are a = 1.25, b = 71.2 and c = 1.73. The frequency distribution of full secondary structures is essentially the same, as shown in the insert for chain length 30. Computation of the distribution for longer chains is hardly possible as the number of structures exceeds by far the available capacities (there are about $7 \times 10^{23}$ full secondary structures of chain length n = 100).

3. STRUCTURE DENSITY SURFACES

Generalizing the previous question we ask how the possible shapes are distributed over the possible sequences. One insight is provided by considering the probability density (Fontana et al. 1993a, b) $P(t \mid h)$ of two structures being at (tree) distance t, given that the underlying sequences are at (Hamming) distance h. This structure density surface (SDS) shows how the distribution of structure differences changes as the sequences become more and more uncorrelated with increasing Hamming distance from the reference (figure 3 presents the SDS for sequences of chain length n = 100). Three observations are immediate: (i) although for very small Hamming distances (h = 1, 2, 3) the most probable structures are identical or very similar, there is none the less some probability that even a single mutation substantially alters the structure; (ii) beyond distance h = 3, identical or even closely related structures are extremely unlikely; and (iii) in the range 15 < h < 20, the density becomes independent of h, thus approaching essentially what is expected for a sample of randomly drawn sequences (h ≈ 75). The latter suggests that the structures of a reference sequence and its mutants at distances between 15 and 20 or larger are effectively uncorrelated.

Figure 3. (a) The structure density surface (SDS) for RNA sequences of length n = 100 (upper). This surface was obtained as follows: (i) choose a random reference sequence and compute its structure; (ii) sample randomly ten different sequences in each distance class (Hamming distance 1-100) from the reference sequence, and bin the distances between their structures and the reference structure. This procedure was repeated for 1000 random reference sequences. Convergence is remarkably fast; no substantial changes were observed when doubling the number of reference sequences. This procedure conditions the density surface to sequences with base composition peaked at uniformity, and does not, therefore, yield information about strongly biased compositions. (b) Contour plot of the SDS. [Plot axis: structure distance.]
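Step (ii) of the sampling procedure described in the caption of figure 3 requires mutants at a prescribed Hamming distance from a reference sequence. The following is a minimal sketch with hypothetical function names. It assumes a uniform choice of mutated positions and replacement bases, since the paper does not specify the sampling details, and it does not enforce that the sampled mutants within a class are distinct. Folding the mutants and binning the tree distances are left out.

```python
import random

ALPHABET = "ACGU"


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mutant_at_distance(reference: str, h: int, rng: random.Random) -> str:
    """Return a sequence at Hamming distance exactly h from `reference`."""
    if not 0 <= h <= len(reference):
        raise ValueError("h must lie between 0 and the sequence length")
    seq = list(reference)
    # Choose h distinct positions and replace each base by a different one.
    for pos in rng.sample(range(len(seq)), h):
        seq[pos] = rng.choice([b for b in ALPHABET if b != seq[pos]])
    return "".join(seq)


def sample_distance_classes(reference: str, per_class: int,
                            rng: random.Random) -> dict[int, list[str]]:
    """Map each Hamming distance h = 1..n to `per_class` random mutants."""
    n = len(reference)
    return {h: [mutant_at_distance(reference, h, rng) for _ in range(per_class)]
            for h in range(1, n + 1)}
```

To estimate the surface, one would fold every mutant, compute its tree distance t to the reference structure, and accumulate counts in a two-dimensional histogram indexed by (h, t).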
The lack of correlation at Hamming distances of 15 to 20 or more suggests that memory of the reference structure is sufficiently lost to allow the mutants at that distance to acquire any frequent minimum energy structure, at least in its essential features. From the SDS the complete structure autocorrelation function can be recovered (Fontana et al. 1993a). This function is to a reasonable approximation a single decaying exponential with a characteristic length, $l = 7.6$ in the present case (chain length n = 100). From figure 3 it is seen that this corresponds essentially to the distance at which the dominant peak resulting from identical or very similar structures has disappeared.

4. SHAPE SPACE COVERING

Combination of the previous results (showing the existence of relatively few common shapes which are minimum free-energy structures for sequences randomly distributed in sequence space) with the information from the SDS (showing the existence of a transition from local to global features) provides strong evidence for the existence of a neighbourhood (a high-dimensional ball) around every random sequence that contains sequences whose structures include almost all common shapes.

To verify the prediction of a characteristic neighbourhood covering almost all common shapes we did a computer experiment. A target sequence is chosen at random. A second random sequence serves as an initial trial sequence, and its structure as a reference structure. Next we search for a nearest neighbour of the trial sequence that folds into the reference structure but lies closer to the target. If such a sequence is found, it is accepted as the new trial sequence, and the procedure is repeated until no further approach to the target is possible. The final Hamming distance to the target is an upper bound for the minimum distance between two sequences folding into the reference structure and the structure of the target, respectively. The probability density of this upper bound to the closest approach distances determined for RNA molecules of chain length 100 is shown in figure 4 (open circles). It yields a mean value of 19.8. (It is remarkable that this value coincides with the critical Hamming distance at which we observe the change from local to global features in the SDS; analogous investigations on RNA molecules with only GC or only AU base pairs have shown that the precise agreement is not generally valid.)

We can also compute a lower bound for the mean value of the closest approach distance. The probability that two arbitrarily chosen bases of an RNA sequence can form a base pair is given by the number of pairings divided by the number of possible combinations of two bases: 6/16 = 3/8 (as we have six classes of base pairs: AU, GC, GU and inversions). The mean number of bases that have to be changed in a random sequence, to form a sequence which is compatible with the target structure (representing the lower bound), is obtained from the probability not to form a base pair by multiplication with the mean number of base pairs: $(1 - 3/8) \times \bar{n}_{BP} = (5/8)\,\bar{n}_{BP}$. For RNA molecules of chain length 100, the mean number of base pairs is 24.34 (Fontana et al. 1993a), and we obtain a mean Hamming distance of 15.2 for the lower bound. From the probability density of the number of base pairs, we derive a distribution of the lower bound also shown in figure 4 (open diamonds). The characteristic neighbourhood has a radius of $15 < h_c < 20$.

Figure 4. Neutral paths. A neutral path is defined by a series of nearest neighbour sequences that fold into identical structures. Two classes of nearest neighbours are admitted: neighbours of Hamming distance 1, which are obtained by single base exchanges in unpaired stretches of the structure, and neighbours of Hamming distance 2, resulting from base pair exchanges in stacks. Three probability densities of Hamming distances are shown, two of which, (i) and (iii), were obtained by searching for neutral paths in sequence space: (i) an upper bound for the closest approach of trial and target sequences (open circles) obtained as endpoints of neutral paths approaching the target from a random trial sequence (185 targets and 100 trials for each were used); (ii) a lower bound for the closest approach of trial and target sequences (open diamonds) derived from secondary structure statistics (Fontana et al. 1993a; see this paper, §4); and (iii) longest distances between the reference and the endpoints of monotonically diverging neutral paths (filled circles) (500 reference sequences were used). [Plot axis: length of neutral path, h.]

5. NEUTRAL PATHS THROUGH SEQUENCE SPACE

The structure of the RNA shape space over the sequence space is complemented by a second computer experiment. We search for neutral paths with monotonically increasing distance from a reference sequence. A neutral path ends when no sequence that forms the same structure is found among the nearest neighbours. The probability density of the lengths of these paths is shown in figure 4 (filled circles). The vast extension of the network of neutral paths came as a surprise: 21.7 % of all paths percolate the entire sequence space and end in a sequence which has not a single base in common with the reference. (The existence of extensive neutral networks meets a claim raised by Maynard-Smith (1970) for protein spaces that are suitable for efficient evolution.)

6. DISCUSSION

The existence of a ball with characteristic radius around any random sequence within which almost all common shapes are found (figure 5) is a robust phenomenon of the mapping from sequences into RNA secondary structures. It depends on the ratio of sequences to structures. Changes in the base-pairing alphabet, in particular the consideration of pure GC or pure AU sequences, may cause minor alterations that can be interpreted by much smaller values of this ratio as well as by differences in the topology of sequence space. Alphabet dependencies will be published elsewhere. The major features of the shape space structure depend on the generic properties of RNA folding, in particular on the non-local nature of base pairing, but they are insensitive to the empirical energy parameter sets used in folding algorithms, as well as to the distance measure between structures used in the SDS (essentially the same SDSs were obtained with a now superseded parameter set (Fontana et al. 1991), and also with a different structure distance measure (Hogeweg & Hesper 1984; Huynen et al. 1993)).

Figure 5. A sketch of the mapping from sequences into RNA secondary structures as derived here. Any random sequence is surrounded by a ball in sequence space which contains sequences folding into (almost) all common structures. The radius of this ball is much smaller than the dimension of sequence space. [Panel labels: sequence space, shape space.]

There are caveats to our approach.

1. We use a thermodynamic criterion for RNA structure formation, not one that mimics the kinetics of folding. This does not constitute a problem for short sequences up to a few hundred nucleotides.
Moreover, the number of possible kinetic structures is not entirely different from the number of thermodynamic structures, the principles of base pairing are the same, and thus the generic features of mappings of sequences into kinetic or thermodynamic structures will be essentially the same, too.

2. We consider only a single minimum free-energy structure for each sequence. Our approach can be carried over to ensembles comprising optimal and suboptimal foldings, as represented by partition functions (McCaskill 1990; Bonhoeffer et al. 1993). 'Shape', then, becomes a matrix of temperature-dependent base-pairing probabilities, and the concept of distance is changed accordingly. All qualitative features of the SDS remain essentially unchanged, and numerical corrections are in the range of 10 % (as, for example, in the case of correlation lengths (Bonhoeffer et al. 1993)).

3. We do not consider three-dimensional structure. Nevertheless, the secondary structure defines an informative scale of resolution. In addition, it constitutes an approximation to a coarse-grained spatial structure (current algorithms for the modelling of RNA three-dimensional structures start from secondary structures, and introduce a few tertiary interactions (Major et al. 1991, 1993)).

The consequences of our results for natural and artificial selection are immediate. We predict that there is no need to search systematically huge portions of the sequence space. In the particular example of RNA molecules of chain length 100, the characteristic ball contains some $10^{27}$ sequences, which is only a fraction of $10^{-33}$ of the entire sequence space. Almost all structures are within reach of a few mutations from a compatible sequence (average: 7.2), and even in reasonable proximity of any non-compatible random sequence (~ 18). The conclusion is thus that optimization of structures by evolutionary trial and error strategies is much simpler than is often assumed.
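The size of such a ball can be estimated by direct counting: the number of sequences within Hamming distance r of a given sequence of length n over a four-letter alphabet is the sum over k ≤ r of C(n, k) × 3^k. The short sketch below uses an illustrative function name. The radius behind the quoted ball size is not stated in the text, so several values in the range 15 to 20 are shown.

```python
from math import comb


def ball_size(n: int, radius: int, alphabet: int = 4) -> int:
    """Number of sequences within Hamming distance `radius` of a fixed sequence."""
    # Choose k mutated positions, each with (alphabet - 1) alternative bases.
    return sum(comb(n, k) * (alphabet - 1) ** k for k in range(radius + 1))


n = 100
total = 4 ** n  # size of the whole sequence space
for r in (15, 17, 20):
    size = ball_size(n, r)
    print(f"radius {r}: {size:.1e} sequences, fraction {size / total:.1e}")
```

Under these assumptions a radius of 17 gives on the order of 10^27 sequences, which is a vanishing fraction of the 4^100 ≈ 1.6 × 10^60 sequences of length 100.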
It provides further support to the idea of widespread applicability of molecular evolution (Eigen 1971; Eigen & Schuster 1979; Eigen et al. 1988b, 1989).

The existence of networks of neutral paths percolating the entire sequence space has strong implications for (molecular) evolution in nature, as well as in the laboratory. Populations replicating with sufficiently high error rates will readily spread along these networks and can reach more distant regions in sequence space.

If one were to design the ultimate evolvable molecule that carries information and is engaged in functional interactions, it would ideally require two features: (i) capability of drifting across sequence space without the necessity of changing shape; and (ii) proximity to any common shape everywhere. These are precisely the features that statistically characterize the mapping from RNA sequence to secondary structure.

The work reported here was supported financially by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung (Projects S 5305-PHY, P 9942-PHY, and P 8526-MOB), the European Economic Community (Research Study Contract PSS*03960) and a National Science Foundation (Washington) core grant to the Santa Fe Institute (PHY-9021437). We thank Professor Leo Buss and Professor Andreas Dress for comments on the manuscript, and Dipl.-Chem. Erich Bornberg-Bauer for supplying the data for the insert of figure 2.

REFERENCES

Bandelt, H.-J. & Dress, A. W. M. 1992 A canonical decomposition theory for metrics on a finite set. Adv. Math. 92, 47-105.
Biebricher, C. K. 1983 Darwinian selection of self-replicating RNA molecules. Evol. Biol. 16, 1-52.
Bonhoeffer, S., McCaskill, J. S., Stadler, P. F. & Schuster, P. 1993 RNA multi-structure landscapes. A study based on temperature dependent partition functions. Eur. Biophys. J. 22, 13-24.
Cech, T. R. 1988 Conserved sequences and structures of group I introns: Building an active site for RNA catalysis - a review. Gene 73, 259-271.
Eigen, M. 1971 Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften 58, 465-523.
Eigen, M. & Schuster, P. 1979 The hypercycle - a principle of natural self-organization. Berlin: Springer-Verlag.
Eigen, M., Winkler-Oswatitsch, R. & Dress, A. 1988a Statistical geometry in sequence space: A method of quantitative comparative sequence analysis. Proc. natn. Acad. Sci. U.S.A. 85, 5913-5917.
Eigen, M., McCaskill, J. & Schuster, P. 1988b Molecular quasi-species. J. phys. Chem. 92, 6881-6891.
Eigen, M., McCaskill, J. & Schuster, P. 1989 The molecular quasi-species. Adv. chem. Phys. 75, 149-263.
Fontana, W., Griesmacher, T., Schnabl, W., Stadler, P. F. & Schuster, P. 1991 Statistics of landscapes based on free energies, replication and degradation rate constants of RNA secondary structures. Mh. Chem. 122, 795-819.
Fontana, W., Konings, D. A. M., Stadler, P. F. & Schuster, P. 1993a Statistics of RNA secondary structures. Biopolymers 33, 1389-1404.
Fontana, W., Stadler, P. F., Bornberg-Bauer, E. G., Griesmacher, T., Hofacker, I. L., Tacker, M., Tarazona, P., Weinberger, E. D. & Schuster, P. 1993b RNA folding and combinatory landscapes. Phys. Rev. E 47, 2083-2099.
Hamming, R. W. 1950 Error detecting and error correcting codes. Bell Syst. tech. J. 29, 147-160.
Hamming, R. W. 1986 Coding and information theory, 2nd edn. Englewood Cliffs: Prentice Hall.
Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, L. S., Tacker, M. & Schuster, P. 1993 Fast folding and comparison of RNA secondary structures. Mh. Chem. (In the press.)
Hogeweg, P. & Hesper, B. 1984 Energy directed folding of RNA sequences. Nucl. Acids Res. 12, 67-74.
Huynen, M. A., Konings, D. A. M. & Hogeweg, P. 1993 Multiple coding and the evolutionary properties of RNA secondary structure. J. theor. Biol. 165, 251-267.
Konings, D. A. M. & Hogeweg, P. 1989 Pattern analysis of RNA secondary structure. Similarity and consensus of minimal-energy folding. J. molec. Biol. 207, 597-614.
Le, S.-Y. & Zuker, M. 1990 Common structures of the 5' non-coding RNA in enteroviruses and rhinoviruses. Thermodynamical stability and statistical significance. J. molec. Biol. 216, 729-741.
Major, F., Turcotte, M., Gautheret, D., Lapalme, G., Fillion, E. & Cedergren, R. 1991 The combination of symbolic and numerical computation for three-dimensional modeling of RNA. Science, Wash. 253, 1255-1260.
Major, F., Gautheret, D. & Cedergren, R. 1993 Reproducing the three-dimensional structure of a tRNA molecule from structural constraints. Proc. natn. Acad. Sci. U.S.A. 90, 9408-9412.
Maynard-Smith, J. 1970 Natural selection and the concept of a protein space. Nature, Lond. 225, 563-564.
McCaskill, J. S. 1990 The equilibrium partition function and base pair binding probabilities for RNA secondary structures. Biopolymers 29, 1105-1119.
Perelson, A. S. & Oster, G. F. 1979 Theoretical studies on clonal selection: minimal antibody repertoire size and reliability of self-non-self discrimination. J. theor. Biol. 81, 645-670.
Sankoff, D., Morin, A.-M. & Cedergren, R. J. 1978 The evolution of 5S RNA secondary structures. Can. J. Biochem. 56, 440-443.
Sankoff, D. & Kruskal, J. B. (ed.) 1983 Time warps, string edits and macro-molecules: the theory and practice of sequence comparisons. London: Addison Wesley.
Shapiro, B. A. & Zhang, K. 1990 Comparing multiple RNA secondary structures using tree comparisons. Computer Appl. Biosci. 6, 309-318.
Spiegelman, S. 1971 An approach to the experimental analysis of precellular evolution. Q. Rev. Biophys. 4, 213-253.
Stein, P. R. & Waterman, M. S. 1978 On some new sequences generalizing the Catalan and Motzkin numbers. Discrete Maths 26, 261-272.
Tai, K. 1979 The tree-to-tree correction problem. J. Ass. Comput. Mach. 26, 422-433.
Zuker, M. & Sankoff, D. 1984 RNA secondary structures and their prediction. Bull. Math. Biol. 46, 591-621.
Zuker, M. & Stiegler, P. 1981 Optimal computer folding of large RNA sequences using thermodynamic and auxiliary information. Nucl. Acids Res. 9, 133-148.

Received 25 November 1993; accepted 23 December 1993