ISBN 978-1-84626-093-3
Proceedings of the 5th International Congress on Mathematical Biology (ICMB2011), Vol. 3
Nanjing, P. R. China, June 3-5, 2011

A Novel Statistical Model for the Secondary Structure of RNA

Liu Hong 1
1 Zhou Pei-Yuan Center for Applied Mathematics, Tsinghua University, Beijing, P.R. China, 100084

Abstract. In this paper, a novel statistical model for the secondary structures of RNA is constructed, which is not only applicable to homo-RNA chains but can also be easily extended to hetero-RNA sequences by considering Watson-Crick base pairings. The main approach used here is the operator transfer matrix, whose calculation rules are simple and can be carried out automatically. Through the model, we found that the conformational entropy of homo-RNA is the highest and grows linearly with chain length, but that the hairpin/coil transition of homo-RNA is less cooperative than that of natural hetero-RNA sequences. As a typical example, we also verified that the native secondary structure of microRNA UPSK is thermally stable with respect to the model parameters, which we believe is a direct consequence of nature's evolution.

Keywords: statistical model, operator matrix, secondary structure of RNA.

1. Introduction

RNA is one of the most important single-stranded biomolecules in the cell. It serves as a key link between DNA and proteins: the genetic information of DNA is transcribed into RNA and then translated into different types of functional proteins. RNA is made up of four different types of nucleotides: Adenine (A), Uracil (U), Cytosine (C) and Guanine (G). In most cases, Adenine and Uracil form two hydrogen bonds between them, while Cytosine and Guanine form three hydrogen bonds. This rule is called Watson-Crick base pairing[1]. Other types of hydrogen bonding pairs are seldom seen.
To fulfill its biological functions, RNA must adopt a specific three-dimensional structure, whose overall topology is usually characterized by the shape of an inverted letter “L”, while its scaffold is provided by the secondary structure elements: hairpins, hairpin loops, interior loops and bulges. These secondary structures are thermally stable independently of the tertiary structure and determine a large fraction of the final RNA structure[1].

In the past several decades, many studies have been dedicated to the prediction of the three-dimensional structure of RNA, especially the secondary structure. One of the first attempts was made by Nussinov and co-workers[2,3]. They built a simple nearest-neighbour energy model and used dynamic programming to maximize the total number of base pairs. The energy parameters they adopted were derived from empirical calorimetric experiments[4,5], as well as from known structures[6]. Later, Zuker and Stiegler[7] proposed a slightly refined approach that incorporates the stacking effect. In 1999, Rivas and Eddy[8] published new algorithms to handle pseudoknots, which had to be excluded from the earlier standard recursions. An alternative computational approach is to sample RNA structures from the Boltzmann ensemble[9,10].

Besides the many computational studies, statistical mechanical models for RNA have attracted general interest[11-16], since they offer the probabilities of all chemically allowed conformations, no matter how small their populations are. A famous example is the helix/coil transition theory[17], which was specifically designed for the helical structures of biopolymers. However, this theory cannot be applied to RNA directly, since RNA's secondary structure is mainly made up of hairpins rather than α-helices. Another systematic study was done by Chen and Dill[18,19], in which they constructed a “Closed Graph Count Matrix” to describe the complex topological structure of RNA.
A major drawback of Chen's model is its neglect of Watson-Crick base pairings[1], which limits its application to natural hetero-RNAs.

Corresponding author. Tel.: +86-010-6279 7075. E-mail address: hong-l04@mails.tsinghua.edu.cn.

In the current paper, we extend the transfer matrix method used in the classical helix/coil transition theory and apply it to the study of RNA's secondary structures. The construction of the operator transfer matrix and the corresponding calculation rules for homo- and hetero-RNAs are described in the methodology part. In the application part, we compare the conformational entropy and the hairpin/coil transition for different RNA sequences. We also study the free energy surface of a typical microRNA, UPSK, which is found to be thermally stable with respect to all parameters.

2. Basic Model

2.1. Polymer graph

Fig. 1: (a) Sample secondary structure of RNA. (b) Corresponding polymer graph and nucleotide coding of RNA.

In Fig. 1(a), a typical RNA structure is illustrated, with nucleotides numbered from 1 to 33 along the chain. If we neglect tertiary topology and focus only on secondary structures, all chemically allowed conformations of RNA depend only on the state of each nucleotide. According to the different types of secondary structure elements, the states of nucleotides can be divided into three classes: hairpin residues (which form hydrogen bonds with another nucleotide), loop residues (in hairpin loops, interior loops or bulges, forming no hydrogen bonds) and coil residues (at the two ends of an RNA chain, without hydrogen bonds). The hydrogen bonds between hairpin residues are very strong. In fact, they are the dominant effect in maintaining the stability of secondary structures in RNA. The coil residues and loop residues are similar in that they do not form any hydrogen bonds with other nucleotides.
However, the coil residues at the two ends of the RNA chain are totally free, while the torsion angles of loop residues are geometrically constrained. This leads to a minor energy difference between coil residues and loop residues.

The secondary structure of RNA can be systematically expressed through a polymer graph (see Fig. 1(b)), which is constructed according to the following steps[18,19]:
(1) If two nucleotides form a pair through hydrogen bonding, a link is added between them. The residue at the left of a link is denoted as $a$, and the one at the right as $b$.
(2) The loop residues ($l$) are distributed between hairpin residues and form no links. Here we neglect the difference between hairpin loops, interior loops and bulges for simplicity.
(3) The coil residues stay at the two ends of an RNA chain and have no links either. The ones at the beginning of a chain are written as $c_0$, and those at the end as $c_1$.
Based on the polymer graph, we construct the transfer matrix for the calculation of the partition function.

2.2. Statistical weights and operators

The transfer matrix, which has been extensively studied in the helix/coil transition theory[17], is a powerful method for generating the possible conformations of chain molecules. However, its original version is limited to the α-helix and cannot guarantee the correct generation of all secondary structures of RNA. We therefore extend the numerical transfer matrix to an operator one, which consists of two parts: the numerical statistical weights and the operators.

The statistical weight gives the relative probability of each state that a nucleotide can adopt, as listed in Table 1. The value 1 is arbitrarily assigned to coil residues, since only relative ratios matter; the parameter $t$ is given to loop residues, accounting for the geometrical constraints on their possible torsion angles; and the parameter $\sigma$ is given to hairpin residues, representing the hydrogen bonding between two nucleotides.
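The three construction steps and the per-state weights can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the dot-bracket input format, the function names `label_states` and `statistical_weight`, and the argument name `sigma` for the hairpin weight are assumptions made here, and a chain without any pair is labelled as all `c0` purely by convention.

```python
def label_states(dot_bracket):
    """Label each nucleotide of a dot-bracket string as c0, a, l, b or c1."""
    first_a = dot_bracket.find("(")      # left end of the first link
    last_b = dot_bracket.rfind(")")      # right end of the last link
    states = []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            states.append("a")           # hairpin residue, left of a link
        elif ch == ")":
            states.append("b")           # hairpin residue, right of a link
        elif first_a == -1 or i < first_a:
            states.append("c0")          # coil residue at the chain start
        elif i > last_b:
            states.append("c1")          # coil residue at the chain end
        else:
            states.append("l")           # unpaired residue between hairpin residues
    return states


def statistical_weight(states, sigma, t):
    """Product of per-residue weights: 1 for coil, t for loop, sigma for hairpin."""
    per_state = {"c0": 1.0, "c1": 1.0, "l": t, "a": sigma, "b": sigma}
    weight = 1.0
    for state in states:
        weight *= per_state[state]
    return weight


if __name__ == "__main__":
    states = label_states("..((...)).")
    print(states)
    print(statistical_weight(states, sigma=2.0, t=0.5))
```

The labelling depends only on where the first and last links sit, which mirrors the distinction in the text between free coil residues at the chain ends and constrained loop residues in between.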
The operators are represented by the capital letters in Table 1 and correspond one to one to the states of nucleotides. To guarantee the correct generation of all chemically allowed conformations of RNA, we require that the parameters ($\sigma$, $t$, $\gamma$) commute with the operators, while the operators ($A$, $B$, $C$) do not commute with each other. They must be evaluated from left to right according to the following rules:

$$CC = C, \quad ACA = AA, \quad BCB = BB, \quad ACB = C. \tag{1}$$

The rules in Eq. 1 follow from general considerations of hydrogen bonding pairs in RNA. Since coil residues do not form any hydrogen bonds, the operator chain can be shortened through $CC = C$, $ACA = AA$ and $BCB = BB$. Additionally, two hairpin residues in states $a$ and $b$ can form a hydrogen bond between them, which motivates the simplification rule $ACB = C$. In fact, it can be proved that the rules in Eq. 1 are both self-consistent and sufficient to guarantee the correct generation of all chemically allowed conformations of RNA.

Table 1: States, statistical weights and operators of nucleotides in RNA

state    weight      operator   description
$c_0$    1           $C$        coil residue, at the beginning of a chain
$a$      $\sigma$    $A$        hairpin residue, on the left of a link
$l$      $t$         $C$        loop residue, between hairpin residues
$b$      $\sigma$    $B$        hairpin residue, on the right of a link
$c_1$    1           $C$        coil residue, at the end of a chain

2.3. Operator transfer matrix

Now we can construct the operator transfer matrix for homo-RNA in the same way as in the traditional helix/coil transition theory[17]. In general, the left-most column of the matrix lists the states of nucleotide $i$, while the top row lists the states of nucleotide $i+1$. If the state of nucleotide $i+1$ is not allowed to follow that of nucleotide $i$ in RNA, the corresponding entry is set to zero; otherwise it is the product of the statistical weight and the operator of nucleotide $i+1$ as given in Table 1. In this way, with the states ordered as $c_0$, $a$, $l$, $b$, $c_1$, we construct the operator transfer matrix as

$$M = \begin{pmatrix} C & \sigma A & 0 & 0 & 0 \\ 0 & \sigma A & \gamma t C & 0 & 0 \\ 0 & \sigma A & t C & \sigma B & 0 \\ 0 & \sigma A & \gamma t C & \sigma B & C \\ 0 & 0 & 0 & 0 & C \end{pmatrix}. \tag{2}$$

In Eq. 2, an additional parameter $\gamma$ is introduced to account for the edge effect of a loop. Consequently, the partition function for an RNA chain with $n$ nucleotides can be written as

$$Z = V M^n U \,\big|_{A = B = 0,\; C = 1}, \tag{3}$$

where $V = (1\ 0\ 0\ 0\ 0)$ is a $1 \times 5$ vector and $U = (0\ 0\ 0\ 1\ 1)^T$ is a $5 \times 1$ vector. A dramatic difference from the traditional calculation of the partition function is that here we must deal with the operator chains generated through the multiplication of operator transfer matrices, which are simplified according to the rules listed in Eq. 1. For each operator chain, if all operators except $C$ are cancelled in the end, its statistical weight appears in the final partition function; otherwise its contribution is zero. This procedure can be carried out automatically by symbolic computation.

Once the partition function is obtained, a variety of important statistical quantities that characterize the secondary structure of RNA can be derived[20], e.g. the average number of hairpin residues $\langle n_b \rangle = \partial \ln Z / \partial \ln \sigma$, of loop residues $\langle n_l \rangle = \partial \ln Z / \partial \ln t$, of coil residues $\langle n_c \rangle = n - \langle n_b \rangle - \langle n_l \rangle$, and the average number of loops $\langle k_l \rangle = \partial \ln Z / \partial \ln \gamma$. Another notable quantity is the specific heat capacity, which is given as

$$C_v = k_B \left[ \ln^2 \sigma \, \frac{\partial^2 \ln Z}{\partial (\ln \sigma)^2} + \ln^2 t \, \frac{\partial^2 \ln Z}{\partial (\ln t)^2} + \ln^2 \gamma \, \frac{\partial^2 \ln Z}{\partial (\ln \gamma)^2} \right], \tag{4}$$

where $\langle \cdot \rangle$ denotes the statistical ensemble average. In the above calculation, we assume that the model parameters ($\sigma$, $t$, $\gamma$) are independent of the temperature, which is valid only when the temperature variation is not very large.

2.4. Extension to hetero-RNA

The above argument is valid only for homo-RNA chains. To account for natural hetero-RNA sequences, the statistical weights and operators must be associated with the different types of nucleotides, i.e. Adenine (A), Uracil (U), Cytosine (C) and Guanine (G).
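Returning briefly to the homo-RNA case of Sections 2.2-2.3, the reduction rules of Eq. 1 and the derivative formula for the average number of hairpin residues can be checked numerically on short chains. The sketch below is only an illustration under stated assumptions: it treats Eq. 1 as string rewriting with the left-most match applied first, enumerates dot-bracket strings instead of multiplying transfer matrices, and sets the loop-edge parameter to 1 for brevity. The names `sigma` and `t`, the dot-bracket encoding and all function names are hypothetical.

```python
import math
from functools import lru_cache
from itertools import product

RULES = (("CC", "C"), ("ACA", "AA"), ("BCB", "BB"), ("ACB", "C"))
TO_OPERATOR = {"(": "A", ".": "C", ")": "B"}   # a -> A, b -> B, coil/loop -> C


def reduce_chain(chain):
    """Apply the rules of Eq. 1, left-most match first, until none applies."""
    while True:
        matches = [(chain.find(lhs), lhs, rhs) for lhs, rhs in RULES if lhs in chain]
        if not matches:
            return chain
        pos, lhs, rhs = min(matches)
        chain = chain[:pos] + rhs + chain[pos + len(lhs):]


@lru_cache(maxsize=None)
def allowed_structures(n):
    """Dot-bracket strings of length n whose operator chain reduces to a single C."""
    result = []
    for chars in product("(.)", repeat=n):
        chain = "".join(TO_OPERATOR[ch] for ch in chars)
        if reduce_chain(chain) == "C":
            result.append("".join(chars))
    return tuple(result)


def counts(s):
    """(hairpin residues, loop residues); loop residues lie between first '(' and last ')'."""
    first, last = s.find("("), s.rfind(")")
    n_b = s.count("(") + s.count(")")
    n_l = s[first:last].count(".") if first != -1 else 0
    return n_b, n_l


def partition_function(n, sigma, t):
    return sum(sigma ** nb * t ** nl for nb, nl in map(counts, allowed_structures(n)))


def mean_hairpin_direct(n, sigma, t):
    """Ensemble average of the number of hairpin residues, summed state by state."""
    z = partition_function(n, sigma, t)
    return sum(nb * sigma ** nb * t ** nl for nb, nl in map(counts, allowed_structures(n))) / z


def mean_hairpin_derivative(n, sigma, t, h=1e-4):
    """Central finite difference of ln Z with respect to ln sigma."""
    up = math.log(partition_function(n, sigma * math.exp(h), t))
    down = math.log(partition_function(n, sigma * math.exp(-h), t))
    return (up - down) / (2 * h)


if __name__ == "__main__":
    n, sigma, t = 8, 2.0, 0.72
    print(len(allowed_structures(n)))
    print(mean_hairpin_direct(n, sigma, t), mean_hairpin_derivative(n, sigma, t))
```

Because the enumeration visits all $3^n$ strings, this check is only practical for very short chains; its purpose is to show that a chain contributes exactly when its operators cancel down to $C$, and that differentiating $\ln Z$ recovers the directly averaged count.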
For hetero-RNA, the calculation rules of the operators are accordingly extended as

$$CC = C, \quad A_{A,U,C,G}\, C\, A_{A,U,C,G} = A_{A,U,C,G}\, A_{A,U,C,G}, \quad B_{A,U,C,G}\, C\, B_{A,U,C,G} = B_{A,U,C,G}\, B_{A,U,C,G}, \tag{5}$$

$$A_A C B_U = A_U C B_A = A_C C B_G = A_G C B_C = C, \tag{6}$$

$$A_A C B_{A,C,G} = A_U C B_{U,C,G} = A_C C B_{A,U,C} = A_G C B_{A,U,G} = 0. \tag{7}$$

Note that the rules in Eq. 6 account for Watson-Crick base pairs, while the rules in Eq. 7 are for mismatched nucleotide pairs.

3. Applications

3.1. Conformational entropy

Conformational entropy ($S_c$) makes significant contributions to the stability of RNA structures. It is an important thermodynamic quantity, directly related to the total number of possible RNA conformations through $S_c = \ln \Omega(n)$, where $n$ stands for the chain length. In Fig. 2, we study the length dependence of the conformational entropy for three different types of RNA sequences. There is clearly a linear relationship between the conformational entropy and the chain length. For homo-RNA, a least-squares linear fit gives $S_c = -1.79 + 0.81n$. For hetero-RNA sequences, we show two different cases. One is a set of 100 randomly generated hetero-RNA sequences, which gives an average $S_c = -1.16 + 0.476n$; the other is the most “stable” RNA sequence ($C_{n/2}G_{n/2}$), which is chosen to form the most Watson-Crick base pairs. In this case, the linear fit gives $S_c = -1.18 + 0.644n$.

Fig. 2: Comparison of conformational entropy for different RNA sequences. Red squares stand for numerical data of homo-RNA based on Eq. 3; blue triangles for the data of the hetero-RNA sequence $C_{n/2}G_{n/2}$, which is chosen to form the most hydrogen bonds; and purple circles for average values of 100 randomly generated hetero-RNA sequences.

Therefore, at equal chain length, the homo-RNA sequence gives the largest number of possible conformations, as any two nonconsecutive nucleotides in homo-RNA may form a hydrogen bonding pair.
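The number of conformations entering $S_c$ can also be counted with a textbook recursion over nested base pairs, which gives an independent handle on how the count grows with chain length. This is a sketch under stated assumptions, not the operator method of Eq. 3: a hairpin must enclose at least one unpaired residue, homo-RNA is modelled by letting any two such residues bond, hetero-RNA by allowing only Watson-Crick pairs, and the function names are hypothetical. The counts need not reproduce the fitted coefficients quoted in the text.

```python
import math
from functools import lru_cache

WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def count_conformations(seq, homo=False):
    """Count nested secondary structures of seq, including the fully unpaired chain."""

    def can_pair(i, k):
        return homo or (seq[i], seq[k]) in WATSON_CRICK

    @lru_cache(maxsize=None)
    def count(i, j):
        # Structures on residues i..j (inclusive); stretches shorter than 3 allow no pair.
        if j - i < 2:
            return 1
        total = count(i + 1, j)                      # residue i stays unpaired
        for k in range(i + 2, j + 1):                # residue i pairs with residue k
            if can_pair(i, k):
                total += count(i + 1, k - 1) * count(k + 1, j)
        return total

    return count(0, len(seq) - 1)


def conformational_entropy(seq, homo=False):
    """S_c = ln(number of conformations)."""
    return math.log(count_conformations(seq, homo))


if __name__ == "__main__":
    for n in (10, 20, 30):
        homo = conformational_entropy("A" * n, homo=True)
        cg = conformational_entropy("C" * (n // 2) + "G" * (n // 2))
        print(n, round(homo, 2), round(cg, 2))
```

Restricting `can_pair` to Watson-Crick pairs removes terms from the sum, so a hetero-RNA sequence can never have more conformations than a homo-RNA chain of the same length in this sketch.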
The possible conformations of hetero-RNA sequences are far fewer, especially for the randomly generated RNA sequences, which produce the fewest conformations because their possible Watson-Crick base pairs are quite limited. For natural RNA sequences, the conformational entropy lies between that of the $C_{n/2}G_{n/2}$ sequence and that of the randomly generated ones. Thus their conformational space is neither too large nor too small, a compromise between fast folding and biological evolution.

A direct consequence of the linear relationship between conformational entropy and chain length is that the total number of possible RNA conformations grows exponentially with the chain length. In fact, this phenomenon is not a special case, but another form of Levinthal's problem in RNA[21]. As a result, the exact enumeration of all secondary structures of RNA is an NP-hard problem. A rough estimate shows that, at present, the partition function is computationally tractable only for RNA sequences up to about $n = 50$.

3.2. Hairpin/coil transition in RNA

The formation of RNA's secondary structure is a highly cooperative process. Here we compare the hairpin/coil transition curves for different RNA sequences: a 20-nucleotide homo-RNA sequence, a $C_{10}G_{10}$ sequence and the hetero-RNA sequence GUUCUCGAUCUCUAAAAUCG, obtained by removing the first three nucleotides from the microRNA UPSK (denoted as UPSK20). The parameter values in our calculation are taken from the experimental data of Turner[4,5], which show that the bonding energies are around 0.9 ~ 1.3 kcal mol$^{-1}$ for AU base pairs and 2.1 ~ 2.4 kcal mol$^{-1}$ for CG base pairs at 310 K. Thus, according to the Boltzmann relationship, we infer $\sigma_A = \sigma_U = 2.07 \sim 2.88$ and $\sigma_C = \sigma_G = 5.5 \sim 7.1$. For the edge effect of a loop region, since we do not distinguish between hairpin loops, interior loops and bulges, we take an average value of 4 kcal mol$^{-1}$, which gives $\gamma = 0.0015$.
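The conversion from energies to statistical weights quoted above can be reproduced approximately with a Boltzmann factor. In the sketch below the pair energy is split evenly between the two hairpin residues of a pair; this is an assumption made here because it roughly recovers the quoted ranges, not a statement taken from the paper. The gas constant value, the constant names and the function names are likewise illustrative.

```python
import math

R_KCAL = 0.0019872   # gas constant in kcal mol^-1 K^-1
T = 310.0            # temperature in K, as in the text


def hairpin_weight(pair_energy):
    """Weight per hairpin residue for a stabilising pair energy in kcal/mol.

    Assumption: the pair energy is shared equally by the two paired residues.
    """
    return math.exp(pair_energy / (2 * R_KCAL * T))


def penalty_weight(energy_cost):
    """Weight for an unfavourable energy cost in kcal/mol, e.g. the loop-edge effect."""
    return math.exp(-energy_cost / (R_KCAL * T))


if __name__ == "__main__":
    print([round(hairpin_weight(e), 2) for e in (0.9, 1.3)])   # AU energy range
    print([round(hairpin_weight(e), 2) for e in (2.1, 2.4)])   # CG energy range
    print(round(penalty_weight(4.0), 4))                        # loop-edge cost
```

With these constants the upper CG value comes out slightly below the quoted 7.1, so the original conversion may have used slightly different constants or rounding.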
The value of $t$ can be obtained from the energy changes with respect to the loop length, which vary from 0.1 to 0.4 kcal mol$^{-1}$. Thus $t = 0.52 \sim 0.85$. In later computations, we set $t = 0.72$ (corresponding to 0.2 kcal mol$^{-1}$) for simplicity.

Fig. 3: Comparison of the hairpin/coil transition in different RNA sequences with chain length $n = 20$. The dependence of the average numbers of hairpin residues, loop residues and loops on parameter $\sigma$ is shown in subplots (a-c), while the dependence of the specific heat capacity on parameters $\sigma$, $t$ and $\gamma$ is shown in subplots (d-f) respectively.

As we can see in Fig. 3(a,b,c), the transitional behaviors of homo-RNA and $C_{10}G_{10}$ are quite similar. As the hydrogen bonding effect $\sigma$ increases, the average numbers of hairpin residues, loop residues and loops for the homo-RNA and the sequence $C_{10}G_{10}$ grow quickly from zero to their maximum values ($\langle n_b \rangle = 18$, $\langle n_l \rangle = 2$, $\langle k_l \rangle = 1$), then stay constant at the most stable structure. The curves of UPSK are rather different: below $\sigma = 1$, no apparent hairpin structures can be detected; from $\sigma = 2$ to $4$, most of the native hydrogen bonding pairs are formed, until, at $\sigma = 20$, $\langle n_b \rangle = 8$, $\langle n_l \rangle = 7$, $\langle k_l \rangle = 1$, which corresponds exactly to the native secondary structure of UPSK (see Fig. 4(a)).

Here we do not show results for different parameters $t$ and $\gamma$; however, their qualitative behaviours are quite clear. In general, the influence of $t$ and $\gamma$ on the average numbers of hairpin residues, loop residues and loops is quite limited, but the sharpness of the transition curves depends sensitively on them. The smaller $t$ and $\gamma$ are, the sharper the transition curves will be, which also means higher cooperativity of the hairpin/coil transition of RNA.

The folding cooperativity and thermal stability of RNA sequences can be learned directly from the specific heat capacity. From Fig. 3(d,e,f), we find that the homo-RNA and the sequence $C_{10}G_{10}$ have the smallest specific heat capacity.
Since these two sequences can form the most hydrogen bonding pairs, their thermal variations with parameters $\sigma$, $t$ and $\gamma$ are very small, except for a sharp peak at $\sigma = 1$, which corresponds to the conformational transition from coil to hairpin structure. In contrast, the specific heat capacity of UPSK20 varies more strongly with the model parameters, which we believe is mainly due to the many alternative conformations produced by Watson-Crick base pairings in hetero-RNA. More precisely, there are two peaks and one valley on the $C_v$ curve for UPSK20 in Fig. 3(d). The first peak corresponds to the hairpin/coil transition; the second one to the repacking of the RNA structure; and the valley shows the thermal stability of the native secondary structure of UPSK20. In Fig. 3(e,f), the specific heat capacity of UPSK20 increases almost monotonically with parameters $t$ and $\gamma$. The larger $t$ and $\gamma$ are, the more easily loops can form. As a consequence, structural changes also become easier, which leads to a larger specific heat capacity.

3.3. Thermal stability of UPSK

Fig. 4: (a) Native secondary structure of UPSK. (b) Population distribution of the native secondary structure of UPSK with respect to parameters $\sigma_A = \sigma_U$ and $\sigma_C = \sigma_G$. Here we set $\gamma = 0.0015$, $t_A = t_U = t_C = t_G = 0.72$.

In the previous section, we made some general comparisons between homo-RNA and hetero-RNA sequences regarding the hairpin/coil transition. Now we turn to a specific example, the microRNA UPSK, whose natural sequence is UGAGUUCUCGAUCUCUAAAAUCG, with 23 nucleotides. Its native secondary structure is shown in Fig. 4(a), with eight coil residues, four consecutive pairs of hairpin residues and a hairpin loop composed of seven loop residues. In Fig. 4(b), we find that this native secondary structure is the dominant conformation in a reasonable parameter region with $\sigma_A = \sigma_U = 2.07 \sim 2.88$ and $\sigma_C = \sigma_G = 5.5 \sim 7.1$.
However, the population of the native secondary structure in this region is only about 30 ~ 40%, which means that many other conformations exist simultaneously in the system. Therefore the native RNA structure must be understood from the viewpoint of a statistical ensemble.

The thermal stability of the native secondary structure of UPSK (see Fig. 4(a)) can be learned from its specific heat capacity. In Fig. 5, we plot the changes of the specific heat capacity of UPSK with respect to each model parameter. In Fig. 5(a), when $\sigma_A = \sigma_U$ changes from 2.07 to 2.88, the local minimum of $C_v$ moves continuously from $\sigma_C = \sigma_G = 6.9$ to $5.2$; in Fig. 5(b), when $\sigma_C = \sigma_G$ changes from 5.5 to 7.1, the local minimum of $C_v$ moves from $\sigma_A = \sigma_U = 2.1$ to $1.7$. Both match very well with the reasonable parameter region for natural RNA given above, with $\sigma_A = \sigma_U = 2.07 \sim 2.88$ and $\sigma_C = \sigma_G = 5.5 \sim 7.1$. In Fig. 5(c), for reasonable values of $\sigma_A = \sigma_U$ and $\sigma_C = \sigma_G$, the corresponding specific heat capacities at $t = 0.72$ do not deviate much from the values at the nearby local minimum. In Fig. 5(d), when $\sigma_A = \sigma_U = 2.88$ and $\sigma_C = \sigma_G = 7.1$, the local minimum of $C_v$ lies at $\gamma = 0.005$, not far from the value $\gamma = 0.0015$ we chose; when $\sigma_A = \sigma_U = 2.07$ and $\sigma_C = \sigma_G = 5.5$, the value of $C_v$ at $\gamma = 0.0015$ is also close to the local minimum. All these results strongly support the conclusion that the native secondary structure of UPSK is thermally stable with respect to all model parameters. In other words, the nucleotide sequence of UPSK is optimized in the sense of thermal stability, which we believe is a direct consequence of natural selection during evolution.

Fig. 5: Dependence of the specific heat capacity $C_v$ of UPSK on model parameters (a) $\sigma_C = \sigma_G$; (b) $\sigma_A = \sigma_U$; (c) $t$; and (d) $\gamma$. In all calculations, the parameters that are not varied are set as $\sigma_A = \sigma_U$, $\sigma_C = \sigma_G$, and $\gamma = 0.0015$, $t_A = t_U = t_C = t_G = 0.72$.

4. Conclusion and Discussion

In the current study, based on the operator transfer matrix, a novel statistical mechanical model has been constructed for the study of RNA structures.
Through numerical calculation, we found that the hairpin/coil transition in homo-RNA appears to be less cooperative than in natural RNA sequences. We further explored the thermal stability of microRNA UPSK with respect to different model parameters, which supports the view that the nucleotide sequence of natural RNA has been optimized during nature's evolution.

A major omission in the current model is the tertiary structure, which is believed to be essential for the biological functions of natural RNA. Although the secondary structure is thermally stable on its own, a comprehensive description of RNA structure still requires knowledge of the tertiary structure. This problem may be solved through a further generalization of the operator transfer matrix.

Another notable point is that, when extending the homo-RNA model to hetero-RNA sequences, we introduced specific matching rules (see Eqs. 5-7) for Watson-Crick base pairs. Despite their generality for natural RNA, there still exist several exceptions, such as AG and UG pairs[22]. The interactions between these non-Watson-Crick base pairs are weaker than those between Watson-Crick base pairs, and their natural occurrence is also relatively low. Our current model can easily be extended to include these cases, simply by modifying the calculation rules of operators in Eqs. 5-7.

This work is supported in part by the Tsinghua University Initiative Scientific Research Program (20101081751).

5. References

[1] A.L. Lehninger, et al. Principles of Biochemistry. W.H. Freeman, 2005.
[2] R. Nussinov, G. Pieczenik, J.R. Griggs and D.J. Kleitman. Algorithms for loop matchings, SIAM Journal on Applied Mathematics. 1978.
[3] R. Nussinov and A.B. Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA, PNAS. 1980, 77:6309-6313.
[4] www.bioinfo.rpi.edu/zukerm/rna/energy/.
[5] D.H. Mathews, et al. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure, PNAS. 2004, 101:7287-7292.
[6] C.B. Do, D.A. Woods and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models, Bioinformatics. 2006, 22:90-98.
[7] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information, Nucleic Acids Res. 1981, 9:133-148.
[8] E. Rivas and S.R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots, J. Mol. Biol. 1999, 285:2053-2068.
[9] J.S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure, Biopolymers. 1990, 29:1105-1119.
[10] Y. Ding and C.E. Lawrence. A statistical sampling algorithm for RNA secondary structure prediction, Nucleic Acids Res. 2003, 31:7280-7301.
[11] P.C. Bevilacqua and J.M. Blose. Structures, kinetics, thermodynamics, and biological functions of RNA hairpins, Annu. Rev. Phys. Chem. 2008, 59:79-103.
[12] S.J. Chen and K.A. Dill. RNA folding energy landscapes, PNAS. 2000, 97:646-651.
[13] E. Tostesen, S.J. Chen and K.A. Dill. RNA folding transitions and cooperativity, J. Phys. Chem. 2001, 105:1618-1630.
[14] W.B. Zhang and S.J. Chen. RNA hairpin-folding kinetics, PNAS. 2002, 99:1931-1936.
[15] W.B. Zhang and S.J. Chen. Exploring the complex folding kinetics of RNA hairpins: I. General folding kinetics analysis, Biophysical Journal. 2006, 90:765-777.
[16] W.B. Zhang and S.J. Chen. Exploring the complex folding kinetics of RNA hairpins: II. Effect of sequence, length, and misfolded states, Biophysical Journal. 2006, 90:778-787.
[17] D. Poland and H.A. Scheraga. Theory of Helix-Coil Transitions in Biopolymers. Academic Press, 1970.
[18] S.J. Chen and K.A. Dill. Statistical thermodynamics of double-stranded polymer molecules, J. Chem. Phys. 1995, 103:5802-5813.
[19] S.J. Chen and K.A. Dill. Theory for the conformational changes of double-stranded chain molecules, J. Chem. Phys. 1998, 109:4602-4615.
[20] K. Huang. Statistical Mechanics. Wiley, 1963.
[21] C. Levinthal. Are there pathways for protein folding? J. Chim. Phys. Phys.-Chim. Biol. 1968, 65:44-45.
[22] U. Nagaswamy, N. Voss, Z. Zhang and G.E. Fox. Database of non-canonical base pairs found in known RNA structures, Nucleic Acids Res. 2000, 28:375-376.