A Novel Statistical Model for the Secondary Structure of RNA

ISBN 978-1-84626-093-3
Proceedings of the 5th International Congress on Mathematical Biology (ICMB2011)
Vol. 3
Nanjing, P. R. China, June 3-5, 2011
Liu Hong 1
1 Zhou Pei-Yuan Center for Applied Mathematics, Tsinghua University, Beijing, P.R. China, 100084
Abstract. In this paper, we construct a novel statistical model for the secondary structure of RNA. The
model applies to homo-RNA chains and extends readily to hetero-RNA sequences by taking Watson-Crick
base pairing into account. The main tool is an operator transfer matrix, whose calculation rules are simple
and can be automated. Using the model, we find that the conformational entropy of homo-RNA is the highest
among the sequences studied and grows linearly with chain length, but that the hairpin/coil transition of
homo-RNA is less cooperative than that of natural hetero-RNA sequences. As a typical example, we also
verify that the native secondary structure of the microRNA UPSK is thermally stable with respect to the
model parameters, which we believe is a direct consequence of evolution.
Keywords: statistical model, operator matrix, secondary structure of RNA.
1. Introduction
RNA is one of the most important single-stranded biomolecules in the cell. It serves as a key link
between DNA and proteins: the genetic information of DNA is transcribed into RNA and then translated into
different types of functional proteins. RNA consists of four types of nucleotides, Adenine (A),
Uracil (U), Cytosine (C) and Guanine (G). In most cases, Adenine and Uracil form two hydrogen bonds
between them, while Cytosine and Guanine form three. This rule is called Watson-Crick
base pairing[1]. Other types of hydrogen-bonded pairs are seldom seen. To fulfill its biological functions,
RNA must adopt a specific three-dimensional structure, whose overall topology is usually characterized
by the shape of an inverted letter "L", while its scaffold is provided by the secondary structure elements:
hairpins, hairpin loops, interior loops and bulges. These secondary structures are thermally stable
independently of the tertiary structure, and determine a large fraction of the final RNA structure[1].
In the past several decades, many studies have been dedicated to predicting the three-dimensional
structure of RNA, especially the secondary structure. One of the first attempts was made by Nussinov and
co-workers[2,3]. They built a simple nearest-neighbour energy model and used dynamic
programming to maximize the total number of base pairs. The energy parameters adopted in such models are
derived from empirical calorimetric experiments[4,5], as well as from known structures[6]. Later, Zuker and
Stiegler[7] proposed a refined approach that incorporates stacking effects. In 1999, Rivas and Eddy[8]
published new algorithms to handle pseudoknots, which had to be excluded from the earlier standard
recursions. An alternative computational approach is to sample RNA structures from the Boltzmann
ensemble[9,10].
Besides the many computational studies, statistical mechanical models for RNA have attracted general
interest[11-16], since they can offer the probabilities of all chemically allowed conformations, no matter how
small their populations are. A famous example is the helix/coil transition theory[17], which was specifically
designed for the helical structures of biopolymers. However, this theory cannot be applied to RNA directly,
since RNA's secondary structure is mainly made up of hairpins rather than α-helices. Another systematic
study was done by Chen and Dill[18,19], in which they constructed a "Closed Graph Count Matrix" to
describe the complex topological structure of RNA. A major drawback of their model is the neglect of
Watson-Crick base pairing[1], which limits its application to natural hetero-RNAs.

Corresponding author. Tel.: + 86-010-6279 7075.
E-mail address: hong-l04@mails.tsinghua.edu.cn.
In the current paper, we extend the transfer matrix method used in the classical helix/coil transition
theory and apply it to the study of RNA secondary structures. The construction of the operator transfer
matrix and the corresponding calculation rules for homo- and hetero-RNAs are described in the methodology
part (Section 2). In the application part (Section 3), we compare the conformational entropy and the
hairpin/coil transition for different RNA sequences. We also study the free energy surface of a typical
microRNA, UPSK, whose native structure is found to be thermally stable with respect to all model parameters.
2. Basic Model
2.1. Polymer graph
Fig. 1: (a) Sample secondary structure of RNA. (b) Corresponding polymer graph and nucleotide coding of RNA.
In Fig. 1(a), a typical RNA structure is illustrated, with nucleotides numbered from 1 to 33 along the
chain. If we neglect tertiary topology and focus only on secondary structure, every chemically
allowed conformation of RNA is determined by the state of each nucleotide. According to the different
types of secondary structure elements, the states of nucleotides can be divided into three
classes: hairpin residues (which form hydrogen bonds with another nucleotide), loop residues (in hairpin loops,
interior loops or bulges, forming no hydrogen bonds) and coil residues (at the two ends of an RNA chain, without
hydrogen bonds). The hydrogen bonds between hairpin residues are very strong. In fact, they are the
dominant effect in maintaining the stability of secondary structures in RNA. The coil residues and loop
residues are similar in that neither forms hydrogen bonds with other nucleotides. However, the coil
residues at the two ends of the RNA chain are totally free, while the torsion angles of loop residues are
geometrically constrained. This leads to a minor energy difference between coil residues and loop residues.
The secondary structure of RNA can be expressed systematically through a polymer graph (see Fig. 1(b)),
which is constructed according to the following steps[18,19]: (1) If two nucleotides form a pair through
hydrogen bonding, a link is added between them. The residue at the left end of a link is denoted as $a$,
and the one at the right end as $b$. (2) The loop residues ($l$) lie between hairpin residues and form no links.
Here we neglect the difference between hairpin loops, interior loops and bulges for simplicity. (3) The coil
residues lie at the two ends of an RNA chain and also form no links. The ones at the beginning of a chain are
written as $c_0$, and those at the end as $c_1$. Based on the polymer graph, we will construct the transfer
matrix for the calculation of the partition function.
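The coding steps above can be sketched in a few lines of Python. The dot-bracket input notation and the function name `encode_states` are assumptions made for illustration; the paper itself works from the polymer graph in Fig. 1(b).

```python
def encode_states(dot_bracket: str) -> list[str]:
    """Map a dot-bracket string onto the nucleotide states c0, a, l, b, c1.

    '(' and ')' mark the left and right ends of a link; '.' is an unpaired residue.
    Dot-bracket input is an assumption of this sketch, not the paper's notation.
    """
    paired = [i for i, ch in enumerate(dot_bracket) if ch in "()"]
    if not paired:
        # No link at all: every residue is a free coil residue.
        return ["c0"] * len(dot_bracket)
    first, last = paired[0], paired[-1]
    states = []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            states.append("a")       # hairpin residue, left end of a link
        elif ch == ")":
            states.append("b")       # hairpin residue, right end of a link
        elif i < first:
            states.append("c0")      # coil residue before the first link
        elif i > last:
            states.append("c1")      # coil residue after the last link
        else:
            states.append("l")       # loop residue between hairpin residues
    return states


example = encode_states("..((...)).")
```

For the ten-residue example, the two leading residues are coded as `c0`, the trailing one as `c1`, and the three unpaired residues enclosed by the links as `l`.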
2.2. Statistical weights and operators
The transfer matrix, which has been extensively studied in the helix/coil transition theory[17], is a
powerful method to generate the possible conformations of chain molecules. However, its original version is
limited to the α-helix and cannot guarantee the correct generation of all secondary structures of RNA. We
therefore need to extend the numerical transfer matrix to an operator one, which consists of two parts:
the numerical statistical weights and the operators.
The statistical weight gives the relative probability of each state that a nucleotide can adopt, as
listed in Table 1. The value 1 is arbitrarily assigned to coil residues, since only relative ratios matter; the
parameter $t$ is given to loop residues, accounting for the geometrical constraints on their possible torsion
angles; and the parameter $\sigma$ is given to hairpin residues, representing the hydrogen bonding between two
nucleotides.
The operators are represented by the capital letters in Table 1 and correspond one-to-one to the states of
nucleotides. To guarantee the correct generation of all chemically allowed conformations of RNA, we require
that the parameters ($\sigma$, $t$, $\gamma$; the loop parameter $\gamma$ is introduced in Section 2.3) commute
with the operators, but that the operators ($A$, $B$, $C$) do not commute with each other. They must be
evaluated from left to right according to the following rules:
$$CC = C, \quad ACA = AA, \quad BCB = BB, \quad ACB = C. \tag{1}$$
The rules in Eq. 1 can be deduced from general considerations of hydrogen-bonded pairs in RNA.
Since coil residues do not form any hydrogen bond, we can shorten the operator chain as
$CC = C$, $ACA = AA$, $BCB = BB$. Additionally, two hairpin residues in states $a$ and $b$ can form a hydrogen
bond between them, which gives the simplification rule $ACB = C$. In fact, it can be proved that the rules in
Eq. 1 are both self-consistent and sufficient to guarantee the correct generation of all chemically allowed
conformations of RNA.
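As an illustration of how the rules in Eq. 1 can be applied mechanically, the following Python sketch reduces an operator chain from left to right. The stack-based strategy and the name `reduce_chain` are assumptions of this sketch; the paper only states the rules and that they can be evaluated by symbolic computation.

```python
def reduce_chain(chain: str) -> str:
    """Reduce a string over the operators A, B, C with the four rules of Eq. 1.

    Operators are read from left to right; after each new operator the rules are
    applied to the end of the partially reduced chain until none of them matches.
    """
    stack: list[str] = []
    for op in chain:
        stack.append(op)
        while True:
            tail3 = "".join(stack[-3:])
            if tail3 == "ACB":
                stack[-3:] = ["C"]          # a and b close a link
            elif tail3 == "ACA":
                stack[-3:] = ["A", "A"]     # coil between two A's is dropped
            elif tail3 == "BCB":
                stack[-3:] = ["B", "B"]     # coil between two B's is dropped
            elif stack[-2:] == ["C", "C"]:
                stack[-2:] = ["C"]          # consecutive coils merge
            else:
                break
    return "".join(stack)


hairpin = reduce_chain("AACBB")       # two nested links around one loop residue
two_hairpins = reduce_chain("ACBCACB")
unclosed = reduce_chain("ACBB")       # one B has no partner
```

A chain that reduces to the single operator `C` corresponds to an allowed conformation; any chain that still contains `A` or `B` after reduction contributes nothing once `A = B = 0` is imposed.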
Table 1: States, statistical weights and operators of nucleotides in RNA
state    weight      operator   description
$c_0$    1           $C$        coil residue, at the beginning of a chain
$a$      $\sigma$    $A$        hairpin residue, on the left of a link
$l$      $t$         $C$        loop residue, between hairpin residues
$b$      $\sigma$    $B$        hairpin residue, on the right of a link
$c_1$    1           $C$        coil residue, at the end of a chain
2.3. Operator transfer matrix
Now we can construct the operator transfer matrix for homo-RNA in the same way as in the traditional
helix/coil transition theory[17]. The rows of the matrix stand for the states of nucleotide $i$, and the
columns for the states of nucleotide $i + 1$, both in the order ($c_0$, $a$, $l$, $b$, $c_1$). If a state of
nucleotide $i + 1$ is not allowed to follow that of nucleotide $i$ in RNA, the corresponding entry is set to
zero; otherwise it is the product of the statistical weight and the operator of nucleotide $i + 1$ given in
Table 1. In this way, we construct the operator transfer matrix as
$$
M = \begin{pmatrix}
C & \sigma A & 0 & 0 & 0 \\
0 & \sigma A & \gamma t C & 0 & 0 \\
0 & \sigma A & t C & \sigma B & 0 \\
0 & \sigma A & \gamma t C & \sigma B & C \\
0 & 0 & 0 & 0 & C
\end{pmatrix}. \tag{2}
$$
Here we introduce an additional parameter $\gamma$ to account for the edge effect of a loop.
As a consequence, the partition function for an RNA chain with $n$ nucleotides can be written as
$$
Z = 1 + \left. V M^{n} U \right|_{A = B = 0,\; C = 1}, \tag{3}
$$
where $V = (1\ 0\ 0\ 0\ 0)$ is a $1 \times 5$ vector and $U = (0\ 0\ 0\ 1\ 1)^T$ is a $5 \times 1$ vector. The
main difference from the traditional calculation of a partition function is that here we need to deal with the
operator chains generated by multiplying operator transfer matrices, which are simplified according to
the rules listed in Eq. 1. For each operator chain, if all operators except $C$ cancel in the end, its
statistical weight appears in the final partition function; otherwise its contribution is
zero. This procedure can be carried out automatically by symbolic computation.
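A minimal numerical sketch of Eq. 3 is given below. It assumes the state order (c0, a, l, b, c1), the names sigma and gamma for the hydrogen-bond and loop-edge weights, and a factor gamma on the first residue of every loop; these are choices made for this illustration where the printed matrix is not fully legible. Instead of symbolic matrix products, the sketch enumerates the state paths that $V M^{n} U$ generates and keeps those whose operator chain reduces to `C`, which costs exponential time and is meant for small n only.

```python
# Allowed successors of each state (nonzero entries of the transfer matrix as assumed here).
ALLOWED = {
    "c0": ("c0", "a"),
    "a": ("a", "l"),
    "l": ("a", "l", "b"),
    "b": ("a", "l", "b", "c1"),
    "c1": ("c1",),
}
OPERATOR = {"c0": "C", "a": "A", "l": "C", "b": "B", "c1": "C"}


def push(stack, op):
    """Append one operator and apply the rules of Eq. 1 to the end of the chain."""
    stack = stack + [op]
    while True:
        tail3 = "".join(stack[-3:])
        if tail3 == "ACB":
            stack[-3:] = ["C"]
        elif tail3 == "ACA":
            stack[-3:] = ["A", "A"]
        elif tail3 == "BCB":
            stack[-3:] = ["B", "B"]
        elif stack[-2:] == ["C", "C"]:
            stack[-2:] = ["C"]
        else:
            return stack


def partition_function(n, sigma, t, gamma):
    """Evaluate Z = 1 + V M^n U at A = B = 0, C = 1 by explicit path enumeration."""
    total = 1.0  # the leading 1 in Eq. 3: the all-coil chain

    def walk(state, length, weight, stack):
        nonlocal total
        if length == n:
            # U selects chains ending in b or c1; only fully reduced chains survive.
            if state in ("b", "c1") and stack == ["C"]:
                total += weight
            return
        for nxt in ALLOWED[state]:
            if nxt in ("a", "b"):
                w = sigma
            elif nxt == "l":
                # gamma is charged once, when a loop starts after a hairpin residue
                w = t * (gamma if state in ("a", "b") else 1.0)
            else:
                w = 1.0
            walk(nxt, length + 1, weight * w, push(stack, OPERATOR[nxt]))

    walk("c0", 0, 1.0, [])  # V selects the row of state c0
    return total


z3 = partition_function(3, sigma=2.0, t=0.5, gamma=0.1)
counts = [partition_function(n, 1.0, 1.0, 1.0) for n in range(3, 7)]
```

For n = 3 the only conformations are the all-coil chain and the single hairpin a-l-b, so Z reduces to 1 + sigma^2 * gamma * t. Setting all weights to 1 turns Z into a count of conformations.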
Once the partition function is obtained, a variety of important statistical quantities that characterize the
secondary structure of RNA can be derived[20], e.g. the average number of hairpin residues
$\langle n_b \rangle = \partial \ln Z / \partial \ln \sigma$, of loop residues
$\langle n_l \rangle = \partial \ln Z / \partial \ln t$, of coil residues
$\langle n_c \rangle = n - \langle n_b \rangle - \langle n_l \rangle$, and the average number of loops
$\langle k_l \rangle = \partial \ln Z / \partial \ln \gamma$. Another notable quantity is the specific heat
capacity, which is given as
$$
C_v = k_B \left[ \ln^2 \sigma \, \frac{\partial^2 \ln Z}{\partial (\ln \sigma)^2}
      + \ln^2 t \, \frac{\partial^2 \ln Z}{\partial (\ln t)^2}
      + \ln^2 \gamma \, \frac{\partial^2 \ln Z}{\partial (\ln \gamma)^2} \right], \tag{4}
$$
where $\langle \cdot \rangle$ denotes the statistical ensemble average. In the above calculation, we assume that
the model parameters ($\sigma$, $t$, $\gamma$) are independent of temperature, which is valid only when the
temperature variation is not very large.
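The logarithmic derivatives above can be checked on the smallest non-trivial chain. The sketch below assumes the names sigma and gamma for the hydrogen-bond and loop-edge weights and uses the closed form Z = 1 + sigma^2 * gamma * t for n = 3 (the all-coil chain plus the single hairpin a-l-b), which follows from the transfer matrix as read here. The finite-difference helpers `d1` and `d2` are illustrative, not part of the paper.

```python
import math


def ln_z3(ln_sigma, ln_t, ln_gamma):
    """ln Z for n = 3, with Z = 1 + sigma^2 * gamma * t."""
    return math.log1p(math.exp(2.0 * ln_sigma + ln_t + ln_gamma))


def d1(f, x, i, h=1e-5):
    """Central first derivative of f with respect to its i-th argument."""
    up, dn = list(x), list(x)
    up[i] += h
    dn[i] -= h
    return (f(*up) - f(*dn)) / (2.0 * h)


def d2(f, x, i, h=1e-4):
    """Central second derivative of f with respect to its i-th argument."""
    up, dn = list(x), list(x)
    up[i] += h
    dn[i] -= h
    return (f(*up) - 2.0 * f(*x) + f(*dn)) / h**2


sigma, t, gamma = 2.07, 0.72, 0.0015
x = [math.log(sigma), math.log(t), math.log(gamma)]

n_b = d1(ln_z3, x, 0)        # average number of hairpin residues
n_l = d1(ln_z3, x, 1)        # average number of loop residues
k_l = d1(ln_z3, x, 2)        # average number of loops
n_c = 3 - n_b - n_l          # average number of coil residues
cv_over_kb = sum(x[i] ** 2 * d2(ln_z3, x, i) for i in range(3))  # Eq. 4 in units of k_B
```

With p the population of the hairpin, p = w / (1 + w) and w = sigma^2 * gamma * t, the derivatives should return n_b = 2p and n_l = k_l = p, which gives a quick consistency check of the definitions.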
2.4. Extension to hetero-RNA
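The above argument is valid only for homo-RNA chains. To account for natural hetero-RNA sequences,
we need to associate the statistical weights and operators with the different types of nucleotides, i.e.
Adenine (A), Uracil (U), Cytosine (C) and Guanine (G). The nucleotide type is written as a superscript on the
operators $A$ and $B$. Accordingly, the calculation rules of the operators are extended as
$$
CC = C, \quad
A^{\mathrm{A,U,C,G}} C A^{\mathrm{A,U,C,G}} = A^{\mathrm{A,U,C,G}} A^{\mathrm{A,U,C,G}}, \quad
B^{\mathrm{A,U,C,G}} C B^{\mathrm{A,U,C,G}} = B^{\mathrm{A,U,C,G}} B^{\mathrm{A,U,C,G}}, \tag{5}
$$
$$
A^{\mathrm{A}} C B^{\mathrm{U}} = A^{\mathrm{U}} C B^{\mathrm{A}}
= A^{\mathrm{C}} C B^{\mathrm{G}} = A^{\mathrm{G}} C B^{\mathrm{C}} = C, \tag{6}
$$
$$
A^{\mathrm{A}} C B^{\mathrm{A,C,G}} = A^{\mathrm{U}} C B^{\mathrm{U,C,G}}
= A^{\mathrm{C}} C B^{\mathrm{A,U,C}} = A^{\mathrm{G}} C B^{\mathrm{A,U,G}} = 0. \tag{7}
$$
Note that the rules in Eq. 6 account for Watson-Crick base pairs, while the rules in Eq. 7 are for mismatched
nucleotide pairs.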
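A hedged sketch of the typed rules follows. It takes a nucleotide sequence together with a state assignment (c0, a, l, b, c1), attaches the nucleotide type to every A and B operator, and applies Eqs. 5-7 from left to right. The function name `reduce_hetero` and the convention of returning `None` for a vanishing or unclosed chain are assumptions of this illustration.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
OPERATOR = {"c0": "C", "a": "A", "l": "C", "b": "B", "c1": "C"}


def reduce_hetero(sequence, states):
    """Reduce a typed operator chain; return "C" if it closes, otherwise None.

    Each stack entry is (operator, nucleotide type); the type is irrelevant for C.
    """
    stack = []
    for base, state in zip(sequence, states):
        stack.append((OPERATOR[state], base))
        while True:
            ops = "".join(op for op, _ in stack[-3:])
            if ops == "ACB":
                left, right = stack[-3][1], stack[-1][1]
                if (left, right) not in WATSON_CRICK:
                    return None               # Eq. 7: mismatched pair, weight 0
                stack[-3:] = [("C", None)]    # Eq. 6: Watson-Crick pair closes
            elif ops in ("ACA", "BCB"):
                del stack[-2]                 # Eq. 5: coil between like operators
            elif [op for op, _ in stack[-2:]] == ["C", "C"]:
                stack[-2:] = [("C", None)]    # Eq. 5: consecutive coils merge
            else:
                break
    return "C" if [op for op, _ in stack] == ["C"] else None


closed = reduce_hetero("GGAAACC", ["a", "a", "l", "l", "l", "b", "b"])
mismatch = reduce_hetero("GAU", ["a", "l", "b"])
```

In the first example both links are G-C pairs, so the chain reduces to `C`. In the second, the link would join G and U, which Eq. 7 sets to zero; such non-Watson-Crick pairs could be admitted by enlarging the `WATSON_CRICK` set.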
3. Applications
3.1. Conformational entropy
Conformational entropy ($S_c$) makes a significant contribution to the stability of RNA structures. It is an
important thermodynamic quantity, directly related to the total number of possible RNA conformations
$\Omega(n)$ through $S_c = \ln \Omega(n)$, where $n$ stands for the chain length. In Fig. 2, we study the
length dependence of the conformational entropy for three different types of RNA sequences.
As we can see, there is clearly a linear relationship between the conformational entropy and the chain
length. For homo-RNA, a least-squares linear fit gives $S_c = -1.79 + 0.81n$. For hetero-RNA sequences,
we show two different cases. One is an average over 100 randomly generated hetero-RNA sequences, which gives
$S_c = -1.16 + 0.476n$; the other is the most "stable" RNA sequence, $\mathrm{C}_{n/2}\mathrm{G}_{n/2}$, which is
chosen to form the most Watson-Crick base pairs. In this case, the linear fit gives $S_c = -1.18 + 0.644n$.
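The number of conformations entering $S_c$ can also be counted directly for short chains. The sketch below does not use the operator transfer matrix; it swaps in a standard dynamic-programming recurrence over sub-chains, under the assumption that a conformation is any set of nested links in which each link encloses at least one residue (the condition implied by the rule ACB = C). The names `count_conformations` and `entropy` are illustrative.

```python
import math
from functools import lru_cache

WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def count_conformations(sequence, can_pair):
    """Count nested link patterns in which every link encloses at least one residue."""

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of conformations of residues i .. j-1 (half-open interval).
        if j - i < 3:
            return 1                          # too short to hold a link
        total = count(i, j - 1)               # last residue carries no link
        for k in range(i, j - 2):             # last residue linked to residue k
            if can_pair(sequence[k], sequence[j - 1]):
                total += count(i, k) * count(k + 1, j - 1)
        return total

    return count(0, len(sequence))


def entropy(sequence, can_pair):
    """Conformational entropy S_c = ln(number of conformations)."""
    return math.log(count_conformations(sequence, can_pair))


def any_pair(x, y):
    return True                               # homo-RNA: every pair may bond


def wc_pair(x, y):
    return (x, y) in WATSON_CRICK             # hetero-RNA: Watson-Crick only


homo_counts = [count_conformations("X" * n, any_pair) for n in (6, 10)]
ccgg_count = count_conformations("CCGG", wc_pair)
homo_entropy_20 = entropy("X" * 20, any_pair)
```

Evaluating `entropy` for a range of chain lengths and fitting a straight line would give a slope and intercept that can be compared with the fits quoted above; the sketch makes no claim to reproduce them exactly.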
Fig. 2: Comparison of conformational entropy for different RNA sequences. Red squares stand for numerical data of
homo-RNA based on Eq. 3; blue triangles for the data of the hetero-RNA sequence $\mathrm{C}_{n/2}\mathrm{G}_{n/2}$,
which is chosen to form the most hydrogen bonds; and purple circles for average values of 100 randomly generated
hetero-RNA sequences.
Therefore, at equal chain length, the homo-RNA sequence gives the most possible conformations,
as any two nonconsecutive nucleotides in homo-RNA may form a hydrogen-bonded pair. The possible
conformations of hetero-RNA sequences are far fewer, especially for the randomly generated sequences,
which produce the fewest conformations because of their quite limited number of possible Watson-Crick base
pairs. For natural RNA sequences, the conformational entropy lies between that of the
$\mathrm{C}_{n/2}\mathrm{G}_{n/2}$ sequence and that of the randomly generated ones. Thus their conformational
space is neither too large nor too small, a compromise between fast folding and biological evolution.
A direct consequence of the linear relationship between conformational entropy and chain length is that the
total number of possible RNA conformations grows exponentially with the chain length. This
phenomenon is not a special case, but another form of Levinthal's problem in RNA[21]. As a result, the cost of
exactly enumerating all secondary structures of RNA also grows exponentially with the chain length. A rough
estimate shows that, at present, the partition function is computationally accessible in this way only for RNA
sequences of up to about $n = 50$.
3.2. Hairpin/coil transition in RNA
The formation of RNA's secondary structure is a highly cooperative process. Here we compare the
hairpin/coil transition curves for different RNA sequences: a 20-nucleotide homo-RNA
sequence, a $\mathrm{C}_{10}\mathrm{G}_{10}$ sequence, and the hetero-RNA sequence GUUCUCGAUCUCUAAAAUCG, obtained
by removing the first three nucleotides from the microRNA UPSK (denoted as UPSK20).
The parameter values in our calculation are taken from the experimental data of Turner[4,5], which
show that the bonding energies are around $-0.9$ to $-1.3\ \mathrm{kcal\,mol^{-1}}$ for AU base pairs and
$-2.1$ to $-2.4\ \mathrm{kcal\,mol^{-1}}$ for CG base pairs at $310\ \mathrm{K}$. Each base pair contributes two
hairpin residues, so the product of their weights equals the Boltzmann factor of the pair. Thus, according to
the Boltzmann relationship, we can infer $\sigma_A = \sigma_U \approx 2.07$ to $2.88$ and
$\sigma_C = \sigma_G \approx 5.5$ to $7.1$. For the edge effect
of a loop region, since we do not distinguish here between hairpin loops, interior loops and
bulges, we take an average value of $4\ \mathrm{kcal\,mol^{-1}}$, which gives $\gamma = 0.0015$. The value of $t$
can be obtained from the energy changes with respect to the loop length, which vary from $0.1$ to
$0.4\ \mathrm{kcal\,mol^{-1}}$. Thus $t \approx 0.52$ to $0.85$. In later computations, we set $t = 0.72$
(corresponding to $0.2\ \mathrm{kcal\,mol^{-1}}$) for simplicity.
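The conversion from energies to statistical weights can be sketched as follows. Two assumptions are made here and are not stated explicitly in the text: the loop energies are treated as positive penalties, and the per-residue hairpin weight is taken as the square root of the Boltzmann factor of the pair, since each pair contributes two hairpin residues. With these choices the quoted ranges are reproduced approximately.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal mol^-1 K^-1
T = 310.0              # temperature in K, as quoted in the text


def boltzmann_factor(delta_g):
    """exp(-dG / RT) for an energy dG in kcal/mol."""
    return math.exp(-delta_g / (R_KCAL * T))


def residue_weight(pair_energy):
    """Per-residue hairpin weight, assuming the pair factor is shared by two residues."""
    return math.sqrt(boltzmann_factor(pair_energy))


sigma_au = (residue_weight(-0.9), residue_weight(-1.3))   # about 2.08 and 2.87
sigma_cg = (residue_weight(-2.1), residue_weight(-2.4))   # about 5.50 and 7.01
gamma = boltzmann_factor(4.0)                             # loop-edge penalty, about 0.0015
t_range = (boltzmann_factor(0.4), boltzmann_factor(0.1))  # about 0.52 and 0.85
t_used = boltzmann_factor(0.2)                            # about 0.72
```

The upper CG value comes out slightly below the quoted 7.1, which may reflect rounding or a slightly different value of RT in the original calculation.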
Fig. 3: Comparison of the hairpin/coil transition in different RNA sequences with chain length $n = 20$. The
dependence of the average numbers of hairpin residues, loop residues and loops on the parameter $\sigma$ is shown
in subplots (a-c), while the dependence of the specific heat capacity on the parameters $\sigma$, $t$ and
$\gamma$ is shown in subplots (d-f) respectively.
As we can see in Fig. 3(a,b,c), the transitional behaviors of homo-RNA and $\mathrm{C}_{10}\mathrm{G}_{10}$ are
quite similar. With increasing hydrogen-bonding strength $\sigma$, the average numbers of hairpin residues, loop
residues and loops for the homo-RNA and the sequence $\mathrm{C}_{10}\mathrm{G}_{10}$ grow quickly from zero to
their maximum values ($\langle n_b \rangle = 18$, $\langle n_l \rangle = 2$, $\langle k_l \rangle = 1$), and then
stay constant at the most stable structure. The curves of UPSK20 are rather different: below $\sigma = 1$, no
apparent hairpin structure can be detected; from $\sigma = 2$ to $\sigma = 4$, most of the native
hydrogen-bonded pairs are formed, until $\langle n_b \rangle = 8$, $\langle n_l \rangle = 7$,
$\langle k_l \rangle = 1$, which corresponds exactly to the native secondary structure of UPSK (see Fig.
4(a)). We do not show results for different values of the parameters $t$ and $\gamma$ here, but their
qualitative behaviour is quite clear. In general, the influence of $t$ and $\gamma$ on the average numbers of
hairpin residues, loop residues and loops is quite limited, but the sharpness of the transition curves depends
sensitively on them. The smaller $t$ and $\gamma$ are, the sharper the transition curves become, which also
means higher cooperativity of the hairpin/coil transition of RNA.
The folding cooperativity and thermal stability of RNA sequences can be learned directly from the
specific heat capacity. From Fig. 3(d,e,f), we find that the homo-RNA and the sequence
$\mathrm{C}_{10}\mathrm{G}_{10}$ have the smallest specific heat capacity. Since these two sequences can form
the most hydrogen-bonded pairs, their thermal variations with the parameters $\sigma$, $t$ and $\gamma$ are very
small, except for a sharp peak at $\sigma = 1$, which corresponds to the conformational transition from coil to
hairpin structure. On the contrary, the specific heat capacity of UPSK20 varies more strongly with the model
parameters, which we believe is mainly due to the many alternative conformations produced by Watson-Crick base
pairing in hetero-RNA. More precisely, there are two peaks and one valley on the $C_v$ curve for UPSK20 in
Fig. 3(d). The first peak corresponds to the hairpin/coil transition; the second to the repacking of the RNA
structure; and the valley shows the thermal stability of the native secondary structure of UPSK20. In
Fig. 3(e,f), the specific heat capacity of UPSK20 increases almost monotonically with the parameters $t$ and
$\gamma$. The larger $t$ and $\gamma$ are, the more easily loops are formed. As a consequence, structural
changes also become easier, which leads to a larger value of the specific heat capacity.
3.3. Thermal stability of UPSK
Fig. 4: (a) Native secondary structure of UPSK. (b) Population distribution of the native secondary structure of
UPSK with respect to the parameters $\sigma_A = \sigma_U$ and $\sigma_C = \sigma_G$. Here we set
$\gamma = 0.0015$, $t_A = t_U = t_C = t_G = 0.72$.
In the previous section, we made some general comparisons between homo-RNA and hetero-RNA
sequences regarding the hairpin/coil transition. Now we turn to a specific example, the microRNA UPSK,
whose natural sequence is UGAGUUCUCGAUCUCUAAAAUCG, comprising 23 nucleotides. Its native
secondary structure is shown in Fig. 4(a), with eight coil residues, four consecutive pairs of hairpin residues
and a hairpin loop composed of seven loop residues.
In Fig. 4(b), we find that this native secondary structure is the dominant conformation in a reasonable
parameter region with $\sigma_A = \sigma_U \approx 2.07$ to $2.88$ and $\sigma_C = \sigma_G \approx 5.5$ to
$7.1$. However, the population of the native secondary structure in this region is only about 30-40%, which
means that many other conformations coexist in the system. Therefore the native RNA structure must be
understood from the point of view of a statistical ensemble.
The thermal stability of the native secondary structure of UPSK (see Fig. 4(a)) can be learned from its
specific heat capacity. In Fig. 5, we plot the changes of the specific heat capacity of UPSK with respect to
each model parameter. In Fig. 5(a), when the parameters $\sigma_A = \sigma_U$ change from $2.07$ to $2.88$, the
local minimum of $C_v$ moves continuously from $\sigma_C = \sigma_G = 6.9$ to $5.2$; in Fig. 5(b), when
$\sigma_C = \sigma_G$ change from $5.5$ to $7.1$, the local minimum of $C_v$ moves from
$\sigma_A = \sigma_U = 2.1$ to $1.7$. Both agree well with the reasonable parameter region for natural RNA given
above, with $\sigma_A = \sigma_U \approx 2.07$ to $2.88$ and $\sigma_C = \sigma_G \approx 5.5$ to $7.1$. In
Fig. 5(c), for reasonable values of $\sigma_A = \sigma_U$ and $\sigma_C = \sigma_G$, the corresponding specific
heat capacities at $t = 0.72$ do not deviate much from the values at the nearby local minimum. In Fig. 5(d),
when $\sigma_A = \sigma_U = 2.88$ and $\sigma_C = \sigma_G = 7.1$, the local minimum of $C_v$ lies at
$\gamma = 0.005$, not far from $\gamma = 0.0015$, the value we chose; when $\sigma_A = \sigma_U = 2.07$ and
$\sigma_C = \sigma_G = 5.5$, the value of $C_v$ at $\gamma = 0.0015$ is also close to the local minimum.
All these results strongly support the view that the native secondary structure of UPSK is thermally stable
with respect to all model parameters. In other words, the nucleotide sequence of UPSK is optimized in the
sense of thermal stability, which we believe is a direct consequence of natural selection during evolution.
Fig. 5: Dependence of the specific heat capacity $C_v$ of UPSK on the model parameters (a) $\sigma_C = \sigma_G$;
(b) $\sigma_A = \sigma_U$; (c) $t$; and (d) $\gamma$. In all calculations, we set the non-variable parameters as
$\sigma_A = \sigma_U$, $\sigma_C = \sigma_G$, and $\gamma = 0.0015$, $t_A = t_U = t_C = t_G = 0.72$.
4. Conclusion and Discussion
In the current study, a novel statistical mechanical model based on the operator transfer matrix has been
constructed for the study of RNA structures. Through numerical calculation, we found that the hairpin/coil
transition in homo-RNA is less cooperative than in natural RNA sequences. We further explored the
thermal stability of the microRNA UPSK with respect to the different model parameters, which supports the view
that the nucleotide sequence of natural RNA has been optimized during evolution.
A major omission in the current model is the tertiary structure, which is believed to be essential for the
biological functions of natural RNA. Although the secondary structure is thermally stable on its own, a
comprehensive description of RNA structure still requires knowledge of the tertiary structure. This problem
may be solved through a further generalization of the operator transfer matrix.
Another notable point is that, when extending the homo-RNA model to hetero-RNA sequences, we
introduced specific matching rules (see Eqs. 5-7) for Watson-Crick base pairs. Although these rules hold
generally for natural RNA, there are several exceptions, such as AG and UG pairs[22]. The interactions between
these non-Watson-Crick base pairs are weaker than those of Watson-Crick base pairs, and their natural
occurrence is also relatively low. Our current model can easily be extended to include these cases, simply by
modifying the calculation rules of the operators in Eqs. 5-7.
This work is supported in part by the Tsinghua University Initiative Scientific Research Program
(20101081751).
5. References
[1] A.L. Lehninger, et al. Principles of Biochemistry. W.H. Freeman, 2005.
[2] R. Nussinov, G. Pieczenik, J.R. Griggs and D.J. Kleitman. Algorithms for loop matchings, SIAM Journal on Applied
Mathematics. 1978.
[3] R. Nussinov and A.B. Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA,
PNAS. 1980, 77:6309-6313.
[4] www.bioinfo.rpi.edu/zukerm/rna/energy/.
[5] D.H. Mathews, et al. Incorporating chemical modification constraints into a dynamic programming algorithm for
prediction of RNA secondary structure, PNAS. 2004, 101:7287-7292.
[6] C.B. Do, D.A. Woods and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based
models, Bioinformatics. 2006, 22:90-98.
[7] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary
information, Nucleic Acids Res. 1981, 9:133-148.
[8] E. Rivas and S.R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots,
J. Mol. Biol. 1999, 285:2053-2068.
[9] J.S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure,
Biopolymers. 1990, 29:1105-1119.
[10] Y. Ding and C.E. Lawrence. A statistical sampling algorithm for RNA secondary structure prediction, Nucleic
Acids Res. 2003, 31:7280-7301.
[11] P.C. Bevilacqua and J.M. Blose. Structures, kinetics, thermodynamics, and biological functions of RNA hairpins,
Annu. Rev. Phys. Chem. 2008, 59:79-103.
[12] S.J. Chen and K.A. Dill. RNA folding energy landscapes, PNAS. 2000, 97:646-651.
[13] E. Tostesen, S.J. Chen and K.A. Dill. RNA folding transitions and cooperativity, J. Phys. Chem. 2001,
105:1618-1630.
[14] W.B. Zhang and S.J. Chen. RNA hairpin-folding kinetics, PNAS, 2002, 99:1931-1936.
[15] W.B. Zhang and S.J. Chen. Exploring the complex folding kinetics of RNA hairpins: I. General folding kinetics
analysis, Biophysical Journal, 2006, 90:765-777.
[16] W.B. Zhang and S.J. Chen. Exploring the complex folding kinetics of RNA hairpins: II. Effect of sequence, length,
and misfolded states, Biophysical Journal, 2006, 90:778-787.
[17] D. Poland and H.A. Scheraga. Theory of Helix-Coil Transitions in Biopolymers. Academic Press, 1970.
[18] S.J. Chen and K.A. Dill. Statistical thermodynamics of double-stranded polymer molecules, J. Chem. Phys. 1995,
103:5802-5813.
[19] S.J. Chen and K.A. Dill. Theory for the conformational changes of double-stranded chain molecules, J. Chem.
Phys. 1998, 109:4602-4615.
[20] Kerson Huang. Statistical Mechanics. Wiley, 1963.
[21] C. Levinthal. Are there pathways for protein folding? J. Chim. Phys. Phys.-Chim. Biol. 1968, 65:44-45.
[22] U. Nagaswamy, N. Voss, Z. Zhang and G.E. Fox. Database of non-canonical base pairs found in known RNA
structures, Nucleic Acids Res. 2000, 28:375-376.