6.5 The RNA Secondary Structure Prediction with Simple Pseudoknots

advertisement
CHAPTER 6
The Secondary Structure Prediction of RNA
In Chapter 1, we mentioned that there are many different types of RNAs, such as
mRNA, tRNA and rRNA each with the different function. It is well known that the
function of an RNA is determined by its three-dimensional structure. Hence, knowing
the three-dimensional structures of RNAs is important for us to understand their
functions.
The three-dimensional structure of an RNA can be predicted experimentally by
X-ray crystallography and nuclear magnetic resonance (NMR). But these
experimental methods are difficult and quite time consuming. Moreover, they are not
always feasible because X-ray crystallography can be applied only to crystallized
molecules and NMR is limited to small molecules (< 200 amino acids at present). It
was discovered that the three-dimensional of an RNA can be uniquely determined
from its sequence (i.e., the primary sequence). Hence, much theoretical effort has
been made in determining the three-dimensional structure of an RNA from its
sequence alone. Up to now, it is still a hard work to predict the three-dimensional
structure of an RNA directly from its sequence. However, there are efficient
algorithms to predict the secondary structure of an RNA, which is useful in predicting
the three-dimensional structure. To predict the three-dimensional structure of an RNA
sequence, we can first determine its secondary structure and then predict its
three-dimensional structure according to this secondary structure (see Figure 6.1).
6-1
Figure 6.1: Folding of Phenylalanyl-Transfer-RNA into Its Spatial Structure. (a) The
Primary Structure of the RNA (b) The Secondary Structure of the RNA (c) The
Three-Dimensional Structure of the RNA
RNA is a single strand of nucleotides (bases) adenine (A), guanine (G), cytosine (C)
and uracil (U). The sequence of the bases A, G, C and U is called the primary
structure of an RNA. In RNA, G and C can form a base pair G≡C by a
triple-hydrogen bond, A and U can form a base pair A=U by a double-hydrogen bond,
and G and U can form a base pair G.U by a single hydrogen bond. Due to these
hydrogen bonds, the primary structure of an RNA can fold back on itself to form its
secondary structure. For example, suppose that we have the following RNA sequence.
A–G–G–C–C–U–U–C–C–U
Then, this sequence can fold back on itself to form many possible secondary
structures. In Figure 6.2, we show six possible secondary structures of this sequences.
In nature, however, there is only one secondary structure to correspond to an RNA
6-2
sequence. What is the actual secondary structure of an RNA sequence?
Figure 6.2: Six Possible Secondary Structures of RNA Sequence
A–G–G–C–C–U–U–C–C–U (The Dashed Lines Denote the Hydrogen Bonds)
According to the thermodynamic hypothesis, the actual secondary structure of an
RNA sequence is the one with the minimum free energy, which will be explained later.
In nature, only the stable structure can exist and the stable strucutre must be the one
with the minimum free energy. In a secondary structure of an RNA, the base pairs will
increase the structural stability, but the unpaired bases will decrease the structural
stability. The base pairs of the types G≡C and A=U (called Watson-Crick base pairs)
are more stable than that of the type G.U (called wobble base pairs). According to
these factors, we can find that the secondary structure of Figure 6.2 (f) is the actual
secondary structure of sequence A–G–G–C–C–U–U–C–C–U.
Hence, we formally define the secondary structure prediction problem as follows.
Given an RNA sequence, determine the secondary structure of the minimum free
energy from this sequence. In the following sections, we shall give a formal definition
of secondary structure of an RNA and then introduce some efficient algorithms to
predict the secondary structure of an RNA sequence.
6-3
6.1 Secondary Structure of RNA
An RNA sequence will be represented as a string of n characters R = r1r2 · · ·rn,
where ri  {A, C, G, U}. Typically, n can range from 20 to 2000. A secondary
structure of R is a set S of base pairs (ri, rj), where 1 ≒ i < j ≒ n, such that the
following conditions are satisfied.
(1) j- i > t, where t is a small positive constant. Typically, t = 3.
(2) If (ri, rj) and (rk, rl) are two base pairs in S and i ≒ k, then either
(a) i = k and j = l, i.e., (ri, rj) and (rk, rl) are the same base pair,
(b) i < j < k < l, i.e., (ri, rj) precedes (rk, rl), or
(c) i < k < l < j, i.e., (ri, rj) includes (rk, rl).
The first condition implies that RNA sequence does not fold too sharply on itself. The
second condition means that each nucleotide can take part in at most one base pair,
and guarantees that the secondary structure contains no pseudoknot. Two base pairs
(ri, rj) and (rk, rl) are called a pseudoknot if i < k < j < l (see Figure 6.3). Pseudoknots
do occur in RNA molecules, but their exclusion simpli.es the problem. By the above
de.nition, a secondary structure can be represented as an outerplanar graph with
degree at most 3, where an outerplanar graph is a graph which can be drawn in the
plane in such a way that all vertices (i.e., nucleotides) are arranged on a circle and all
edges (i.e., base pairs) lie inside the circle and do not intersect. Note that the example
in Figure 6.3, which contains a pseudoknot, cannot be represented as an outerplanar
graph.
6-4
Figure 6.3: An Example of a Pseudoknot
Recall that the goal of the secondary structure prediction is to find a secondary
structure with the minimum free energy. Hence we must have a method to calculate
the free energy of a secondary structure S. Since the formations of base pairs give
stabilizing effects to the structural free energy, the simplest method of measuring the
free energy of S is to assign an energy to each base pair of S and then the free energy
of S is the sum of the energies of all base pairs. Due to different hydrogen bonds, the
energies of base pairs are usually assigned as different values. For example, the
reasonable values for A≡U, G=C and G–U are -3, -2 and -1 (Kcal/mole), respectively.
Other possible values might be that the energies of base pairs are all equal. In this case,
the problem becomes the one of finding a secondary structure with the maximum
number of base pairs. This version of the secondary structure prediction problem is
also called RNA maximum base pair matching problem since we can view a secondary
structure as a matching. In next section, we will introduce a dynamic programming
algorithm to .nd a secondary structure of the maximum number of base pairs.
6.2 The RNA Maximum Base Pair Matching Algorithm
In this section, we shall consider the RNA maximum base pair matching problem
which is defined as follows. Given an RNA sequence R = r1r2 · · ·rn,find a secondary
structure of RNA with the maximum number of base pairs.
Let Si,j denote the secondary structure of the maximum number of base pairs on the
substring Ri,j = riri+1 · · · rj. Denote by Mi,j the size of Si,j (i.e., Mi,j = |Si,j |). Notice that
not any two bases ri and rj, where 1 ≒ i < j ≒ n, can be paired with each other. The
6-5
admissible base pairs we consider here are Watson-Crick base pairs (i.e., A≡U and
G=C) and wobble base pairs (i.e., G–U). Let WW = {(A, U), (U, A), (G, C), (C, G), (G,
U), (U, G)}. Then, we use a function ρ(ri, rj) to indicate whether any two bases ri and
rj can be a legal base pair:
 1 if (ri , rj ) WW
 (ri , rj )  
 0 otherwise
By de.nition, we know that RNA sequence does not fold too sharply on itself. That is,
if j - i ≦ 3, then ri and rj cannot be a base pair of Si,j . Hence, we let Mi,j = 0 if j - i ≦ 3.
To compute Mi,j, where j . i > 3, we consider the following cases from rj point of
view.
Case 1: In the optimal solution, rj is not paired with any other base. In this
case,find an optimal solution for riri+1 . . . rj-1 and Mi,j =Mi,j-1.
Case 2: In the optimal solution, rj is paired with ri and ρ(ri, rj) = 1. In this case,
find an optimal solution for ri+1ri+2 . . . rj-1 and Mi,j = 1+Mi+1,j-1.
Case 3: In optimal solution, rj is paired with some rk, where i + 1 ≒ k ≒ j . 4 and
ρ(rk, rj) = 1. In this case, .nd the optimal solutions for riri+1 . . . rk-1 and rk+1rk+1 . . . rj-1
Figure 6.4: Illustration of Case 1
and
M i , j  1  M i ,k 1  M k 1, j 1
Since we want to find the k between i+ 1 and j . 4 such that Mi,j is the maximum, we
have
6-6
M i , j  max
 1 M
i 1 k  j  4
i , k 1
 M k 1, j 1
In summary, we have the following recursive formula to compute Mi,j .
‧
If j - i ≦ 3, then Mi,j = 0.
‧
Figure 6.5: Illustration of Case 2
Figure 6.6: Illustration of Case 3
‧ If j - i > 3, then
6-7

M i, j
M
i , j 1


 max (1  M i 1, j 1 )   (rk , rj )

max (1  M i , k 1  M k 1, j 1 )   (rk , rj 

i 1 k  j  4
According to the above formula, we can design Algorithm 6.1 to computeM1,n using
the dynamic programming technique. Table 6.1 illustrates the computation of Mi,j,
where 1 ≦ i < j ≦ 10, for an RNA sequence R1,10 = A–G–G–C–C–U–U–C–C–U. As a
result, we can find that the maximum number of base pairs in S1,10 is 3 since M1,10 = 3.
Table 6.1: The Computation of the Maximum Number of Base Pairs of an RNA
Sequence A–G–G–C–C–U–U–C–C–U
Algorithm 6.1 An RNA maximum base pair matching algorithm
Input: An RNA sequence R = r1r2 · · · rn.
Output: Find a secondary structure of RNA with the maximum number of base pairs.
Step 1: /* Computation of ρ(ri, rj) function for 1 ≦ i < j ≦ n */
WW = {(A, U), (U, A), (G, C), (C, G), (G, U), (U, G)};
for i = 1 to n do
for j = i to n do
if (ri, rj)  WW then ρ(ri, rj) = 1; else ρ(ri, rj) = 0;
end for
6-8
end for
Step 2: /* Initialization of Mi,j for j . i ≒ 3 */
for i = 1 to n do
for j = i to i + 3 do
if j ≦ n then Mi,j = 0;
end for
end for
Step 3: /* Calculation of Mi,j for j - i > 3 */
for h = 4 to n - 1 do
for i = 1 to n - h do
j = i + h;
case1 =Mi,j.1;
case2 = (1+Mi+1,j.1) × ρ(ri, rj );
case3 = M i , j  max
 (1  M
i 1 k  j  4
i , k 1
 M k 1, j 1 )   (rk , r j )
Mi,j = max{case1, case2, case3};
end for
end for
In the following, let us illustrate the whole procedure in detail.
r1 r2 r3 r4 r5 r6
A G G C C U
(1) i = 1, j = 5,  r1 , r5     A, C 
r7 r8 r9
U C C
r10
U
M 1, 4
M 1,5  max 
1  M 2, 4    r1 , r5 
 max 0,0  0.
(2) i = 2, j = 6,  r2 , r6    G,U   1
M 2,5
M 2, 6  m a x
1  M 3,5    r2 , r6 
 m a x0, (1  0)  1  m a x0,1  1.
r2 matches with r6 .
(3) i = 3, j = 7,  r3 , r7    G,U   1
6-9

 M 3, 6
M 3, 7  max 
1  M 4, 6    r3 , r7 
 max 0, (1  0)  1  max 0,1  1.
r3 matches with r7 .
(4) i = 4, j = 8,  r4 , r8    C, C   0
M 4, 7
M ,8  max 
1  M 5, 7   r4 , r8 
 max 0, (1  0)  1  0.
(5) i = 5, j = 9,  r5 , r9    C, C   0
 M 5,8
M 5,9  max 
1  M 6,8    r5 , r9 
 max 0, (1  0)  1  max 0,0  0.
(6) i = 6, j = 10,  r6 , r10    U ,U   0
M 6,9
M 6,10  max 
1  M 7 ,9    r6 , r10 
 max 0, (1  0)  0  max 0,0  0.
(7) i = 1, j = 6,  r1 , r6     A,U   1
M 1,6
M 1,5

 max 1  M 2,5    r1 , r6 
1  M  M    r , r 
1,1
3, 5
2 6

 max 0, (1  0)  1, (1  0  0)  1  max 0,1,1  1.
r1 matches with r6 .
6-10
(8) i = 2, j = 7,  r1 , r6    G,U   1
M 2, 7
M 2, 6

 max 1  M 3,6   r2 , r7 
1  M  M   r , r 
2, 2
4, 6
3 7

 max 1, (1  0)  1, (1  0  0)  1  max 1,1,1  1.
(9) i = 3, j = 8,  r3 , r8    G, C   1
M 3, 8
 M 3, 7

 max 1  M 4,7   r3 , r8 
1  M  M   r , r 
3, 3
5, 7
4 8

 max 1, (1  0)  1, (1  0  0)  0  max 1,1,0  1.
r3 matches with r8 .
(10) i = 4, j = 9,  r4 , r9    C , C   0
M 4,9
 M 4 ,8

 max 1  M 5,8   r4 , r9 
1  M  M   r , r 
4, 4
6 ,8
5 9

 max 0, (1  0)  0, (1  0  0)  0
 max 0,0,0  0.
(11) i = 5, j = 10,  r5 , r10    C, C   0
M 5,10
 M 5, 9

 max 1  M 6,9   r5 , r10 
1  M  M   r , r 
5, 5
7 ,9
6 10

 max 0, (1  0)  0, (1  0  0)  0
 max 0,0,0  0.
(12) i = 1, j = 7,  r1 , r7     A,U   0
6-11
M 1, 7
M 1, 6

1  M 2, 6    r1 , r7 
 max 
1  M 1,1  M 3, 6    r2 , r7 
1  M  M   r , r 
1, 2
4, 6
3 7

 max 1, (1  1)  1, (1  0  0)  1, (1  0  0)  1
 max 1,2,1,1
 2.
(13) i = 2, j = 8,  r2 , r8    G, C   1
M 2 ,8
M 2, 7

1  M 3, 7    r2 , r8 
 max 
1  M 2, 2  M 4, 7    r3 , r8 
1  M  M    r , r 
2,3
5, 7
4 8

 max 1, (1  1)  1, (1  0  0)  1, (1  0  0)  0
 max 1,2,1,0
 2.
r2 matches with r8 ; r3 matches with r7 .
(14) i = 3, j = 9,  r3 , r9    G, C   1
M 3, 9
 M 3, 8

1  M 4,8    r3 , r9 
 max 
1  M 3,3  M 5,8    r4 , r9 
1  M  M    r , r 
3, 4
6 ,8
5 9

 max 1, (1  0)  1, (1  0  0)  0, (1  0  0)  0
 max 1,1,0,0
 1.
r3 matches with r9 .
(15) i = 4, j = 10,  r4 , r10    C ,U   0
6-12
M 4,9

1  M 5,9    r4 , r10 
M 4,10  max 
1  M 4, 4  M 6,9    r5 , r10 
1  M  M    r , r 
4,5
7 ,9
6 10

 max 0, (1  0)  0, (1  0  0)  0, (1  0  0)  0
 max 0,0,0,0
 0.
(16) i = 1, j = 8,  r1 , r8     A, C   0
M 1,8
M 1, 7

1  M 2, 7    r1 , r8 

 max 1  M 1,1  M 3, 7    r2 , r8 
1  M  M    r , r 
1, 2
4,7
3 8

1  M 1,3  M 5, 7    r4 , r8 

 max 2, (1  1)  0, (1  0  1)  1, (1  0  0)  1, (1  0  0)  0
 max 2,0,1,1,0
 2.
r1 matches with r7 ; r2 matches with r6 .
(17) i = 2, j = 9,  r2 , r9    G, C   1
M 2,9
 M 2 ,8

1  M 3,8    r2 , r9 

 max 1  M 2, 2  M 4,8    r3 , r9 
1  M  M    r , r 
2,3
5,8
4 9

1  M 2, 4  M 6,8    r5 , r9 

 max 2, (1  1)  1, (1  0  0)  1, (1  0  0)  0, (1  0  0)  0
 max 2,2,1,0,0
 2.
r2 matches with r9 ; r3 matches with r8 .
(18) i = 3, j = 10,  r3 , r10    G,U   1
6-13
 M 3, 9

1  M 4,9    r3 , r10 

M 3,10  max 1  M 3,3  M 5,9    r4 , r10 
1  M  M    r , r 
3, 4
6,9
5 10

1  M 3,5  M 7 ,9    r6 , r10 

 max 1, (1  0)  1, (1  0  0)  0, (1  0  0)  0, (1  0  0)  0
 max 1,1,0,0,0
 1.
r3 matches with r10
(19) i = 1, j = 9,  r1 , r9     A, C   0
M 1,9
M 1,8

1  M 2,8    r1 , r9 
1  M  M    r , r 
1,1
3, 8
2 9

 max 
1  M 1, 2  M 4,8    r3 , r9 
1  M  M    r , r 
1, 3
5,8
4 9

1  M 1, 4  M 6,8    r5 , r9 
 max 2, (1  2)  0, (1  0  1)  1, (1  0  0)  1, (1  0  0)  0, (1  0  0)  0
 max 2,0,1,1,0,0
 2.
r1 matches with r7 ; r2 matches with r6 .
(20) i = 2, j = 10,  r2 , r10    G,U   1
M 2,9

1  M 3,9    r2 , r10 
1  M  M    r , r 
2, 2
4,9
3 10

M 2,10  max 
1  M 2,3  M 5,9    r4 , r10 
1  M  M    r , r 
2, 4
6,9
5 10

1  M 2,5  M 7 ,9    r6 , r10 
 max 2,2,1,1,0,0
 2.
r2 matches with r10 ; r3 matches with r9
6-14
(21) i = 1, j = 10,  r1 , r10     A,U   1
M 1,9

1  M 2,9    r1 , r10 
1  M  M    r , r 
1,1
3, 9
2 10

M 1,10  max 1  M 1, 2  M 4,9    r3 , r10 
1  M  M    r , r 
1, 3
5, 9
4 10

1  M 1, 4  M 6,9    r5 , r10 

1  M 1,5  M 7 ,9    r6 , r10 
 max 2,3,2,1,0,0,0
 3.
r1 matches with r10 ; r2 matches with r9 ; r3 matches with r8 .
In the following, we analyze the time-complexity of Algorithm 6.1. Clearly, the
cost ofStep 1 is O(n). For Step 2, there are at most
 
n
n
i 1
j i  4
iterations and each iteration costs  (j . i) time. Hence, the cost of
Step 2 is
 
n
n
i 1
j i  4
( j  i)  (n 3 )
Hence, the total time-complexity of Algorithm 6.1 is (n 3 ) .
6.3 Loop Dependent Free Energy Rules
Let us consider Figure 6.2. The secondary structure in Figure 6.2 (c) is similar to that
in Figure 6.2 (f). Yet it is obvious that the structure in Figure 6.2 (f) is much better
than that in Figure 6.2 (c) because there is a base, namely base C, inside the loop
A–G–C–C–U of Figure 6.2 (c), which is not paired with any other base. On the other
hand, there is no such an unpaired base in the loop A–G–C–U of Figure 6.2 (f). This
is shown in Figure 6.7.
6-15
Figure 6.7: The Loop Marked in Structure (b) Is More Stable Than That in
Structure(a)
When two base pairs are consecutive, we say that there is a stacking interaction
between them and this interaction increases the stability of the secondary structure.
We therefore prefer loops with stacking interaction.
Suppose that (ri, rj) is a base pair in S. A base rp, where i < p < j, is accessible from
(ri, rj) if there is no other base pair (ri_, rj_) in S such that i < i_ < p < j_ < j.
Similarly, we say that the base pair (rp, rq) is accessible from (ri, rj) if both rp and rq
are accessible from (ri, rj ). For instance, in Figure 6.8, bases r3 and r8 are accessible
from base pair (r2, r9) (i.e., base pair (r3, r8) is accessible from (r2, r9)), but other bases
are not accessible from (r2, r9).
Given a secondary structure, the loop is a substructure consisting of (ri, rj) and all
bases accessible from (ri, rj). It can be seen that every base pair (ri, rj) corresponds to a
loop and (ri, rj) is called the exterior (or closing) base pair of this loop and other base
pairs accessible from (ri, rj) are called interior base pairs of this loop. The size and
degree of a loop are de.ned to be the number of unpaired bases and the number of
base pairs in the loop, respectively. For example, the secondary structure of Figure 6.8
contains the following three loops.
Loop 1: {r1, r2, r9, r10} (i.e., A–G–C–U),
Loop 2: {r2, r3, r8, r9} (i.e., G–G–C–C),
Loop 3: {r3, r4, r5, r6, r7, r8} (i.e., G–C–C–U–U–C),
where the exterior base pair, interior base pair, size and degree of each loop are listed
in Table 6.2.
6-16
According to the degree and size, the loops can be further distinguished as one of
the
Figure 6.8: A Secondary Structure with Three Loops
Table 6.2: The Exterior BP (Base Pair), Interior BP, Size and Degree of Loops 1, 2
and 3 in Figure 6.8
following different types.
(1) Hairpin loop: A loop of degree 1 is called a hairpin loop. See Figure 6.9 (a).
(2) Stacked pair: A loop of degree 2 is called a stacked pair if its size is zero. See
Figure 6.9 (b).
(3) Bulge loop: A loop of degree 2 and non-zero size is called a bulge loop if its
exterior and interior base pairs are adjacent. See Figure 6.9 (c).
(4) Interior loop: A loop of degree 2 and non-zero size is called an interior loop if its
exterior and interior base pairs are not adjacent. See Figure 6.9 (d).
(5) Multiloop: A loop of degree greater than 2 is called a multiloop. See Figure 6.9
(e).
In a secondary structure, the collection of adjacent unpaired bases which are not
accessible by any base pair is called an external element (see Figure 6.10). For
6-17
convenience, we can view the external element as a special type of loop, called an
exterior loop. It is not hard to see that any secondary structure S can be uniquely
decomposed into loops (see Figure 6.10). If we assign an energy to each loop in S,
then the free energy of S is assumed to be the sum of the energies of all loops.
Since the energy of a folded structure is measured relatively to the unfolded
sequence, exterior loops do not contribute any energy. Hence, we assume that the
energies of exterior loops are zero. The energies of other loops depend only on the
size and type of the loop and are usually determined by experiment. We use the
following notation to denote the energies of various loops.
Figure 6.9: Various Types of Loops (with Exterior Base Pair (ri, rj)): (a) A Hairpin
Loop of Degree 1 and Size 4. (b) A Stacked Pair of Degree 2 and Size 0. (c) A Bulge
Loop of Degree 2 and Size 4. (d) An Interior Loop of Degree 2 and Size 8. (e) A
Multiloop of Degree 4 and Size 12.
6-18
Figure 6.10: The Secondary Structure of an RNA Can Be Uniquely Decomposed into
Loops
(1) Hairpin loops: We use H(k) to denote the energy of a hairpin loop with size k.
(2) Stacked pairs: We use S to denote the energy of a stacked pair.
(3) Bulge Loops : We use B(k) to denote the energy of a bugle loop with size k.
(4) Interior loops: We use I(k) to denote the energy of an interior loop with size k.
(5) Multiloop: We use M to denote the energy of a multiloop, which usually
expressed by the following affine penalty function.
M  M E  M I  (deg ree  1)  M B  size
where ME,MI and MB are constants, and degree and size are the degree and
size of the loop, respectively. In the right hand side of above equation,
‧ the first term ME denotes the stabilizing contribution from the exterior base pair,
‧ the second term MI ×(degree - 1) denotes the stabilizing contributions from interior
base pairs (there are (degree - 1) interior base pairs), and
‧
the last termMB×size denotes the destabilizing contributions from unpaired bases
(there are size unpaired bases).
In next section, we shall introduce a dynamic programming algorithm to find a
secondary structure with the minimum free energy.
6.4 Minimum Free Energy Algorithm
In this section, we adopt the loop dependent free energy rules to describe the
secondary structure prediction problem. In this model, any secondary structure can be
uniquely decomposed into loops and the free energy of the structure is the sum of the
6-19
energies of all loops. Then the problem is to find an optimal secondary structure (i.e.,
a secondary structure with the minimum free energy). In the following, we shall
introduce a dynamic
programming algorithm to solve this problem.
Recall that the admissible base pairs we consider here are Watson-Crick base pairs
(i.e., G≡C and A=U) and wobble base pairs (i.e., G–U) and we use a function ρ(ri, rj)
to indicate whether any two bases ri and rj can be a legal base pair:
if (ri , r j )  WW
1
 (ri , r j )  
  otherwise
where
WW = {(A, U), (U, A), (G, C), (C, G), (G, U), (U, G)}
Let Si,j denote the optimal structure of the substring Ri,j = riri+1 · · · rj. We use Ei,j to
denote the free energy of Si,j . To compute Ei,j , we consider the following three cases
from rj point of view.
Case 1: In the optimum solution, rj is not paired with any other base. Then we have
Ei,j = Ei,j-1.
Case 2: In the optimum solution, ri is paired with rj and ρ(ri, rj) = 1. Then there may
be one or more loops between ri and rj . For simplicity, we use Li,j to denote the
structure with the minimum free energy in this case and use Fi,j to denote the free
energy of Li,j . (The calculation of Fi,j will be described later.) Then we have Ei,j = Fi,j .
Case 3: In the optimum solution, ri is paired with some rk, where i+1  k  j-4,
and ρ(rk, rj) = 1. In this case, we can divide Ri,j into two subsequences Ri,k.1 and Rk,j
such that
 i , j   i ,k 1  Fk , j
Since we want to .nd the k between i + 1 and j - 4 such that  i, j is the minimum, we
have
 i , j  min { i , k 1  Fk , j }
i 1 k  j  4
In summary, we have the following recursive formula to compute  i, j
6-20
 i, j

 i , j 1
 min  Fi , j   (ri , rj )

{( i , k 1  Fk , j )   (rk , rj )
i 1min
k  j 4
By definition, ri and rj cannot form a base pair if j - i  t = 3 since Ri,j does not fold
itself too sharply. Hence, we have to set the boundary conditions of functions  and
F as follows.
 i , j  Fi , j  
if
j i 3
Next, we explain how to compute Fi,j in detail. Since (ri, rj) is a base pair in Li,j ,(ri,
rj) must be an exterior base pair of some one loop, say L. According to the loop type
of L, we consider the following four cases.
Case 1: L is a hairpin loop. Since the size of L is j.i.1, we have Fi,j = H(j-i-1).
Figure 6.11: Illustration of Case 1
Case 2: L is a stacked pair. Then we have Fi,j = S + Fi+1,j-1.
Figure 6.12: Illustration of Case 2
6-21
Case 3: L is a bugle loop. Let (rp, rq) be the interior base pair of L. By de.nition,(ri, rj)
and (rp, rq) are adjacent with either p = i + 1 or q = j - 1 (but not both).
(1) Suppose that p = i+1 and q _= j.1. Then the range of q is [p+4, j.2] = [i+5, j.2]
and the size of L is j - q - 1. Hence, we have
Fi , j  min {( j  q  1)  Fi1,q }
i 5 p j 2
(2) Suppose that q = j.1 and p _= i+1. Then the range of p is [i+2, q-4] = [i+2, j.5] and
the size of L is p - i - 1. Hence, we have
Fi , j  min {( p  i  1)  Fp, j 1}
i 2 p j 5
Figure 6.13: Illustration of Case 3
As a result, we have
 min {( j  q  1)  Fi1,q
i5q j 2
Fi , j  min 
{( p  i  1)  Fp , j 1
i2min
 p j 5
Case 4: L is an interior loop. Let (rp, rq) be the interior base pair of L. Then we have
i+1  p+3 < q  j -1 and the size of L is p- i+ j -q - 2. Since, by definition, (ri, rj)
and (rp, rq) are not adjacent, we have p . i + j . q ≡ 4. Hence, we have
Fi , j 
min {( p  i  j  q  2)  Fp ,q }
i 1 p 3 j 1
p 1 j q4
6-22
Figure 6.14: Illustration of Case 4
Case 5: L is a multiloop. Suppose that (rp, rq) is the rightmost interior base pair of L
(see Figure 6.15). In this case, we can represent the minimum free energy of L as
follows.
Fi , j 
min {( p  i  j  q  2)  Fp ,q }
i 1 p 3 j 1
p 1 j q4
where
‧ g 1p , j 1 is the minimum free energy of the substructure Lp,q plus the energies
contributed from interior base pair (rp, rq) and unpaired bases rq+1rq+2 · · · rj.1. That is,
g 1p , j 1  min {Fp ,q  M I  M B  ( j  q  1)}
pq j
‧ g i21, p1 is the minimum free energy of the remaining section of L (the calculation of
g i21, p1 will be described later).
6-23
Figure 6.15: Illustration of Case 5: (rp, rq) Is the Rightmost Interior Base Pair of
Multiloop
In the following, we will discuss the calculation of g i21, p1 , which is the minimum
free energy of the remaining section L_ of L. This section L_ may contain one or more
loops.
Case 1: Suppose that L_ contains only one loop. Then g i21, p1 is equal to g 1k , p 1 plus
the energies contributed from unpaired bases ri+1ri+2 · · · rk.1. That is,
g i21, p 1  min {g 1k , p 1  M B  (k  i  1)}
ik  p
Figure 6.16: Illustration of Case 1
6-24
Case 2: Suppose that L' contains two or more loops. Then we have
g i21, p 1  min {g 1k , p 1  g i21,k 1}
ik  p
Figure 6.17: Illustration of Case 2
In summary, we have
g
2
i 1, p 1
1

min i k  p {g k , p 1  M B  (k  i  1)}
 min 
1
2

min u k  p {g k , p 1  g i 1,k 1}
According to the discussion above, we have the following recursive formula to
compute Fi,j .
‧ If j . i  3, then Fi,j = +∞.
‧ If j . i > 3, then
Fi , j

 H ( j  i  1)

S  Fi 1, j 1

 min {B( j  q  1)  Fi 1,q 


i 5 q  j  2

 min 

{B( p  i  1)  F p , j 1 

i  2min
 p  j 5


{( p  i  j  q  2)  F p ,q }
i 1 pmin
 3 q  j 1
 p i  j  q  4

{M E  g 1p , j 1  g i21, p 1 }
imin
 p j
where
g 1p , j 1  min {Fp ,q  M I  M B  ( j  q  1)}
pq j
6-25
g
2
i 1, p 1
1

min i k  p {g k , p 1  M B  (k  i  1)}
 min 
1
2

min u k  p {g k , p 1  g i 1,k 1}
By the above recursive formula, we can compute all Fi,j, 1  i  j  n
in (n 4 )
time using dynamic programming technique. Here, we assume that the values of Fi,j
have been computed in advance and design Algorithm 6.2 to compute the minimum
free energy  1,n of an RNA sequence R1,n = r1r2 · · · rn. We analyze the
time-complexity of Algorithm 6.2 as follows. Clearly, the costs of Steps 1 and 2 are
O(n). For Step 3, there are at most
n
n
i 1
j i  4
 
iterations and each iteration costs O(j . i) time. Hence, the
cost of Step 3 is
n
n
i 1
j i  4
 
( j  i)  (n 3 )
However, the preprocessing of Fi,j costs
time-complexity of Algorithm 6.2 is (n 4 ) .
(n 4 )
time. Hence, the total
Algorithm 6.2 An RNA minumum free engergy algorithm
Input: An RNA sequence R = r1r2 · · · rn.
Output: The minimum free energy  i,n of sequence R.
Step 1: /* Computation of ρ(ri, rj) function for 1  i < j  n */
WW = {(A, U), (U, A), (G, C), (C, G), (G, U), (U, G)};
for i = 1 to n do
for j = i to n do
if (ri, rj)
then ρ(ri, rj) = 1; else ρ(ri, rj) = +∞;
end for
end for
Step 2: /* Initialization of  i, j for j - i  3 */
for i = 1 to n do
for j = i to i + 3 do
if j  n then  i, j = +∞;
6-26
end for
end for
Step 3: /* Calculation of  i, j for j - i > 3 */
for i = n . 4 downto 1 do
for j = i + 4 to n do
case1 =  i , j 1 ;
case2 = Fi,j × ρ(ri, rj);
case3 =
min {( i ,k 1  Fk , j )   (rk , r j )
i 1 k  j  4
 i, j = min{case1, case2, case3};
end for
end for
6.5
The RNA Secondary Structure Prediction with Simple
Pseudoknots
In the previous sections, we discussed RNA secondary structure prediction.
The model is called the RNA secondary structure without pseudoknots.
In this
model, given a sequence of amino acids a1 a1  a n , if ( a i , a j ) is a base pair and
(ah , ak ) is another base pair, and i < h, then i  h  k  j .
An example is
illustrated in Fig. 6.18.
3
4
2
5
1
6
1
2
(a)
3
4
5
6
(b)
Fig. 6.18. An RNA Secondary Structure without Pseudoknots and its Schemeatic
6-27
Diagram.
Fig. 6.19 shows an RNA secondary structure with pseudoknots. In this model,
suppose that ( a i , a j ) and (ah , ak ) are two base pairs in an RNA secondary
structure. Then it is possible that i  h  j  k , as illustrated in Fig. 6.19.
6
3
2
4
5
1
2
3
4
5
6
1
(a)
(b)
Fig. 6.19. An RNA Secondary Structure with Simple Pseudoknots
and Its Schematic Diagram
The model in Fig. 6.19 is actually not general enough. Fig. 6.20 shows a more
general case for pseudoknots. The model in Fig. 6.19 is called the simple
pseudoknot model while the model in Fig. 6.20 is a recursive pseudoknot model. We
shall only discuss the simple pseudoknot model in this book.
6-28
14
4
13
3
12
11
9
10
5
8
6
2
7
1
(a)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
(b)
Fig. 6.20.
An RNA Secondary Structure with Recursive Pseudoknots
and Its Schematic Diagram
If we compare the three models, we may make some conclusion as follows:
1.
2.
3.
4.
The RNA itself is considered as a linear sequence if its secondary structure is
not considered.
If the RNA secondary structure is considered, there is at most one turn in the
model without pseudoknots. That is, the terminal amino acid may turn
back to the starting amino acid. See Fig. 6.18.
In the RNA secondary structure with simple pseudoknots, there are at most
two turns. That is, the sequence turns and turns again. See Fig. 6.19.
In the RNA secondary structure with recursive pseudoknots, there may be
more than two turns. See Fig. 6.20.
Formal Definition of Simple Pseudoknots
6-29
Consider an RNA sequence a1a2 an . This sequence has a secondary structure
with simple pseudoknots if there exist j1 and j 2 such that the following conditions
are satisfied:
1.
Each base pair ( a i , a j ) must be such that either 1  i  j1  j  j2 or
j1  i  j2  j  n .
2.
If
base pairs ( a i , a j ) and ( a i ' , a j ' ) are such that i  i'  j1 or j2  i  i' , then
j  j '.
Conditions 1 and 2 mentioned above are now illustrated in Fig. 6.21 and Fig.
6.22 respectively.
1
i
j1
j
j2
n
(a)
Fig. 6.21
1
j1
j2
i
(b)
Condition 1 of the Definition of Simple Pseudoknots
6-30
j
n
1
i
i'
j1
j'
j
j2
n
j2
j'
j
n
(a)
1
j1
i
i'
(b)
Fig. 6.22
Condition 2 of the Definition of Simple Pseudoknots
To give the reader some feeling of the above discussion, let us consider the
following RNA sequence:
S: CUUCAUCAGGAAAUGAC
A secondary structure of the above sequence without pseudoknots is shown in
Fig. 6.23 and that with simple pseudoknots is shown in Fig. 6.24. In Fig.6.24, we
have also indicated the positions of j1 and j 2 .
G
A
G
A
C
A
U
A
A
U
C
G
U
U
A
C
C
6-31
Fig. 6.23
Secondary Structure of Sequence S without Pseudoknots
C
j1
U
A
C
G
A
A
U
C
G
U
G
U
A
C
A
A
Fig.6.24
j2
A Secondary Structure of Sequence S with Simple Pseudoknots
It can be seen from the above two figures that in this case, the secondary
structure with simple pseudoknots contains more base pairs than that without
pseudoknots.
Of course, given an RNA sequence, there may be many different secondary
structures corresponding to this sequence. An optimal RNA secondary structure
with simple psudoknots is a secondary structure which contains maximum number of
base pairs among all secondary structures with simple pseudoknots. The RNA
secondary structure with simple pseudoknots problem is thus defined as follows:
Given an RNA, find an optimal secondary structure with simple pseudoknots for this
sequence.
The above problem is a polynomial problem and can be solved by a dynamic
programming algorithm which will be introduced below.
Given an RNA sequence S  a1a1 an , let us consider three amino acids
a i , a j and a k where 1  i  j  k  n .
There are three cases:
6-32
Case 1. ( a i , a j ) is a base pair.
Let L(i, j , k ) denote the number of base pairs of
an optimal secondary structure with simple pseudoknots of
S
from a i to a k
corresponding to Case 1.
Case 2. Neither ( a i , a j ) nor (a j , a k ) is a base pair.
Let M (i, j , k ) denote the
number of base pairs of an optimal secondary structure of S with simple pseudoknots
from a i to a k corresponding to Case 2.
Case 3. (a j , a k ) is a base pair.
Let R(i, j , k ) denote the number of base pairs of
an optimal secondary structure of
S with simple pseudoknots from a i to a k
corresponding to Case 3.
Perhaps it is significant to pause here to think back about the model without
pseudoknots. In that case, if we use the dynamic programming approach, we always
start to think whether (ai , a k ) is a base pair. In the model with simple pseudoknots,
we do not consider whether (ai , a k ) is a base pair. Instead, we consider whether
( a i , a j ) is a base pair and whether (a j , a k ) is a base pair.
In the following, let v(ai , a j )  1 if ( a i , a j ) is a base pair; otherwise, let
v(ai , a j )   . Consider Case 1.
We may think along the following line:
1. Since ( a i , a j ) is a base pair, we may consider the optimal secondary structure
with simple pseudoknots where (ai 1 , a j 1 ) is a base pair, as shown in Fig.
6.25(a).
2. Since ( a i , a j ) is a base pair, we may consider the optimal secondary structure
with simple pseudoknots where neither (ai 1 , a j 1 ) nor ( a j 1 , a k ) is a base pair,
as shown in Fig. 6.25(b).
3. Since ( a i , a j ) is a base pair, we may consider the optimal secondary structure
with simple pseudoknots where ( a j 1 , a k ) is a base pair as shown in Fig. 6.25(c).
6-33
k
k
k
i
j
i
j
i
j
i-1
j+1
i-1
j+1
i-1
j+1
(a)
(b)
(c)
Fig. 6.25 Recursive formulas for L(i,j,k)
The above discussion shows that we have the following recursive formula:
L(i  1, j  1, k ) 


L(i, j, k )  v(ai , a j )  max M (i  1, j  1, k )
R(i  1, j  1, k ) 


(1)
An illustration of the equation for L(i,j,k) is shown in Fig. 6.25.
Using similar reasoning, we have
L(i, j  1, k  1 


R(i, j, k )  v(a j , a k )  max M (i, j  1, k  1)
R(i, j  1, k  1) 


(2)
The above equation is illustrated in Fig.6.26.
k
i
j
k-1
i
j
j+1
(a)
k
k
k-1
i
j
j+1
j+1
(b)
6-34
(c)
k-1
Fig. 6.26. Recursive Formulas for R(i,j,k)
Finally, we have the following recursive formula for M(i,j,k).
M (i  1, j, k ), M (i, j  1, k ), M (i, j, k  1)


m(i, j, k )  max L(i  1, j, k ), L(i, j  1, k )

R(i, j  1, k ), R(i, j, k  1)



(3)
We need some formulas for the boundary conditions:
1. L(i, j , j )  v(ai , a j ) for all i<j .
(4)
2. L(i0  1, j , k )  R(i0  1, j, k )  M (i 0,1, j, )  0 if k=j or k=j+1.
(5)
Let A(i0 , k 0 ) denote the number of base pairs of in optimal RNA secondary
structure with simple pseudoknots from ai 0 to a k 0 . Obviously,
A(i0 , j0 ) 
max
L(i, j, k ), M (i, j, k ), R(i, j, k )
i 0i  j k k 0
(6)
Given an RNA S  a1a2 an , an optimal secondary structure of S can be
determined by using Equation (6).
In the following, we give an algorithm which produces an optimal RNA
secondary structure.
Algorithm 6.3: An algorithm which produces an optimal RNA secondary structure
with simple pseudoknots.
Input: An RNA sequence S= a1 , a2 ,, an .
Output: An optimal secondary structure of RNA with simple pseudoknots
Step1:
WW={(A,U),(U,A),(G,C),(C,G),(G,U),(U,G)};
for i = 1 to n do
for j = i+2 to n do
if (ai , a j )  WW then v ( a i , a j ) =1; else v ( a i , a j ) =0;
6-35
end for
end for
Step2:
for i = 1 to n - 2 do
for j = i + 1 to n - 1 do
for k = j + 1 to n do
case1= L( i, j, k );
case2= R( i, j, k );
case3= m( i, j, k );
Ai , k =max{case1,case2,case3};
Tracepairi,k=max{ Lpairi,j,k , Mpairi,j,k , Rpairi,j,k } ;
end for
end for
end for
L( i, j, k )
{
if i = 0 and ( k = j or k = j + 1 ) then Li,j,k=0 return ;
if k=j
and i<j then
if v ( a i , a j ) >0 then Lpairi,j,k = i , j ;
Li,j,k= v ( a i , a j ) return ;
else{
case1=L( i - 1, j + 1, k ) ;
case2=R( i - 1, j + 1, k ) ;
case3=m( i - 1, j + 1, k ) ;
Li,j,k = v ( a i , a j ) + max{case1,cas2,case3} ;
Lpairi,j,k = i , j+max{ Lpairi-1,j+1,k , Mpairi-1,j+1,k , Rpairi-1,j+1,k } ;
return ;
}
}
6-36
R( i, j, k )
{
if i = 0 and ( k = j or k = j + 1 ) then Ri,j,k=0 return ;
if k=j+1 then Ri,j,k=0 return ;
else
{
case1=L( i, j + 1, k – 1 ) ;
case2=R( i, j + 1, k – 1 ) ;
case3=m( i, j + 1, k – 1 ) ;
Ri,j,k = v(a j , ak ) + max{case1,cas2,case3} ;
Rpairi,j,k = j , k+max{ Lpairi,j+1,k-1 , Mpairi,j+1,k-1 , Rpairi,j+1,k-1 } ;
return ;
}
}
m( i, j, k )
{
if i = 0 and ( k = j or k = j + 1 ) then mi,j,k=0 return ;
case1= m( i-1, j, k ) ;
case2= m( i, j+1, k ) ;
case3= m( i, j, k-1 ) ;
case4= L( i-1, j, k ) ;
case5= L( i, j+1, k ) ;
case6= R( i, j+1, k ) ;
case7= R( i, j, k-1 ) ;
mi,j,k= max{case1,case2,case3,case4,case5,case6,case7};
Mpairi,j,k = max{ Mpairi-1,j,k , Mpairi,j+1,k , Mpairi,j,k-1 , Lpairi-1,j,k ,
Lpairi,j+1,k , Rpairi,j+1,k, Rpairi,j,k-1 };
return ;
}
The following is a printing of the output of applying the above algorithm to the
following sequence:
a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16
C U U C A U C A G G A A A U G A
6-37
C
Tracepair : (1,10), (2,9), (3,8), (5,15), (6,14), (7,13)
The result is the same as shown in Fig. 6.24.
6-38
Download