6.5 The RNA Secondary Structure Prediction with Simple Pseudoknots

CHAPTER 6 The Secondary Structure Prediction of RNA

In Chapter 1, we mentioned that there are many different types of RNA, such as mRNA, tRNA and rRNA, each with a different function. It is well known that the function of an RNA is determined by its three-dimensional structure. Hence, knowing the three-dimensional structures of RNAs is important for understanding their functions. The three-dimensional structure of an RNA can be determined experimentally by X-ray crystallography and nuclear magnetic resonance (NMR). But these experimental methods are difficult and quite time consuming. Moreover, they are not always feasible because X-ray crystallography can be applied only to crystallized molecules and NMR is limited to small molecules (fewer than about 200 nucleotides at present).

It was discovered that the three-dimensional structure of an RNA can be uniquely determined from its sequence (i.e., the primary sequence). Hence, much theoretical effort has been made in determining the three-dimensional structure of an RNA from its sequence alone. Up to now, it is still hard to predict the three-dimensional structure of an RNA directly from its sequence. However, there are efficient algorithms to predict the secondary structure of an RNA, which is useful in predicting the three-dimensional structure. To predict the three-dimensional structure of an RNA sequence, we can first determine its secondary structure and then predict its three-dimensional structure according to this secondary structure (see Figure 6.1).

Figure 6.1: Folding of Phenylalanyl-Transfer-RNA into Its Spatial Structure. (a) The Primary Structure of the RNA (b) The Secondary Structure of the RNA (c) The Three-Dimensional Structure of the RNA

RNA is a single strand of nucleotides (bases): adenine (A), guanine (G), cytosine (C) and uracil (U). The sequence of the bases A, G, C and U is called the primary structure of an RNA.
In RNA, G and C can form a base pair G≡C by a triple-hydrogen bond, A and U can form a base pair A=U by a double-hydrogen bond, and G and U can form a base pair G–U by a single hydrogen bond. Due to these hydrogen bonds, the primary structure of an RNA can fold back on itself to form its secondary structure. For example, suppose that we have the following RNA sequence.

A–G–G–C–C–U–U–C–C–U

Then, this sequence can fold back on itself to form many possible secondary structures. In Figure 6.2, we show six possible secondary structures of this sequence. In nature, however, only one secondary structure corresponds to an RNA sequence. What is the actual secondary structure of an RNA sequence?

Figure 6.2: Six Possible Secondary Structures of RNA Sequence A–G–G–C–C–U–U–C–C–U (The Dashed Lines Denote the Hydrogen Bonds)

According to the thermodynamic hypothesis, the actual secondary structure of an RNA sequence is the one with the minimum free energy, which will be explained later. In nature, only a stable structure can exist, and the stable structure must be the one with the minimum free energy. In a secondary structure of an RNA, base pairs increase the structural stability, whereas unpaired bases decrease it. Base pairs of the types G≡C and A=U (called Watson-Crick base pairs) are more stable than those of the type G–U (called wobble base pairs). According to these factors, we can find that the secondary structure of Figure 6.2 (f) is the actual secondary structure of sequence A–G–G–C–C–U–U–C–C–U.

Hence, we formally define the secondary structure prediction problem as follows. Given an RNA sequence, determine the secondary structure of the minimum free energy for this sequence. In the following sections, we shall give a formal definition of the secondary structure of an RNA and then introduce some efficient algorithms to predict the secondary structure of an RNA sequence.
6.1 Secondary Structure of RNA

An RNA sequence will be represented as a string of n characters $R = r_1 r_2 \cdots r_n$, where $r_i \in \{A, C, G, U\}$. Typically, n can range from 20 to 2000. A secondary structure of R is a set S of base pairs $(r_i, r_j)$, where $1 \le i < j \le n$, such that the following conditions are satisfied.

(1) $j - i > t$, where t is a small positive constant. Typically, t = 3.
(2) If $(r_i, r_j)$ and $(r_k, r_l)$ are two base pairs in S and $i \le k$, then either
(a) i = k and j = l, i.e., $(r_i, r_j)$ and $(r_k, r_l)$ are the same base pair,
(b) i < j < k < l, i.e., $(r_i, r_j)$ precedes $(r_k, r_l)$, or
(c) i < k < l < j, i.e., $(r_i, r_j)$ includes $(r_k, r_l)$.

The first condition implies that the RNA sequence does not fold too sharply on itself. The second condition means that each nucleotide can take part in at most one base pair, and guarantees that the secondary structure contains no pseudoknot. Two base pairs $(r_i, r_j)$ and $(r_k, r_l)$ are called a pseudoknot if i < k < j < l (see Figure 6.3). Pseudoknots do occur in RNA molecules, but their exclusion simplifies the problem. By the above definition, a secondary structure can be represented as an outerplanar graph with degree at most 3, where an outerplanar graph is a graph which can be drawn in the plane in such a way that all vertices (i.e., nucleotides) are arranged on a circle and all edges (i.e., base pairs) lie inside the circle and do not intersect. Note that the example in Figure 6.3, which contains a pseudoknot, cannot be represented as an outerplanar graph.

Figure 6.3: An Example of a Pseudoknot

Recall that the goal of the secondary structure prediction is to find a secondary structure with the minimum free energy. Hence we must have a method to calculate the free energy of a secondary structure S.
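The two conditions of the definition above can be checked mechanically. The following is a minimal Python sketch under stated assumptions: base pairs are given as 1-based index tuples (i, j), and the function name `is_valid_secondary_structure` is hypothetical, introduced here only for illustration.

```python
def is_valid_secondary_structure(n, pairs, t=3):
    """Check conditions (1) and (2) for a set of 1-based base pairs (i, j).

    n is the sequence length; t is the minimum-separation constant
    (t = 3 in the text). Names are illustrative, not from the source.
    """
    pairs = sorted(set(pairs))
    used = set()
    for i, j in pairs:
        # Indices must satisfy 1 <= i < j <= n and condition (1): j - i > t.
        if not (1 <= i < j <= n) or j - i <= t:
            return False
        # Each nucleotide may take part in at most one base pair.
        if i in used or j in used:
            return False
        used.update((i, j))
    # Condition (2): no two pairs may cross, i.e. no i < k < j < l.
    for a, (i, j) in enumerate(pairs):
        for k, l in pairs[a + 1:]:
            if i < k < j < l:
                return False
    return True


nested = [(1, 10), (2, 9), (3, 8)]   # pairs nested inside each other
crossing = [(1, 6), (3, 9)]          # 1 < 3 < 6 < 9 is a pseudoknot
print(is_valid_secondary_structure(10, nested))
print(is_valid_secondary_structure(10, crossing))
```

The pairwise crossing check is quadratic in the number of base pairs, which is adequate for illustration.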
Since the formation of base pairs has a stabilizing effect on the structural free energy, the simplest method of measuring the free energy of S is to assign an energy to each base pair of S; the free energy of S is then the sum of the energies of all base pairs. Because the base pairs involve different numbers of hydrogen bonds, their energies are usually assigned different values. For example, reasonable values for G≡C, A=U and G–U are -3, -2 and -1 (kcal/mole), respectively. Another possibility is to make the energies of all base pairs equal. In this case, the problem becomes that of finding a secondary structure with the maximum number of base pairs. This version of the secondary structure prediction problem is also called the RNA maximum base pair matching problem, since we can view a secondary structure as a matching. In the next section, we will introduce a dynamic programming algorithm to find a secondary structure with the maximum number of base pairs.

6.2 The RNA Maximum Base Pair Matching Algorithm

In this section, we shall consider the RNA maximum base pair matching problem, which is defined as follows. Given an RNA sequence $R = r_1 r_2 \cdots r_n$, find a secondary structure of R with the maximum number of base pairs.

Let $S_{i,j}$ denote the secondary structure with the maximum number of base pairs on the substring $R_{i,j} = r_i r_{i+1} \cdots r_j$. Denote by $M_{i,j}$ the size of $S_{i,j}$ (i.e., $M_{i,j} = |S_{i,j}|$). Notice that not every two bases $r_i$ and $r_j$, where $1 \le i < j \le n$, can be paired with each other. The admissible base pairs we consider here are Watson-Crick base pairs (i.e., G≡C and A=U) and wobble base pairs (i.e., G–U). Let WW = {(A, U), (U, A), (G, C), (C, G), (G, U), (U, G)}. Then, we use a function $\rho(r_i, r_j)$ to indicate whether two bases $r_i$ and $r_j$ can form a legal base pair:

\[
\rho(r_i, r_j) =
\begin{cases}
1 & \text{if } (r_i, r_j) \in WW \\
0 & \text{otherwise}
\end{cases}
\]

By definition, we know that the RNA sequence does not fold too sharply on itself. That is, if $j - i \le 3$, then $r_i$ and $r_j$ cannot be a base pair of $S_{i,j}$.
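The admissibility test and the minimum-separation rule can be sketched in Python as follows. This is an illustrative sketch, not the book's code: `rho` mirrors the function ρ above, while `can_pair`, `count_pairs` and `MIN_GAP` are hypothetical names, and indices are assumed to be 1-based as in the text. Under the equal-energy model, the score of a structure is simply its number of base pairs.

```python
# Admissible (Watson-Crick and wobble) base pairs, as in the set WW.
WW = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_GAP = 3  # bases r_i and r_j may pair only if j - i > 3 (assumed t = 3)


def rho(x, y):
    """Return 1 if bases x and y can form a legal base pair, else 0."""
    return 1 if (x, y) in WW else 0


def can_pair(seq, i, j):
    """True if r_i and r_j (1-based) are admissible and not too close."""
    return j - i > MIN_GAP and rho(seq[i - 1], seq[j - 1]) == 1


def count_pairs(seq, pairs):
    """Score a structure under the equal-energy model: its number of pairs."""
    for i, j in pairs:
        if not can_pair(seq, i, j):
            raise ValueError(f"({i}, {j}) is not an admissible base pair")
    return len(pairs)


seq = "AGGCCUUCCU"
print(can_pair(seq, 2, 6))   # G and U, four positions apart
print(can_pair(seq, 3, 6))   # G and U, but j - i = 3 is too sharp a fold
print(count_pairs(seq, [(1, 10), (2, 9), (3, 8)]))
```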
Hence, we let $M_{i,j} = 0$ if $j - i \le 3$. To compute $M_{i,j}$, where $j - i > 3$, we consider the following cases from the point of view of $r_j$.

Case 1: In the optimal solution, $r_j$ is not paired with any other base. In this case, find an optimal solution for $r_i r_{i+1} \cdots r_{j-1}$, and $M_{i,j} = M_{i,j-1}$.

Figure 6.4: Illustration of Case 1

Case 2: In the optimal solution, $r_j$ is paired with $r_i$ and $\rho(r_i, r_j) = 1$. In this case, find an optimal solution for $r_{i+1} r_{i+2} \cdots r_{j-1}$, and $M_{i,j} = 1 + M_{i+1,j-1}$.

Figure 6.5: Illustration of Case 2

Case 3: In the optimal solution, $r_j$ is paired with some $r_k$, where $i + 1 \le k \le j - 4$ and $\rho(r_k, r_j) = 1$. In this case, find the optimal solutions for $r_i r_{i+1} \cdots r_{k-1}$ and $r_{k+1} r_{k+2} \cdots r_{j-1}$, and

\[
M_{i,j} = 1 + M_{i,k-1} + M_{k+1,j-1}.
\]

Since we want to find the k between i + 1 and j - 4 such that $M_{i,j}$ is the maximum, we have

\[
M_{i,j} = \max_{i+1 \le k \le j-4} \{ 1 + M_{i,k-1} + M_{k+1,j-1} \}.
\]

Figure 6.6: Illustration of Case 3

In summary, we have the following recursive formula to compute $M_{i,j}$.

• If $j - i \le 3$, then $M_{i,j} = 0$.
• If $j - i > 3$, then

\[
M_{i,j} = \max
\begin{cases}
M_{i,j-1} \\
(1 + M_{i+1,j-1}) \times \rho(r_i, r_j) \\
\max\limits_{i+1 \le k \le j-4} \{ (1 + M_{i,k-1} + M_{k+1,j-1}) \times \rho(r_k, r_j) \}
\end{cases}
\]

According to the above formula, we can design Algorithm 6.1 to compute $M_{1,n}$ using the dynamic programming technique. Table 6.1 illustrates the computation of $M_{i,j}$, where $1 \le i < j \le 10$, for the RNA sequence $R_{1,10}$ = A–G–G–C–C–U–U–C–C–U. As a result, we find that the maximum number of base pairs in $S_{1,10}$ is 3 since $M_{1,10} = 3$.

Table 6.1: The Computation of the Maximum Number of Base Pairs of an RNA Sequence A–G–G–C–C–U–U–C–C–U

Algorithm 6.1 An RNA maximum base pair matching algorithm
Input: An RNA sequence $R = r_1 r_2 \cdots r_n$.
Output: The maximum number of base pairs $M_{1,n}$ of a secondary structure of R.
Step 1: /* Computation of the ρ(ri, rj) function for 1 ≤ i < j ≤ n */
  WW = {(A, U), (U, A), (G, C), (C, G), (G, U), (U, G)};
  for i = 1 to n do
    for j = i to n do
      if (ri, rj) ∈ WW then ρ(ri, rj) = 1; else ρ(ri, rj) = 0;
    end for
  end for
Step 2: /* Initialization of Mi,j for j - i ≤ 3 */
  for i = 1 to n do
    for j = i to i + 3 do
      if j ≤ n then Mi,j = 0;
    end for
  end for
Step 3: /* Calculation of Mi,j for j - i > 3 */
  for h = 4 to n - 1 do
    for i = 1 to n - h do
      j = i + h;
      case1 = Mi,j-1;
      case2 = (1 + Mi+1,j-1) × ρ(ri, rj);
      case3 = max over i+1 ≤ k ≤ j-4 of (1 + Mi,k-1 + Mk+1,j-1) × ρ(rk, rj);
      Mi,j = max{case1, case2, case3};
    end for
  end for

In the following, let us illustrate the whole procedure in detail for the sequence

r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 = A G G C C U U C C U.

(1) i = 1, j = 5, $\rho(r_1, r_5) = \rho(A, C) = 0$:
$M_{1,5} = \max\{M_{1,4},\ (1 + M_{2,4}) \times \rho(r_1, r_5)\} = \max\{0,\ (1+0) \times 0\} = 0$.

(2) i = 2, j = 6, $\rho(r_2, r_6) = \rho(G, U) = 1$:
$M_{2,6} = \max\{M_{2,5},\ (1 + M_{3,5}) \times \rho(r_2, r_6)\} = \max\{0,\ (1+0) \times 1\} = 1$.
r2 matches with r6.

(3) i = 3, j = 7, $\rho(r_3, r_7) = \rho(G, U) = 1$:
$M_{3,7} = \max\{M_{3,6},\ (1 + M_{4,6}) \times \rho(r_3, r_7)\} = \max\{0,\ (1+0) \times 1\} = 1$.
r3 matches with r7.

(4) i = 4, j = 8, $\rho(r_4, r_8) = \rho(C, C) = 0$:
$M_{4,8} = \max\{M_{4,7},\ (1 + M_{5,7}) \times \rho(r_4, r_8)\} = \max\{0,\ (1+0) \times 0\} = 0$.

(5) i = 5, j = 9, $\rho(r_5, r_9) = \rho(C, C) = 0$:
$M_{5,9} = \max\{M_{5,8},\ (1 + M_{6,8}) \times \rho(r_5, r_9)\} = \max\{0,\ (1+0) \times 0\} = 0$.

(6) i = 6, j = 10, $\rho(r_6, r_{10}) = \rho(U, U) = 0$:
$M_{6,10} = \max\{M_{6,9},\ (1 + M_{7,9}) \times \rho(r_6, r_{10})\} = \max\{0,\ (1+0) \times 0\} = 0$.

(7) i = 1, j = 6, $\rho(r_1, r_6) = \rho(A, U) = 1$:
$M_{1,6} = \max\{M_{1,5},\ (1 + M_{2,5}) \times \rho(r_1, r_6),\ (1 + M_{1,1} + M_{3,5}) \times \rho(r_2, r_6)\} = \max\{0,\ (1+0) \times 1,\ (1+0+0) \times 1\} = \max\{0, 1, 1\} = 1$.
r1 matches with r6.
(8) i = 2, j = 7, $\rho(r_2, r_7) = \rho(G, U) = 1$:
$M_{2,7} = \max\{M_{2,6},\ (1 + M_{3,6}) \times \rho(r_2, r_7),\ (1 + M_{2,2} + M_{4,6}) \times \rho(r_3, r_7)\} = \max\{1,\ (1+0) \times 1,\ (1+0+0) \times 1\} = \max\{1, 1, 1\} = 1$.

(9) i = 3, j = 8, $\rho(r_3, r_8) = \rho(G, C) = 1$:
$M_{3,8} = \max\{M_{3,7},\ (1 + M_{4,7}) \times \rho(r_3, r_8),\ (1 + M_{3,3} + M_{5,7}) \times \rho(r_4, r_8)\} = \max\{1,\ (1+0) \times 1,\ (1+0+0) \times 0\} = \max\{1, 1, 0\} = 1$.
r3 matches with r8.

(10) i = 4, j = 9, $\rho(r_4, r_9) = \rho(C, C) = 0$:
$M_{4,9} = \max\{M_{4,8},\ (1 + M_{5,8}) \times \rho(r_4, r_9),\ (1 + M_{4,4} + M_{6,8}) \times \rho(r_5, r_9)\} = \max\{0,\ (1+0) \times 0,\ (1+0+0) \times 0\} = \max\{0, 0, 0\} = 0$.

(11) i = 5, j = 10, $\rho(r_5, r_{10}) = \rho(C, U) = 0$:
$M_{5,10} = \max\{M_{5,9},\ (1 + M_{6,9}) \times \rho(r_5, r_{10}),\ (1 + M_{5,5} + M_{7,9}) \times \rho(r_6, r_{10})\} = \max\{0,\ (1+0) \times 0,\ (1+0+0) \times 0\} = \max\{0, 0, 0\} = 0$.

(12) i = 1, j = 7, $\rho(r_1, r_7) = \rho(A, U) = 1$:
$M_{1,7} = \max\{M_{1,6},\ (1 + M_{2,6}) \times \rho(r_1, r_7),\ (1 + M_{1,1} + M_{3,6}) \times \rho(r_2, r_7),\ (1 + M_{1,2} + M_{4,6}) \times \rho(r_3, r_7)\} = \max\{1,\ (1+1) \times 1,\ (1+0+0) \times 1,\ (1+0+0) \times 1\} = \max\{1, 2, 1, 1\} = 2$.

(13) i = 2, j = 8, $\rho(r_2, r_8) = \rho(G, C) = 1$:
$M_{2,8} = \max\{M_{2,7},\ (1 + M_{3,7}) \times \rho(r_2, r_8),\ (1 + M_{2,2} + M_{4,7}) \times \rho(r_3, r_8),\ (1 + M_{2,3} + M_{5,7}) \times \rho(r_4, r_8)\} = \max\{1,\ (1+1) \times 1,\ (1+0+0) \times 1,\ (1+0+0) \times 0\} = \max\{1, 2, 1, 0\} = 2$.
r2 matches with r8; r3 matches with r7.

(14) i = 3, j = 9, $\rho(r_3, r_9) = \rho(G, C) = 1$:
$M_{3,9} = \max\{M_{3,8},\ (1 + M_{4,8}) \times \rho(r_3, r_9),\ (1 + M_{3,3} + M_{5,8}) \times \rho(r_4, r_9),\ (1 + M_{3,4} + M_{6,8}) \times \rho(r_5, r_9)\} = \max\{1,\ (1+0) \times 1,\ (1+0+0) \times 0,\ (1+0+0) \times 0\} = \max\{1, 1, 0, 0\} = 1$.
r3 matches with r9.

(15) i = 4, j = 10, $\rho(r_4, r_{10}) = \rho(C, U) = 0$:
$M_{4,10} = \max\{M_{4,9},\ (1 + M_{5,9}) \times \rho(r_4, r_{10}),\ (1 + M_{4,4} + M_{6,9}) \times \rho(r_5, r_{10}),\ (1 + M_{4,5} + M_{7,9}) \times \rho(r_6, r_{10})\} = \max\{0,\ (1+0) \times 0,\ (1+0+0) \times 0,\ (1+0+0) \times 0\} = \max\{0, 0, 0, 0\} = 0$.
(16) i = 1, j = 8, $\rho(r_1, r_8) = \rho(A, C) = 0$:
$M_{1,8} = \max\{M_{1,7},\ (1 + M_{2,7}) \times \rho(r_1, r_8),\ (1 + M_{1,1} + M_{3,7}) \times \rho(r_2, r_8),\ (1 + M_{1,2} + M_{4,7}) \times \rho(r_3, r_8),\ (1 + M_{1,3} + M_{5,7}) \times \rho(r_4, r_8)\} = \max\{2,\ (1+1) \times 0,\ (1+0+1) \times 1,\ (1+0+0) \times 1,\ (1+0+0) \times 0\} = \max\{2, 0, 2, 1, 0\} = 2$.
r1 matches with r7; r2 matches with r6.

(17) i = 2, j = 9, $\rho(r_2, r_9) = \rho(G, C) = 1$:
$M_{2,9} = \max\{M_{2,8},\ (1 + M_{3,8}) \times \rho(r_2, r_9),\ (1 + M_{2,2} + M_{4,8}) \times \rho(r_3, r_9),\ (1 + M_{2,3} + M_{5,8}) \times \rho(r_4, r_9),\ (1 + M_{2,4} + M_{6,8}) \times \rho(r_5, r_9)\} = \max\{2,\ (1+1) \times 1,\ (1+0+0) \times 1,\ (1+0+0) \times 0,\ (1+0+0) \times 0\} = \max\{2, 2, 1, 0, 0\} = 2$.
r2 matches with r9; r3 matches with r8.

(18) i = 3, j = 10, $\rho(r_3, r_{10}) = \rho(G, U) = 1$:
$M_{3,10} = \max\{M_{3,9},\ (1 + M_{4,9}) \times \rho(r_3, r_{10}),\ (1 + M_{3,3} + M_{5,9}) \times \rho(r_4, r_{10}),\ (1 + M_{3,4} + M_{6,9}) \times \rho(r_5, r_{10}),\ (1 + M_{3,5} + M_{7,9}) \times \rho(r_6, r_{10})\} = \max\{1,\ (1+0) \times 1,\ (1+0+0) \times 0,\ (1+0+0) \times 0,\ (1+0+0) \times 0\} = \max\{1, 1, 0, 0, 0\} = 1$.
r3 matches with r10.

(19) i = 1, j = 9, $\rho(r_1, r_9) = \rho(A, C) = 0$:
$M_{1,9} = \max\{M_{1,8},\ (1 + M_{2,8}) \times \rho(r_1, r_9),\ (1 + M_{1,1} + M_{3,8}) \times \rho(r_2, r_9),\ (1 + M_{1,2} + M_{4,8}) \times \rho(r_3, r_9),\ (1 + M_{1,3} + M_{5,8}) \times \rho(r_4, r_9),\ (1 + M_{1,4} + M_{6,8}) \times \rho(r_5, r_9)\} = \max\{2,\ (1+2) \times 0,\ (1+0+1) \times 1,\ (1+0+0) \times 1,\ (1+0+0) \times 0,\ (1+0+0) \times 0\} = \max\{2, 0, 2, 1, 0, 0\} = 2$.
r1 matches with r7; r2 matches with r6.

(20) i = 2, j = 10, $\rho(r_2, r_{10}) = \rho(G, U) = 1$:
$M_{2,10} = \max\{M_{2,9},\ (1 + M_{3,9}) \times \rho(r_2, r_{10}),\ (1 + M_{2,2} + M_{4,9}) \times \rho(r_3, r_{10}),\ (1 + M_{2,3} + M_{5,9}) \times \rho(r_4, r_{10}),\ (1 + M_{2,4} + M_{6,9}) \times \rho(r_5, r_{10}),\ (1 + M_{2,5} + M_{7,9}) \times \rho(r_6, r_{10})\} = \max\{2, 2, 1, 0, 0, 0\} = 2$.
r2 matches with r10; r3 matches with r9.

(21) i = 1, j = 10, $\rho(r_1, r_{10}) = \rho(A, U) = 1$:
$M_{1,10} = \max\{M_{1,9},\ (1 + M_{2,9}) \times \rho(r_1, r_{10}),\ (1 + M_{1,1} + M_{3,9}) \times \rho(r_2, r_{10}),\ (1 + M_{1,2} + M_{4,9}) \times \rho(r_3, r_{10}),\ (1 + M_{1,3} + M_{5,9}) \times \rho(r_4, r_{10}),\ (1 + M_{1,4} + M_{6,9}) \times \rho(r_5, r_{10}),\ (1 + M_{1,5} + M_{7,9}) \times \rho(r_6, r_{10})\} = \max\{2, 3, 2, 1, 0, 0, 0\} = 3$.
r1 matches with r10; r2 matches with r9; r3 matches with r8.

In the following, we analyze the time-complexity of Algorithm 6.1. Step 1 is a double loop over i and j and costs $O(n^2)$ time, and Step 2 costs $O(n)$ time. For Step 3, there are at most $\sum_{i=1}^{n} \sum_{j=i+4}^{n} 1$ iterations and each iteration costs $O(j - i)$ time. Hence, the cost of Step 3 is

\[
\sum_{i=1}^{n} \sum_{j=i+4}^{n} O(j - i) = O(n^3).
\]

Hence, the total time-complexity of Algorithm 6.1 is $O(n^3)$.

6.3 Loop Dependent Free Energy Rules

Let us consider Figure 6.2. The secondary structure in Figure 6.2 (c) is similar to that in Figure 6.2 (f). Yet the structure in Figure 6.2 (f) is much better than that in Figure 6.2 (c) because there is a base, namely base C, inside the loop A–G–C–C–U of Figure 6.2 (c) which is not paired with any other base. On the other hand, there is no such unpaired base in the loop A–G–C–U of Figure 6.2 (f). This is shown in Figure 6.7.

Figure 6.7: The Loop Marked in Structure (b) Is More Stable Than That in Structure (a)

When two base pairs are consecutive, we say that there is a stacking interaction between them, and this interaction increases the stability of the secondary structure. We therefore prefer loops with stacking interactions.

Suppose that $(r_i, r_j)$ is a base pair in S. A base $r_p$, where i < p < j, is accessible from $(r_i, r_j)$ if there is no other base pair $(r_{i'}, r_{j'})$ in S such that i < i' < p < j' < j. Similarly, we say that the base pair $(r_p, r_q)$ is accessible from $(r_i, r_j)$ if both $r_p$ and $r_q$ are accessible from $(r_i, r_j)$.
For instance, in Figure 6.8, bases r3 and r8 are accessible from base pair (r2, r9) (i.e., base pair (r3, r8) is accessible from (r2, r9)), but other bases are not accessible from (r2, r9). Given a secondary structure, a loop is a substructure consisting of a base pair $(r_i, r_j)$ and all bases accessible from $(r_i, r_j)$. It can be seen that every base pair $(r_i, r_j)$ corresponds to a loop; $(r_i, r_j)$ is called the exterior (or closing) base pair of this loop, and the other base pairs accessible from $(r_i, r_j)$ are called interior base pairs of this loop. The size and degree of a loop are defined to be the number of unpaired bases and the number of base pairs in the loop, respectively. For example, the secondary structure of Figure 6.8 contains the following three loops.

Loop 1: {r1, r2, r9, r10} (i.e., A–G–C–U),
Loop 2: {r2, r3, r8, r9} (i.e., G–G–C–C),
Loop 3: {r3, r4, r5, r6, r7, r8} (i.e., G–C–C–U–U–C),

where the exterior base pair, interior base pair, size and degree of each loop are listed in Table 6.2.

Figure 6.8: A Secondary Structure with Three Loops

Table 6.2: The Exterior BP (Base Pair), Interior BP, Size and Degree of Loops 1, 2 and 3 in Figure 6.8

According to the degree and size, the loops can be further distinguished as one of the following different types.

(1) Hairpin loop: A loop of degree 1 is called a hairpin loop. See Figure 6.9 (a).
(2) Stacked pair: A loop of degree 2 is called a stacked pair if its size is zero. See Figure 6.9 (b).
(3) Bulge loop: A loop of degree 2 and non-zero size is called a bulge loop if its exterior and interior base pairs are adjacent. See Figure 6.9 (c).
(4) Interior loop: A loop of degree 2 and non-zero size is called an interior loop if its exterior and interior base pairs are not adjacent. See Figure 6.9 (d).
(5) Multiloop: A loop of degree greater than 2 is called a multiloop. See Figure 6.9 (e).
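The definitions of accessibility, size, degree and loop type above can be turned into a short Python sketch. It is only an illustration under stated assumptions: base pairs are 1-based index tuples, the structure contains no pseudoknot, and the names `decompose_loops` and `classify_loop` are hypothetical. Exterior loops are not reported.

```python
def classify_loop(exterior, interior, size):
    """Name the loop type from its degree, size and pair adjacency."""
    degree = 1 + len(interior)          # exterior pair plus interior pairs
    if degree == 1:
        return "hairpin loop"
    if degree > 2:
        return "multiloop"
    if size == 0:
        return "stacked pair"
    (i, j), (p, q) = exterior, interior[0]
    # The two pairs are adjacent if they touch on one side.
    return "bulge loop" if p == i + 1 or q == j - 1 else "interior loop"


def decompose_loops(pairs):
    """Return one loop record per base pair of a pseudoknot-free structure."""
    partner = {}
    for i, j in pairs:
        partner[i], partner[j] = j, i
    loops = []
    for i, j in sorted(pairs):
        interior, size, p = [], 0, i + 1
        while p < j:
            q = partner.get(p)
            if q is None:               # accessible unpaired base
                size += 1
                p += 1
            else:                       # accessible interior pair (p, q)
                interior.append((p, q))
                p = q + 1               # bases inside (p, q) are not accessible
        loops.append({
            "exterior": (i, j),
            "interior": interior,
            "size": size,
            "degree": 1 + len(interior),
            "type": classify_loop((i, j), interior, size),
        })
    return loops


# The structure with base pairs (r1, r10), (r2, r9), (r3, r8).
for loop in decompose_loops([(1, 10), (2, 9), (3, 8)]):
    print(loop)
```

Scanning from i + 1 and jumping over each interior pair visits exactly the bases accessible from (ri, rj), so each loop is found in time linear in its span.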
In a secondary structure, the collection of adjacent unpaired bases which are not accessible from any base pair is called an external element (see Figure 6.10). For convenience, we can view the external element as a special type of loop, called an exterior loop. It is not hard to see that any secondary structure S can be uniquely decomposed into loops (see Figure 6.10). If we assign an energy to each loop in S, then the free energy of S is assumed to be the sum of the energies of all loops. Since the energy of a folded structure is measured relative to the unfolded sequence, exterior loops do not contribute any energy. Hence, we assume that the energies of exterior loops are zero. The energies of other loops depend only on the size and type of the loop and are usually determined by experiment.

Figure 6.9: Various Types of Loops (with Exterior Base Pair (ri, rj)): (a) A Hairpin Loop of Degree 1 and Size 4. (b) A Stacked Pair of Degree 2 and Size 0. (c) A Bulge Loop of Degree 2 and Size 4. (d) An Interior Loop of Degree 2 and Size 8. (e) A Multiloop of Degree 4 and Size 12.

Figure 6.10: The Secondary Structure of an RNA Can Be Uniquely Decomposed into Loops

We use the following notation to denote the energies of various loops.

(1) Hairpin loops: We use H(k) to denote the energy of a hairpin loop with size k.
(2) Stacked pairs: We use S to denote the energy of a stacked pair.
(3) Bulge loops: We use B(k) to denote the energy of a bulge loop with size k.
(4) Interior loops: We use I(k) to denote the energy of an interior loop with size k.
(5) Multiloops: We use M to denote the energy of a multiloop, which is usually expressed by the following affine penalty function.

\[
M = M_E + M_I \times (\mathit{degree} - 1) + M_B \times \mathit{size}
\]

where $M_E$, $M_I$ and $M_B$ are constants, and degree and size are the degree and size of the loop, respectively.
On the right-hand side of the above equation,
• the first term $M_E$ denotes the stabilizing contribution from the exterior base pair,
• the second term $M_I \times (\mathit{degree} - 1)$ denotes the stabilizing contributions from the interior base pairs (there are (degree - 1) interior base pairs), and
• the last term $M_B \times \mathit{size}$ denotes the destabilizing contributions from the unpaired bases (there are size unpaired bases).

In the next section, we shall introduce a dynamic programming algorithm to find a secondary structure with the minimum free energy.

6.4 Minimum Free Energy Algorithm

In this section, we adopt the loop dependent free energy rules to describe the secondary structure prediction problem. In this model, any secondary structure can be uniquely decomposed into loops and the free energy of the structure is the sum of the energies of all loops. The problem is then to find an optimal secondary structure (i.e., a secondary structure with the minimum free energy). In the following, we shall introduce a dynamic programming algorithm to solve this problem.

Recall that the admissible base pairs we consider here are Watson-Crick base pairs (i.e., G≡C and A=U) and wobble base pairs (i.e., G–U), and we use a function $\rho(r_i, r_j)$ to indicate whether two bases $r_i$ and $r_j$ can form a legal base pair:

\[
\rho(r_i, r_j) =
\begin{cases}
1 & \text{if } (r_i, r_j) \in WW \\
+\infty & \text{otherwise}
\end{cases}
\]

where WW = {(A, U), (U, A), (G, C), (C, G), (G, U), (U, G)}.

Let $S_{i,j}$ denote the optimal structure of the substring $R_{i,j} = r_i r_{i+1} \cdots r_j$. We use $E_{i,j}$ to denote the free energy of $S_{i,j}$. To compute $E_{i,j}$, we consider the following three cases from the point of view of $r_j$.

Case 1: In the optimal solution, $r_j$ is not paired with any other base. Then we have $E_{i,j} = E_{i,j-1}$.

Case 2: In the optimal solution, $r_j$ is paired with $r_i$ and $\rho(r_i, r_j) = 1$. Then there may be one or more loops between $r_i$ and $r_j$. For simplicity, we use $L_{i,j}$ to denote the structure with the minimum free energy in this case and use $F_{i,j}$ to denote the free energy of $L_{i,j}$.
(The calculation of $F_{i,j}$ will be described later.) Then we have $E_{i,j} = F_{i,j}$.

Case 3: In the optimal solution, $r_j$ is paired with some $r_k$, where $i + 1 \le k \le j - 4$, and $\rho(r_k, r_j) = 1$. In this case, we can divide $R_{i,j}$ into two subsequences $R_{i,k-1}$ and $R_{k,j}$ such that

\[
E_{i,j} = E_{i,k-1} + F_{k,j}.
\]

Since we want to find the k between i + 1 and j - 4 such that $E_{i,j}$ is the minimum, we have

\[
E_{i,j} = \min_{i+1 \le k \le j-4} \{ E_{i,k-1} + F_{k,j} \}.
\]

In summary, we have the following recursive formula to compute $E_{i,j}$:

\[
E_{i,j} = \min
\begin{cases}
E_{i,j-1} \\
F_{i,j} \times \rho(r_i, r_j) \\
\min\limits_{i+1 \le k \le j-4} \{ (E_{i,k-1} + F_{k,j}) \times \rho(r_k, r_j) \}
\end{cases}
\]

Here a term multiplied by $\rho = +\infty$ is to be read as $+\infty$, i.e., an inadmissible pair is never selected by the minimization.

By definition, $r_i$ and $r_j$ cannot form a base pair if $j - i \le t = 3$, since $R_{i,j}$ does not fold on itself too sharply. Hence, we set the boundary conditions of the functions E and F as follows.

\[
E_{i,j} = F_{i,j} = +\infty \quad \text{if } j - i \le 3.
\]

Next, we explain how to compute $F_{i,j}$ in detail. Since $(r_i, r_j)$ is a base pair in $L_{i,j}$, $(r_i, r_j)$ must be the exterior base pair of some loop, say L. According to the loop type of L, we consider the following five cases.

Case 1: L is a hairpin loop. Since the size of L is j - i - 1, we have $F_{i,j} = H(j - i - 1)$.

Figure 6.11: Illustration of Case 1

Case 2: L is a stacked pair. Then we have $F_{i,j} = S + F_{i+1,j-1}$.

Figure 6.12: Illustration of Case 2

Case 3: L is a bulge loop. Let $(r_p, r_q)$ be the interior base pair of L. By definition, $(r_i, r_j)$ and $(r_p, r_q)$ are adjacent with either p = i + 1 or q = j - 1 (but not both).

(1) Suppose that p = i + 1 and q ≠ j - 1. Then the range of q is [p + 4, j - 2] = [i + 5, j - 2] and the size of L is j - q - 1. Hence, we have

\[
F_{i,j} = \min_{i+5 \le q \le j-2} \{ B(j - q - 1) + F_{i+1,q} \}.
\]

(2) Suppose that q = j - 1 and p ≠ i + 1. Then the range of p is [i + 2, q - 4] = [i + 2, j - 5] and the size of L is p - i - 1.
Hence, we have

\[
F_{i,j} = \min_{i+2 \le p \le j-5} \{ B(p - i - 1) + F_{p,j-1} \}.
\]

Figure 6.13: Illustration of Case 3

As a result, we have

\[
F_{i,j} = \min
\begin{cases}
\min\limits_{i+5 \le q \le j-2} \{ B(j - q - 1) + F_{i+1,q} \} \\
\min\limits_{i+2 \le p \le j-5} \{ B(p - i - 1) + F_{p,j-1} \}
\end{cases}
\]

Case 4: L is an interior loop. Let $(r_p, r_q)$ be the interior base pair of L. Then we have $i + 1 \le p$, $p + 3 < q \le j - 1$, and the size of L is p - i + j - q - 2. Since, by definition, $(r_i, r_j)$ and $(r_p, r_q)$ are not adjacent, we have $p - i + j - q \ge 4$. Hence, we have

\[
F_{i,j} = \min_{\substack{i+1 \le p < q-3,\ q \le j-1 \\ p - i + j - q \ge 4}} \{ I(p - i + j - q - 2) + F_{p,q} \}.
\]

Figure 6.14: Illustration of Case 4

Case 5: L is a multiloop. Suppose that $(r_p, r_q)$ is the rightmost interior base pair of L (see Figure 6.15). In this case, we can represent the minimum free energy of L as follows.

\[
F_{i,j} = \min_{i < p < j} \{ M_E + g^1_{p,j-1} + g^2_{i+1,p-1} \}
\]

where
• $g^1_{p,j-1}$ is the minimum free energy of the substructure $L_{p,q}$ plus the energies contributed from the interior base pair $(r_p, r_q)$ and the unpaired bases $r_{q+1} r_{q+2} \cdots r_{j-1}$. That is,

\[
g^1_{p,j-1} = \min_{p < q < j} \{ F_{p,q} + M_I + M_B \times (j - q - 1) \}
\]

• $g^2_{i+1,p-1}$ is the minimum free energy of the remaining section of L (the calculation of $g^2_{i+1,p-1}$ will be described later).

Figure 6.15: Illustration of Case 5: (rp, rq) Is the Rightmost Interior Base Pair of Multiloop

In the following, we will discuss the calculation of $g^2_{i+1,p-1}$, which is the minimum free energy of the remaining section L' of L. This section L' may contain one or more loops.

Case 1: Suppose that L' contains only one loop. Then $g^2_{i+1,p-1}$ is equal to $g^1_{k,p-1}$ plus the energies contributed from the unpaired bases $r_{i+1} r_{i+2} \cdots r_{k-1}$. That is,

\[
g^2_{i+1,p-1} = \min_{i < k < p} \{ g^1_{k,p-1} + M_B \times (k - i - 1) \}
\]

Figure 6.16: Illustration of Case 1

Case 2: Suppose that L' contains two or more loops.
Then we have

\[
g^2_{i+1,p-1} = \min_{i < k < p} \{ g^1_{k,p-1} + g^2_{i+1,k-1} \}
\]

Figure 6.17: Illustration of Case 2

In summary, we have

\[
g^2_{i+1,p-1} = \min
\begin{cases}
\min\limits_{i < k < p} \{ g^1_{k,p-1} + M_B \times (k - i - 1) \} \\
\min\limits_{i < k < p} \{ g^1_{k,p-1} + g^2_{i+1,k-1} \}
\end{cases}
\]

According to the discussion above, we have the following recursive formula to compute $F_{i,j}$.

• If $j - i \le 3$, then $F_{i,j} = +\infty$.
• If $j - i > 3$, then

\[
F_{i,j} = \min
\begin{cases}
H(j - i - 1) \\
S + F_{i+1,j-1} \\
\min\limits_{i+5 \le q \le j-2} \{ B(j - q - 1) + F_{i+1,q} \} \\
\min\limits_{i+2 \le p \le j-5} \{ B(p - i - 1) + F_{p,j-1} \} \\
\min\limits_{\substack{i+1 \le p < q-3,\ q \le j-1 \\ p - i + j - q \ge 4}} \{ I(p - i + j - q - 2) + F_{p,q} \} \\
\min\limits_{i < p < j} \{ M_E + g^1_{p,j-1} + g^2_{i+1,p-1} \}
\end{cases}
\]

where

\[
g^1_{p,j-1} = \min_{p < q < j} \{ F_{p,q} + M_I + M_B \times (j - q - 1) \}
\]

\[
g^2_{i+1,p-1} = \min
\begin{cases}
\min\limits_{i < k < p} \{ g^1_{k,p-1} + M_B \times (k - i - 1) \} \\
\min\limits_{i < k < p} \{ g^1_{k,p-1} + g^2_{i+1,k-1} \}
\end{cases}
\]

By the above recursive formula, we can compute all $F_{i,j}$, $1 \le i < j \le n$, in $O(n^4)$ time using the dynamic programming technique. Here, we assume that the values of $F_{i,j}$ have been computed in advance and design Algorithm 6.2 to compute the minimum free energy $E_{1,n}$ of an RNA sequence $R_{1,n} = r_1 r_2 \cdots r_n$.

We analyze the time-complexity of Algorithm 6.2 as follows. Step 1 is a double loop and costs $O(n^2)$ time, and Step 2 costs $O(n)$ time. For Step 3, there are at most $\sum_{i=1}^{n} \sum_{j=i+4}^{n} 1$ iterations and each iteration costs $O(j - i)$ time. Hence, the cost of Step 3 is

\[
\sum_{i=1}^{n} \sum_{j=i+4}^{n} O(j - i) = O(n^3).
\]

However, the preprocessing of $F_{i,j}$ costs $O(n^4)$ time. Hence, the total time-complexity of Algorithm 6.2 is $O(n^4)$.

Algorithm 6.2 An RNA minimum free energy algorithm
Input: An RNA sequence $R = r_1 r_2 \cdots r_n$.
Output: The minimum free energy $E_{1,n}$ of sequence R.
Step 1: /* Computation of the ρ(ri, rj) function for 1 ≤ i < j ≤ n */
  WW = {(A, U), (U, A), (G, C), (C, G), (G, U), (U, G)};
  for i = 1 to n do
    for j = i to n do
      if (ri, rj) ∈ WW then ρ(ri, rj) = 1; else ρ(ri, rj) = +∞;
    end for
  end for
Step 2: /* Initialization of Ei,j for j - i ≤ 3 */
  for i = 1 to n do
    for j = i to i + 3 do
      if j ≤ n then Ei,j = +∞;
    end for
  end for
Step 3: /* Calculation of Ei,j for j - i > 3 */
  for i = n - 4 downto 1 do
    for j = i + 4 to n do
      case1 = Ei,j-1;
      case2 = Fi,j × ρ(ri, rj);
      case3 = min over i+1 ≤ k ≤ j-4 of (Ei,k-1 + Fk,j) × ρ(rk, rj);
      Ei,j = min{case1, case2, case3};
    end for
  end for

6.5 The RNA Secondary Structure Prediction with Simple Pseudoknots

In the previous sections, we discussed RNA secondary structure prediction. The model used there is called the RNA secondary structure without pseudoknots. In this model, given a sequence of bases $a_1 a_2 \cdots a_n$, if $(a_i, a_j)$ is a base pair and $(a_h, a_k)$ is another base pair with i < h, then either i < h < k < j (the second pair is nested inside the first) or i < j < h < k (the first pair precedes the second), as in Section 6.1. An example is illustrated in Fig. 6.18.

Fig. 6.18. An RNA Secondary Structure without Pseudoknots and Its Schematic Diagram. (a) The structure. (b) The schematic diagram.

Fig. 6.19 shows an RNA secondary structure with pseudoknots. In this model, suppose that $(a_i, a_j)$ and $(a_h, a_k)$ are two base pairs in an RNA secondary structure. Then it is possible that i < h < j < k, as illustrated in Fig. 6.19.

Fig. 6.19. An RNA Secondary Structure with Simple Pseudoknots and Its Schematic Diagram. (a) The structure. (b) The schematic diagram.

The model in Fig. 6.19 is actually not general enough. Fig. 6.20 shows a more general case for pseudoknots. The model in Fig. 6.19 is called the simple pseudoknot model while the model in Fig. 6.20 is a recursive pseudoknot model. We shall only discuss the simple pseudoknot model in this book.

Fig. 6.20.
An RNA Secondary Structure with Recursive Pseudoknots and Its Schematic Diagram. (a) The structure. (b) The schematic diagram.

If we compare the three models, we may draw the following conclusions:

1. The RNA itself is considered as a linear sequence if its secondary structure is not considered.
2. If the RNA secondary structure is considered, there is at most one turn in the model without pseudoknots. That is, the terminal base may turn back to the starting base. See Fig. 6.18.
3. In the RNA secondary structure with simple pseudoknots, there are at most two turns. That is, the sequence turns and turns again. See Fig. 6.19.
4. In the RNA secondary structure with recursive pseudoknots, there may be more than two turns. See Fig. 6.20.

Formal Definition of Simple Pseudoknots

Consider an RNA sequence $a_1 a_2 \cdots a_n$. This sequence has a secondary structure with simple pseudoknots if there exist $j_1$ and $j_2$ such that the following conditions are satisfied:

1. Each base pair $(a_i, a_j)$ must be such that either $1 \le i < j_1 \le j < j_2$ or $j_1 \le i < j_2 \le j \le n$.
2. If base pairs $(a_i, a_j)$ and $(a_{i'}, a_{j'})$ are such that $i < i' < j_1$ or $j_1 \le i < i'$, then $j > j'$.

Conditions 1 and 2 mentioned above are illustrated in Fig. 6.21 and Fig. 6.22 respectively.

Fig. 6.21 Condition 1 of the Definition of Simple Pseudoknots. (a) Positions in the order 1, i, j1, j, j2, n. (b) Positions in the order 1, j1, i, j2, j, n.

Fig. 6.22 Condition 2 of the Definition of Simple Pseudoknots. (a) Positions in the order 1, i, i', j1, j', j, j2, n. (b) Positions in the order 1, j1, i, i', j2, j', j, n.

To give the reader some feeling for the above discussion, let us consider the following RNA sequence:

S: CUUCAUCAGGAAAUGAC

A secondary structure of the above sequence without pseudoknots is shown in Fig. 6.23 and that with simple pseudoknots is shown in Fig. 6.24. In Fig. 6.24, we have also indicated the positions of $j_1$ and $j_2$.

Fig.
6.23 Secondary Structure of Sequence S without Pseudoknots

Fig. 6.24 A Secondary Structure of Sequence S with Simple Pseudoknots (the positions of j1 and j2 are marked)

It can be seen from the above two figures that, in this case, the secondary structure with simple pseudoknots contains more base pairs than that without pseudoknots. Of course, given an RNA sequence, there may be many different secondary structures corresponding to this sequence. An optimal RNA secondary structure with simple pseudoknots is a secondary structure which contains the maximum number of base pairs among all secondary structures with simple pseudoknots. The RNA secondary structure with simple pseudoknots problem is thus defined as follows: Given an RNA sequence, find an optimal secondary structure with simple pseudoknots for this sequence.

The above problem can be solved in polynomial time by a dynamic programming algorithm, which will be introduced below. Given an RNA sequence $S = a_1 a_2 \cdots a_n$, let us consider three bases $a_i$, $a_j$ and $a_k$ where $1 \le i < j \le k \le n$. There are three cases:

Case 1. $(a_i, a_j)$ is a base pair. Let L(i, j, k) denote the number of base pairs of an optimal secondary structure with simple pseudoknots of S from $a_i$ to $a_k$ corresponding to Case 1.

Case 2. Neither $(a_i, a_j)$ nor $(a_j, a_k)$ is a base pair. Let M(i, j, k) denote the number of base pairs of an optimal secondary structure of S with simple pseudoknots from $a_i$ to $a_k$ corresponding to Case 2.

Case 3. $(a_j, a_k)$ is a base pair. Let R(i, j, k) denote the number of base pairs of an optimal secondary structure of S with simple pseudoknots from $a_i$ to $a_k$ corresponding to Case 3.

It is worth pausing here to think back to the model without pseudoknots. In that case, when we use the dynamic programming approach, we always start by asking whether $(a_i, a_k)$ is a base pair. In the model with simple pseudoknots, we do not consider whether $(a_i, a_k)$ is a base pair.
Instead, we consider whether $(a_i, a_j)$ is a base pair and whether $(a_j, a_k)$ is a base pair. In the following, let $v(a_i, a_j) = 1$ if $(a_i, a_j)$ is a base pair; otherwise, let $v(a_i, a_j) = -\infty$.

Consider Case 1. We may reason as follows:

1. Since $(a_i, a_j)$ is a base pair, we may consider the optimal secondary structure with simple pseudoknots where $(a_{i-1}, a_{j+1})$ is a base pair, as shown in Fig. 6.25(a).
2. Since $(a_i, a_j)$ is a base pair, we may consider the optimal secondary structure with simple pseudoknots where neither $(a_{i-1}, a_{j+1})$ nor $(a_{j+1}, a_k)$ is a base pair, as shown in Fig. 6.25(b).
3. Since $(a_i, a_j)$ is a base pair, we may consider the optimal secondary structure with simple pseudoknots where $(a_{j+1}, a_k)$ is a base pair, as shown in Fig. 6.25(c).

Fig. 6.25 Recursive Formulas for L(i,j,k), panels (a), (b) and (c)

The above discussion gives the following recursive formula:

$$L(i, j, k) = v(a_i, a_j) + \max \left\{ \begin{array}{l} L(i-1, j+1, k) \\ M(i-1, j+1, k) \\ R(i-1, j+1, k) \end{array} \right\} \tag{1}$$

An illustration of the equation for L(i,j,k) is shown in Fig. 6.25. Using similar reasoning, we have

$$R(i, j, k) = v(a_j, a_k) + \max \left\{ \begin{array}{l} L(i, j+1, k-1) \\ M(i, j+1, k-1) \\ R(i, j+1, k-1) \end{array} \right\} \tag{2}$$

The above equation is illustrated in Fig. 6.26.

Fig. 6.26 Recursive Formulas for R(i,j,k), panels (a), (b) and (c)

Finally, we have the following recursive formula for M(i,j,k):

$$M(i, j, k) = \max \left\{ \begin{array}{l} M(i-1, j, k),\ M(i, j+1, k),\ M(i, j, k-1) \\ L(i-1, j, k),\ L(i, j+1, k) \\ R(i, j+1, k),\ R(i, j, k-1) \end{array} \right\} \tag{3}$$

We need some formulas for the boundary conditions:

1. $L(i, j, j) = v(a_i, a_j)$ for all $i < j$. (4)
2. $L(i_0 - 1, j, k) = R(i_0 - 1, j, k) = M(i_0 - 1, j, k) = 0$ if $k = j$ or $k = j + 1$. (5)

Let $A(i_0, k_0)$ denote the number of base pairs in an optimal RNA secondary structure with simple pseudoknots from $a_{i_0}$ to $a_{k_0}$. Clearly,

$$A(i_0, k_0) = \max_{i_0 \le i < j \le k \le k_0} \{ L(i, j, k),\ M(i, j, k),\ R(i, j, k) \} \tag{6}$$

Given an RNA sequence $S = a_1 a_2 \cdots a_n$, an optimal secondary structure of S can be determined by using Equation (6). In the following, we give an algorithm which produces an optimal RNA secondary structure.

Algorithm 6.3: An algorithm which produces an optimal RNA secondary structure with simple pseudoknots.

Input: An RNA sequence S = a_1, a_2, ..., a_n.
Output: An optimal secondary structure of RNA with simple pseudoknots.

    Step 1:
        WW = {(A,U), (U,A), (G,C), (C,G), (G,U), (U,G)};
        for i = 1 to n do
            for j = i+2 to n do
                if (a_i, a_j) ∈ WW then v(a_i, a_j) = 1;
                else v(a_i, a_j) = 0;
            end for
        end for

    Step 2:
        for i = 1 to n-2 do
            for j = i+1 to n-1 do
                for k = j+1 to n do
                    case1 = L(i, j, k);
                    case2 = R(i, j, k);
                    case3 = M(i, j, k);
                    A[i,k] = max{case1, case2, case3};
                    Tracepair[i,k] = max{Lpair[i,j,k], Mpair[i,j,k], Rpair[i,j,k]};
                end for
            end for
        end for

    L(i, j, k) {
        if i = 0 and (k = j or k = j+1) then { L[i,j,k] = 0; return; }
        if k = j and i < j then {
            if v(a_i, a_j) > 0 then Lpair[i,j,k] = (i, j);
            L[i,j,k] = v(a_i, a_j);
            return;
        } else {
            case1 = L(i-1, j+1, k);
            case2 = R(i-1, j+1, k);
            case3 = M(i-1, j+1, k);
            L[i,j,k] = v(a_i, a_j) + max{case1, case2, case3};
            Lpair[i,j,k] = (i, j) + max{Lpair[i-1,j+1,k], Mpair[i-1,j+1,k], Rpair[i-1,j+1,k]};
            return;
        }
    }

    R(i, j, k) {
        if i = 0 and (k = j or k = j+1) then { R[i,j,k] = 0; return; }
        if k = j+1 then { R[i,j,k] = 0; return; }
        else {
            case1 = L(i, j+1, k-1);
            case2 = R(i, j+1, k-1);
            case3 = M(i, j+1, k-1);
            R[i,j,k] = v(a_j, a_k) + max{case1, case2, case3};
            Rpair[i,j,k] = (j, k) + max{Lpair[i,j+1,k-1], Mpair[i,j+1,k-1], Rpair[i,j+1,k-1]};
            return;
        }
    }

    M(i, j, k) {
        if i = 0 and (k = j or k = j+1) then { M[i,j,k] = 0; return; }
        case1 = M(i-1, j, k);
        case2 = M(i, j+1, k);
        case3 = M(i, j, k-1);
        case4 = L(i-1, j, k);
        case5 = L(i, j+1, k);
        case6 = R(i, j+1, k);
        case7 = R(i, j, k-1);
        M[i,j,k] = max{case1, case2, case3, case4, case5, case6, case7};
        Mpair[i,j,k] = max{Mpair[i-1,j,k], Mpair[i,j+1,k], Mpair[i,j,k-1], Lpair[i-1,j,k], Lpair[i,j+1,k], Rpair[i,j+1,k], Rpair[i,j,k-1]};
        return;
    }

The following is the output of applying the above algorithm to the following sequence:

    a0  a1  a2  a3  a4  a5  a6  a7  a8  a9  a10 a11 a12 a13 a14 a15 a16
    C   U   U   C   A   U   C   A   G   G   A   A   A   U   G   A   C

Tracepair: (1,10), (2,9), (3,8), (5,15), (6,14), (7,13)

The result is the same as shown in Fig. 6.24.
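As a sanity check on the printed Tracepair, the following minimal Python sketch tests whether a set of base pairs satisfies Conditions 1 and 2 of the formal definition of simple pseudoknots, and whether every pair is admissible under the set WW of Step 1. The function names `satisfies_definition` and `find_j1_j2` are illustrative assumptions and are not part of Algorithm 6.3. Positions are taken to be 0-based, as in the printout (a0 to a16), and Condition 1 is read as i < j1 <= j < j2 or j1 <= i < j2 <= j. The sketch only verifies a given structure; it does not compute an optimal one.

```python
# Admissible pairs from Step 1 of Algorithm 6.3 (Watson-Crick plus wobble).
WW = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def satisfies_definition(pairs, j1, j2, lo, hi):
    """Check Conditions 1 and 2 for fixed j1, j2 on positions lo..hi."""
    left, right = [], []  # the two alternatives allowed by Condition 1
    for i, j in pairs:
        if lo <= i < j1 <= j < j2:
            left.append((i, j))
        elif j1 <= i < j2 <= j <= hi:
            right.append((i, j))
        else:
            return False  # Condition 1 is violated
    # Condition 2: inside each group, a larger i must come with a smaller j.
    for group in (left, right):
        group.sort()
        for (_, j_a), (_, j_b) in zip(group, group[1:]):
            if not j_a > j_b:
                return False
    return True


def find_j1_j2(pairs, lo, hi):
    """Return the first (j1, j2) witnessing a simple pseudoknot, or None."""
    ends = [p for pair in pairs for p in pair]
    if len(ends) != len(set(ends)):
        return None  # a base may take part in at most one pair
    for j1 in range(lo + 1, hi + 1):
        for j2 in range(j1 + 1, hi + 1):
            if satisfies_definition(pairs, j1, j2, lo, hi):
                return j1, j2
    return None


seq = "CUUCAUCAGGAAAUGAC"
tracepair = [(1, 10), (2, 9), (3, 8), (5, 15), (6, 14), (7, 13)]

# Every reported pair is a Watson-Crick or wobble pair.
print(all((seq[i], seq[j]) in WW for i, j in tracepair))  # True
# One valid choice of (j1, j2) for the reported structure.
print(find_j1_j2(tracepair, 0, len(seq) - 1))  # (4, 11)
```

The search over all (j1, j2) is quadratic in the sequence length, which is acceptable for checking a single structure. Three mutually crossing pairs such as (0,3), (1,4), (2,5) are rejected, because two of them must fall into the same group and would then violate Condition 2.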
