RNA secondary structures: Introduction, algorithms of Nussinov and Zuker
Dominic Rose
Bioinformatics Group, University of Freiburg
Bioinformatics 2, winter term 2011

Protein-coding vs. non-coding genes

NcRNAs, Example: Regulation by micro RNAs

Non-coding RNAs: key regulators
(figure: Amaral et al., Science, 2008)

Non-coding RNAs: hot stuff
(figure: Taft et al., J Pathol, 2010)

Non-coding RNAs: key regulators
(figure: Eukaryotic genome organization, Taft et al., J Pathol, 2010)

Structured RNAs: examples

Structural conformations of biomolecules
- Primary structure: sequence of monomers, ATCGAGATC...
- Secondary structure: 2D fold, defined by hydrogen bonds
- Tertiary structure: 3D fold
- Quaternary structure: complex arrangement of multiple folded molecules

RNA secondary structure, basics
- RNA is single-stranded: complementary regions let the molecule fold back onto itself.
- Paired bases → double-helical stretches.
- Unpaired bases → loop regions.
- Only canonical Watson-Crick pairs (C≡G, A=U) and G−U wobble pairs are considered. Other non-canonical pairs are neglected.
- RNA exists in a tertiary/quaternary conformation in vivo.
- Its function is mainly determined by its spatial structure.
- 3D interactions usually do not change the 2D structure.
- Both processes can be described independently.
- RNA folding is a hierarchical process in which the secondary structure is broadly considered a sufficient approximation for assessing the most relevant characteristics of an RNA molecule.

Prediction of RNA secondary structure, easy task?

RNAfold < trna.fa
>AF041468
GGGGGUAUAGCUCAGUUGGUAGAGCGCUGCCUUUGCACGGCAGAUGUCAGGGGUUCGAGUCCCCUUACCUCCA
(((((((..((((........)))).(((((.......))))).....(((((.......)))))))))))).
-31.10 kcal/mol

RNA secondary structure elements

RNA secondary structure
Definition (RNA structure): Let $S \in \{A, C, G, U\}^*$ be a sequence. An RNA structure over $S$ is a set of pairs
\[ P \subseteq \{ (i,j) \mid i < j \wedge S_i \text{ and } S_j \text{ form a Watson-Crick or non-standard (G-U) pair} \} \]
with the property that the associated graph has degree $\le 1$.

Remark: degree $\le 1$ means that every base can have at most one bond, i.e. $P$ satisfies the following property:
\[
\begin{aligned}
\forall (i,j):\ (i,j) \in P &\implies \forall j' \neq j:\ (i,j') \notin P \ \wedge \\
\forall (i,j):\ (i,j) \in P &\implies \forall i' \neq i:\ (i',j) \notin P
\end{aligned}
\]

RNA secondary structure representations
(figure: the purine riboswitch, Rfam RF00167)

Prediction of RNA secondary structure, constraints
- RNA secondary structure can formally be described as a list of base pairs (i, j) fulfilling the following constraints:
  1. A base can participate in only one base pair.
     → excludes tertiary structure motifs.
  2. Paired bases must be separated by at least three (unpaired) bases.
     → restricts the bendability of the RNA backbone and defines a minimum loop size of three bases.
  3. Crossing of two base pairs (i, j) and (k, l) in the sense that i < k < j < l is not allowed.
     → excludes pseudoknots.
     → pseudoknot-free structures are called nested.

Prediction of RNA secondary structure, how?
- RNA secondary structure prediction is a traditional bioinformatics problem; the first attempts were made almost 40 years ago [Tinoco et al. 1971].
- Algorithms:
  - Simple RNA folding: base pair maximization (Nussinov, 1978)
  - RNA energy model + energy minimization (Zuker, 1981)
  - Probabilistic analysis: base pair probabilities (McCaskill, 1990)
  - Simultaneous alignment and folding (Sankoff, 1985)

RNA structure decomposition
- Each pseudoknot-free RNA secondary structure has a planar embedding and can be represented as an outer-planar graph (i.e. it can be drawn on paper without intersections).
  (figure: planar vs. not planar)
- A planar graph can be uniquely decomposed into loops constituting the faces (components) of the planar drawing.
- The same holds for RNAs: given a (closing) base pair (i, j), a loop consists of all immediately interior bases k such that i < k < j and there exists no other base pair (p, q) such that i < p < k < q < j.
- RNA secondary structure consists of six major types of loops.
- The number of unpaired nucleotides of a loop denotes its size/length.
- The number of base pairs delimiting a loop (including the closing pair) is termed its degree.

Number of possible structures on subsequence x[i, j]?
Counting structures:
- o (a single base) → $N_{i,i} = 1$
- o o (2 bases) → $N_{i,i+1} = 1$
- o o o (3 bases) → $N_{i,i+2} = 1$
- o o o o (4 bases) → $N_{i,i+3} = 1$ (still no base pair possible, only 1 conformation)
- 5 bases?
  \[
  N_{i,i+4} =
  \begin{cases}
  1 & \text{if } \pi_{x_i,x_{i+4}} = 0 \text{ (unpaired)} \\
  2 & \text{if } \pi_{x_i,x_{i+4}} = 1 \text{ (paired)}
  \end{cases}
  \]
- Fix a position: is it paired or unpaired?
- Number of structures:
  \[
  N_{i,j} = N_{i+1,j} + \sum_{k=i+4}^{j} N_{i+1,k-1} \cdot N_{k+1,j} \cdot \pi_{x_i,x_k}
  \]

RNA structure prediction: Nussinov
- Idea (biological): Stacked base pairs of helical regions are considered to stabilize an RNA molecule.
  → maximize the number of base pairs.
- Idea (algorithmic): the optimal structure $S_{i,j}$ on a subsequence x[i, j] can be formed in only two distinct ways from a shorter subsequence x[i + 1, j]:
  1. Base i is unpaired, followed by an arbitrary shorter structure.
  2. Base i is paired with some partner base k, which requires computing two independent substructures: the structure enclosed by the base pair and the remaining structure behind the pair.

RNA structure prediction: Nussinov
- The maximum base pair problem can be solved by dynamic programming (DP) implementing the following recursion, for $1 \le i < j \le n$:
  \[
  S_{i,j} = \max
  \begin{cases}
  S_{i+1,j} \\
  \max\limits_{k,\ (i,k)\ \text{pairs}} \ S_{i+1,k-1} + S_{k+1,j} + \beta_{ik}
  \end{cases}
  \]
- DP approach: solve the problem for all subproblems of size 1 → the solution is zero.
- Knowing the solutions of all problems of size less than k, compute the solutions of all problems of size k.
- Init: for $1 \le i \le n$: $S_{i,i} = 0$ and $S_{i,i-1} = 0$
- $O(n^3)$ time, $O(n^2)$ space

RNA structure prediction: Nussinov
- Nussinov variant considering (1) i and j forming a pair, (2) i being unpaired, (3) j being unpaired, and (4) bifurcation.
- Init: $\forall i = 1..|S|$: $S_{i,i} = 0$; $\forall i = 1..(|S| - 1)$: $S_{i,i+1} = 0$
- Termination: $S_{1,|S|}$ = maximum number of base pairs

RNA structure prediction: Nussinov, example (slides 1/5 to 5/5)

Nussinov drawbacks

RNA structure prediction: MFE-folding
- More realistic: thermodynamics and statistical mechanics.
- The stability of an RNA secondary structure coincides with its thermodynamic stability.
- It is quantified as the amount of free energy released/used by forming base pairs.
- Solution: minimizing the free energy.
- RNA molecules basically exist in a distribution of structures rather than in a single ground-state conformation.
- "Most likely" conformation: the structure exhibiting the minimum free energy (MFE).
- Energy contributions of different loop types have been measured. Since free energies are additive, a more sophisticated model, the standard energy model for RNA secondary structures, can be proposed.
- Based on loop decomposition, the total energy E of a structure S can be computed as the sum over the energy contributions of each constituent loop l:
  \[ E(S) = \sum_{l \in S} E(l) \]

MFE-folding, example

Structural elements, formal definition

Structural elements, example: k-multiloop

Energies of secondary structure elements

Zuker algorithm

Zuker: recursions, loop decomposition, $V_{i,j}$

Zuker: recursions, multiloop handling, $WM_{i,j}$

Zuker: recursions, assembly of sub-structures, $W_{i,j}$

Zuker: complexity

MFE-folding via loop decomposition: remarks
- First proposed by Zuker; here the RNAfold version: a DP approach in which 4 matrices are used to score
  - the whole structure ($F$),
  - a single sub-structure with a closing base pair ($C$),
  - multi-loops ($M$, $M^1$).
- $M$ and $M^1$ are used to decompose a multi-loop "from the right" (in a deterministic order).
- $F_{i,j}$: free energy of the optimal substructure on the subsequence x[i, j].
- $C_{i,j}$: free energy of the optimal substructure on the subsequence x[i, j] given that i and j form a base pair.
- $M_{i,j}$: free energy of the optimal substructure on the subsequence x[i, j] given that x[i, j] is part of a multi-loop and has at least one "component".
- $M^1_{i,j}$: free energy of the optimal substructure on the subsequence x[i, j] given that x[i, j] is part of a multi-loop and has exactly one component, which has the closing pair (i, h) for some h satisfying $i < h \le j$.

MFE-folding via loop decomposition: recursions
- Linear multiloop energies are assumed: $E_{ML} = a + b \cdot \text{degree} + c \cdot \text{size}$
- a: energy contribution of a closing base pair.
- b: energy contribution of an interior base pair; degree = number of helices (lecture: $k$)
- c: energy contribution of an unpaired base; size = number of unpaired bases (lecture: $k'$)
- Energy of a hairpin closed by the base pair (i, j):
  \[ H(i,j) = a + (j - i - 1) \cdot c \]
- Energy of an interior loop determined by two constituent base pairs (i, j) and (k, l):
  \[ I(i,j;k,l) = 2a + (k - i - 1) \cdot c + (j - l - 1) \cdot c \]

MFE-folding via loop decomposition
- Init: $F_{i,i} = 0$, $C_{i,j} = \infty$, $M_{i,j} = \infty$, and $M^1_{i,j} = \infty$.
- $F_{1,n}$ stores the energy of the thermodynamically most stable structure, its MFE.
- The MFE structure can then be obtained by backtracking from $F_{1,n}$ to the diagonal, following in reverse the path that yielded the MFE value.
- $O(n^2)$ memory
- $O(n^4)$ time, but usually optimized to run in $O(n^3)$ by restricting interior loop sizes to some constant value (e.g. 30).

Literature
- Nussinov algorithm: Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ. Algorithms for loop matchings. SIAM Journal on Applied Mathematics (1978); 35(1): 68-82.
- Zuker algorithm: Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. (1981) Jan; 9(1): 133-48.
- RNAfold: Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie / Chemical Monthly (1994); 125(2): 167-188.
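Supplementary sketch: the counting recursion for $N_{i,j}$ and the Nussinov recursion for $S_{i,j}$ from the slides share the same case split (base i unpaired, or base i paired with some k), so both can be written down in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's reference implementation: the function names `can_pair`, `count_structures` and `nussinov` are made up here, indices are 0-based, every pair is scored with $\beta_{ik} = 1$, and the minimum loop size of three bases is applied to both recursions through the hypothetical parameter `min_loop`.

```python
from functools import lru_cache

# Watson-Crick pairs plus the G-U wobble pair (the pairs allowed in the slides).
PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def can_pair(a, b):
    """Indicator pi / beta: True if bases a and b may form a pair."""
    return a + b in PAIRS


def count_structures(seq, min_loop=3):
    """Number of nested structures on seq, following the N recursion.

    Recursive and memoized, so intended for short sequences only.
    """
    @lru_cache(maxsize=None)
    def N(i, j):
        if j - i <= min_loop:          # too short (or empty): only the open chain
            return 1
        total = N(i + 1, j)            # case 1: base i is unpaired
        for k in range(i + min_loop + 1, j + 1):   # case 2: i pairs with k
            if can_pair(seq[i], seq[k]):
                total += N(i + 1, k - 1) * N(k + 1, j)
        return total

    return N(0, len(seq) - 1)


def nussinov(seq, min_loop=3):
    """Base pair maximization; returns (max. number of pairs, dot-bracket string)."""
    n = len(seq)
    # S[i][j] = max. number of pairs on seq[i..j]; the extra rows keep the
    # empty intervals S[i+1][i] and S[j+1][j] addressable (they stay 0).
    S = [[0] * (n + 1) for _ in range(n + 2)]
    for span in range(min_loop + 1, n):            # fill by increasing length
        for i in range(n - span):
            j = i + span
            best = S[i + 1][j]                     # base i unpaired
            for k in range(i + min_loop + 1, j + 1):
                if can_pair(seq[i], seq[k]):       # base i paired with k
                    best = max(best, S[i + 1][k - 1] + S[k + 1][j] + 1)
            S[i][j] = best

    # Backtracking: re-derive which case produced each optimal value.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if S[i][j] == S[i + 1][j]:                 # prefer "i unpaired" on ties
            stack.append((i + 1, j))
            continue
        for k in range(i + min_loop + 1, j + 1):
            if can_pair(seq[i], seq[k]) and \
                    S[i][j] == S[i + 1][k - 1] + S[k + 1][j] + 1:
                pairs.append((i, k))
                stack.append((i + 1, k - 1))
                stack.append((k + 1, j))
                break

    dot_bracket = ["."] * n
    for i, k in pairs:
        dot_bracket[i], dot_bracket[k] = "(", ")"
    return (S[0][n - 1] if n else 0), "".join(dot_bracket)
```

For example, `nussinov("GGGAAAUCC")` returns `(3, "(((...)))")`, and `count_structures("GAAAC")` returns `2`, matching the 5-base case of the counting slide. Several structures can reach the same maximum; this backtracking reports only one of them, which is one of the reasons base pair maximization is a coarse model compared with MFE folding.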