RNA secondary structures
Introduction, algorithms of Nussinov and Zuker
Dominic Rose
Bioinformatics Group, University of Freiburg
Bioinformatics 2, winter term 2011
Protein-coding vs. non-coding genes
NcRNAs, Example: Regulation by micro RNAs
Non-coding RNAs: key regulators
Amaral et al., Science, 2008
Non-coding RNAs: hot stuff
Taft et al., J Pathol, 2010
Non-coding RNAs: key regulators
Eukaryotic genome organization
Taft et al., J Pathol, 2010
Structured RNAs: examples
Structural conformations of biomolecules
- Primary structure: sequence of monomers, e.g. ATCGAGATC...
- Secondary structure: 2D fold, defined by hydrogen bonds
- Tertiary structure: 3D fold
- Quaternary structure: complex arrangement of multiple folded molecules
RNA secondary structure, basics.
- RNA is single-stranded: the molecule folds back onto itself where regions are complementary.
  Paired bases → double-helical stretches.
  Unpaired bases → loop regions.
- Only canonical Watson-Crick pairs (C≡G, A=U) and G−U wobble pairs are considered.
  Other non-canonical pairs are neglected.
- RNA exists in tertiary/quaternary conformation in vivo.
- Its function is mainly determined by its spatial structure.
- 3D interactions usually do not change the 2D structure.
- Both processes can be described independently.
- RNA folding is a hierarchical process; secondary structure is broadly considered a sufficient approximation for assessing the most relevant characteristics of an RNA molecule.
Prediction of RNA secondary structure, easy task?
RNAfold < trna.fa
>AF041468
GGGGGUAUAGCUCAGUUGGUAGAGCGCUGCCUUUGCACGGCAGAUGUCAGGGGUUCGAGUCCCCUUACCUCCA
(((((((..((((........)))).(((((.......))))).....(((((.......)))))))))))).
-31.10 kcal/mol
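The dot-bracket line printed by RNAfold encodes the base pairs: matching parentheses mark paired positions and dots mark unpaired bases. A minimal sketch of reading this notation with a stack follows. The function name `dot_bracket_to_pairs` and the 0-based indexing are choices made here for illustration; they are not part of RNAfold.

```python
def dot_bracket_to_pairs(structure):
    """Return the sorted list of base pairs (i, j), i < j, 0-based."""
    stack = []   # positions of '(' that are still waiting for a partner
    pairs = []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(dot_bracket_to_pairs("((...))"))  # [(0, 6), (1, 5)]
```

Because parentheses must nest, any structure written this way is automatically free of crossing pairs.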
RNA secondary structure elements
RNA secondary structure
Definition (RNA structure)
Let $S \in \{A, C, G, U\}^*$ be a sequence. An RNA structure over $S$ is a set of pairs
\[
P \subseteq \{\, (i, j) \mid i < j \ \wedge\ S_i \text{ and } S_j \text{ form a Watson-Crick or non-standard (G-U) pair} \,\}
\]
with the property that the associated graph has degree $\le 1$.
Remark:
degree $\le 1$: every base can have at most one bond, i.e. $P$ satisfies the following property:
\[
\forall (i, j):\ (i, j) \in P \implies \forall j' \ne j:\ (i, j') \notin P
\quad\wedge\quad
\forall (i, j):\ (i, j) \in P \implies \forall i' \ne i:\ (i', j) \notin P
\]
RNA secondary structure representations
the purine riboswitch (Rfam RF00167)
Prediction of RNA secondary structure, constraints
- RNA secondary structure can formally be described as a list of base pairs (i, j) fulfilling the following constraints:
  1. A base can participate in at most one base pair.
     → excludes tertiary structure motifs.
  2. Paired bases must be separated by at least three (unpaired) bases.
     → restricts the bendability of the RNA backbone and defines a minimum loop size of three bases.
  3. Two base pairs (i, j) and (k, l) must not cross, i.e. i < k < j < l is not allowed.
     → excludes pseudoknots.
     → pseudoknot-free structures are called nested.
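The three constraints translate directly into a validity check for a candidate list of base pairs. The following is a sketch under stated assumptions: positions are 0-based, the name `is_valid_structure` and the `min_loop` parameter are introduced here for illustration, and the set of allowed pairs is the canonical set plus G-U named earlier.

```python
# Watson-Crick pairs plus the G-U wobble pair, in both orientations.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def is_valid_structure(seq, pairs, min_loop=3):
    """Check a list of 0-based pairs (i, j) against the three constraints."""
    used = set()
    for i, j in pairs:
        if not (0 <= i < j < len(seq)):
            return False
        if (seq[i], seq[j]) not in CANONICAL:
            return False
        if j - i - 1 < min_loop:        # constraint 2: minimum loop size
            return False
        if i in used or j in used:      # constraint 1: at most one partner per base
            return False
        used.update((i, j))
    for i, j in pairs:                  # constraint 3: no crossing pairs
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True


print(is_valid_structure("GGGAAACCC", [(0, 8), (1, 7), (2, 6)]))  # True
print(is_valid_structure("GAAAGCAAAC", [(0, 5), (4, 9)]))         # False: the pairs cross
```

The quadratic crossing test keeps the sketch short; it is adequate for checking single structures but not tuned for speed.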
Prediction of RNA secondary structure, how?
- RNA secondary structure (SS) prediction is a traditional bioinformatics problem; the first attempts were made almost 40 years ago [Tinoco et al. 1971].
- Algorithms:
  - Simple RNA folding: base pair maximization (Nussinov, 1978)
  - RNA energy model + energy minimization (Zuker, 1981)
  - Probabilistic analysis: base pair probabilities (McCaskill, 1990)
  - Simultaneous alignment and folding (Sankoff, 1985)
RNA structure decomposition
- Each pseudoknot-free RNA secondary structure has a planar embedding: it can be represented as an outerplanar graph (i.e. it can be drawn on paper without intersecting edges).
  (Figure labels: planar, not planar)
- A planar graph can be uniquely decomposed into loops, which constitute the faces (components) of the planar drawing.
- The same holds for RNAs: given a (closing) base pair (i, j), a loop consists of all immediately interior bases k, i.e. those with i < k < j for which there exists no other base pair (p, q) such that i < p < k < q < j.
RNA structure decomposition
- An RNA secondary structure consists of six major types of loops.
- The number of unpaired nucleotides in a loop is called its size (length).
- The number of base pairs delimiting a loop (including the closing pair) is called its degree.
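The loop decomposition, with size and degree as defined above, can be sketched in a few lines. This example reads a dot-bracket string with 0-based positions and reports one loop per closing base pair; the exterior loop, which has no closing pair, is left out. The function names and the coarse labels returned by `classify` are assumptions made for illustration, not the slide's own naming.

```python
def loops(structure):
    """Return one (closing_pair, degree, size) triple per loop, 0-based positions."""
    partner = {}   # opening position -> closing position
    stack = []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            partner[stack.pop()] = pos
    result = []
    for i, j in sorted(partner.items()):
        degree, size = 1, 0          # the closing pair counts towards the degree
        k = i + 1
        while k < j:
            if k in partner:         # immediately interior pair: count it, skip its inside
                degree += 1
                k = partner[k] + 1
            else:                    # immediately interior unpaired base
                size += 1
                k += 1
        result.append(((i, j), degree, size))
    return result


def classify(degree, size):
    """Coarse loop label derived from degree and size only."""
    if degree == 1:
        return "hairpin loop"
    if degree == 2:
        return "stacked pair" if size == 0 else "bulge or interior loop"
    return "multiloop"


for pair, degree, size in loops("((..((...))..((...))..))"):
    print(pair, degree, size, classify(degree, size))
```

Telling a bulge from an interior loop would additionally require knowing on which side of the inner pair the unpaired bases lie, which this sketch does not track.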
Number of possible structures on subsequence x[i, j]?
Counting structures:
- o (a single base) → $N_{i,i} = 1$
- o o (2 bases) → $N_{i,i+1} = 1$
- o o o (3 bases) → $N_{i,i+2} = 1$
- o o o o (4 bases) → $N_{i,i+3} = 1$
  (still no base pair possible, only 1 conformation)
- 5 bases?
  \[
  N_{i,i+4} =
  \begin{cases}
  1 & \text{if } \pi_{x_i,x_{i+4}} = 0 \text{ (unpaired)} \\
  2 & \text{if } \pi_{x_i,x_{i+4}} = 1 \text{ (paired)}
  \end{cases}
  \]
- Fix a position: is it paired or unpaired?
- Number of structures:
  \[
  N_{i,j} = N_{i+1,j} + \sum_{k=i+4}^{j} N_{i+1,k-1} \cdot N_{k+1,j} \cdot \pi_{x_i,x_k}
  \]
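The counting recursion can be evaluated directly with memoization. The sketch below assumes 0-based positions, takes the pairing indicator π to be 1 for Watson-Crick and G-U pairs, and uses the convention that an empty subsequence has exactly one structure (needed when k = j in the sum). The name `count_structures` and the `min_loop` parameter are introduced here for illustration; plain recursion is only suitable for short sequences.

```python
from functools import lru_cache

# Pairing indicator pi: these ordered pairs count as 1, everything else as 0.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def count_structures(x, min_loop=3):
    """Number of secondary structures on x with at least min_loop unpaired bases per hairpin."""

    @lru_cache(maxsize=None)
    def N(i, j):
        # Fewer than min_loop + 2 bases (including the empty interval): open chain only.
        if j - i < min_loop + 1:
            return 1
        total = N(i + 1, j)                           # base i is unpaired
        for k in range(i + min_loop + 1, j + 1):      # base i pairs with base k
            if (x[i], x[k]) in PAIRS:
                total += N(i + 1, k - 1) * N(k + 1, j)
        return total

    return N(0, len(x) - 1)


print(count_structures("GAAA"))     # 1
print(count_structures("GAAAC"))    # 2
print(count_structures("GGAAACC"))  # 6
```

For `GGAAACC` the six structures are the open chain, the four single pairs between a G and a C, and the nested double pair formed by the outermost and second-outermost positions.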
RNA structure prediction: Nussinov
- Idea (biological): stacked base pairs in helical regions are considered to stabilize an RNA molecule.
  → maximize the number of base pairs.
- Idea (algorithmic): the optimal structure $S_{i,j}$ on a subsequence x[i, j] can be formed in only two distinct ways from solutions on shorter subsequences:
  1. Base i is unpaired, followed by an arbitrary structure on the shorter subsequence x[i + 1, j].
  2. Base i is paired with some partner base k, which requires computing two independent substructures: the structure enclosed by the base pair and the remaining structure behind the pair.
RNA structure prediction: Nussinov
- The maximum base pair problem can be solved using DP implementing the following recursion for $1 \le i < j \le n$:
  \[
  S_{i,j} = \max \left\{ S_{i+1,j},\ \max_{i < k \le j,\ (i,k)\ \text{pairs}} \left( S_{i+1,k-1} + S_{k+1,j} + \beta_{ik} \right) \right\}
  \]
- DP approach: solve the problem for all subproblems of size 1
  → the solution is zero.
- Knowing the solutions of all problems of size less than m, compute the solutions of all problems of size m.
- Init: for $1 \le i \le n$: $S_{i,i} = 0$ and $S_{i,i-1} = 0$
- $O(n^3)$ time, $O(n^2)$ space
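A compact sketch of this recursion, filling the table by increasing subsequence length and then tracing back one optimal structure, could look as follows. Assumptions made here: positions are 0-based, every allowed pair scores β = 1, and a minimum hairpin loop of three bases is enforced via `min_loop` (the recursion above does not state it explicitly). All function names are illustrative.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def _get(S, i, j):
    """Table lookup with S[i][i-1] = 0 for empty subsequences."""
    return S[i][j] if i <= j else 0


def nussinov(x, min_loop=3):
    """Fill S[i][j] = maximum number of base pairs on x[i..j]."""
    n = len(x)
    S = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):                    # subproblems of growing size
        for i in range(n - length + 1):
            j = i + length - 1
            best = _get(S, i + 1, j)                  # case 1: base i unpaired
            for k in range(i + min_loop + 1, j + 1):  # case 2: base i pairs with k
                if (x[i], x[k]) in PAIRS:
                    best = max(best, _get(S, i + 1, k - 1) + _get(S, k + 1, j) + 1)
            S[i][j] = best
    return S


def traceback(x, S, min_loop=3):
    """Recover one optimal set of base pairs from the filled table."""
    pairs, todo = [], [(0, len(x) - 1)]
    while todo:
        i, j = todo.pop()
        if i >= j:
            continue
        if S[i][j] == _get(S, i + 1, j):              # base i was left unpaired
            todo.append((i + 1, j))
            continue
        for k in range(i + min_loop + 1, j + 1):
            if (x[i], x[k]) in PAIRS and \
                    S[i][j] == _get(S, i + 1, k - 1) + _get(S, k + 1, j) + 1:
                pairs.append((i, k))
                todo.extend([(i + 1, k - 1), (k + 1, j)])
                break
    return sorted(pairs)


seq = "GGGAAACCC"
table = nussinov(seq)
print(table[0][len(seq) - 1])   # 3
print(traceback(seq, table))    # [(0, 8), (1, 7), (2, 6)]
```

The two nested loops over subsequences plus the inner loop over k give the cubic running time; the table itself accounts for the quadratic space.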
RNA structure prediction: Nussinov
- Nussinov variant considering four cases: (1) i and j pair, (2) i is unpaired, (3) j is unpaired, and (4) bifurcation.
- Init: $\forall i = 1..|S|$: $S_{i,i} = 0$; $\forall i = 1..(|S| - 1)$: $S_{i,i+1} = 0$
- Termination: $S_{1,|S|}$ = maximum number of base pairs
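Written out, the four cases listed above correspond to the form of the recursion commonly given in textbooks. The following is a sketch in that notation: the pair score $\delta(i,j)$, equal to 1 if $x_i$ and $x_j$ can pair and 0 otherwise, is a symbol assumed here and plays the role of $\beta$.

```latex
\[
S_{i,j} = \max
\begin{cases}
S_{i+1,j-1} + \delta(i,j) & \text{(1) } i \text{ and } j \text{ pair} \\
S_{i+1,j}                 & \text{(2) } i \text{ unpaired} \\
S_{i,j-1}                 & \text{(3) } j \text{ unpaired} \\
\max_{i < k < j} \left( S_{i,k} + S_{k+1,j} \right) & \text{(4) bifurcation}
\end{cases}
\]
```

With the initialization above, the table is filled for $j \ge i + 2$, so every term on the right-hand side refers to an entry that is already known.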
RNA structure prediction: Nussinov, example (steps 1/5 to 5/5)
Nussinov drawbacks
RNA structure prediction: MFE-folding
- More realistic: thermodynamics and statistical mechanics.
- The stability of an RNA secondary structure coincides with its thermodynamic stability.
- Quantified as the amount of free energy released or used by forming base pairs.
- Solution: minimize the free energy.
RNA structure prediction: MFE-folding
- RNA molecules exist in a distribution of structures rather than in a single ground-state conformation.
- "Most likely" conformation: the structure exhibiting the minimum free energy (MFE).
- Energy contributions of different loop types have been measured.
  Since free energies are additive, a more sophisticated model, the standard energy model for RNA secondary structure, can be proposed.
- Based on loop decomposition, the total energy E of a structure S can be computed as the sum of the energy contributions of its constituent loops l:
  \[
  E(S) = \sum_{l \in S} E(l)
  \]
MFE-folding, example
Structural elements, formal definition
Structural elements, example: k-multiloop
Energies of secondary structure elements
Zuker algorithm
Zuker: recursions, loop decomposition, Vi,j
Zuker: recursions, multiloop handling, WMi,j
Zuker: recursions, assembly of sub-structures, Wi,j
Zuker: complexity
MFE-folding via loop decomposition: remarks
- First proposed by Zuker; shown here is the RNAfold version.
- DP approach with 4 matrices, used to score:
  - the whole structure ($F$)
  - a single substructure ($C$) with a closing base pair
  - multi-loops ($M$, $M^1$)
- $M$ and $M^1$ are used to decompose a multi-loop "from the right" (in a deterministic order).
MFE-folding via loop decomposition: recursions
MFE-folding via loop decomposition: remarks
- $F_{i,j}$: free energy of the optimal substructure on the subsequence x[i, j].
- $C_{i,j}$: free energy of the optimal substructure on the subsequence x[i, j], given that i and j form a base pair.
- $M_{i,j}$: free energy of the optimal substructure on the subsequence x[i, j], given that x[i, j] is part of a multi-loop and has at least one "component".
- $M^1_{i,j}$: free energy of the optimal substructure on the subsequence x[i, j], given that x[i, j] is part of a multi-loop and has exactly one component, which has the closing pair (i, h) for some h satisfying i < h ≤ j.
MFE-folding via loop decomposition: recursions
- Linear multiloop energies are assumed:
  \[
  E_{ML} = a + b \cdot \text{degree} + c \cdot \text{size}
  \]
  - a: energy contribution of a closing base pair.
  - b: energy contribution of an interior base pair; degree = number of helices (lecture: $k$).
  - c: energy contribution of an unpaired base; size = number of unpaired bases (lecture: $k'$).
- Energy of a hairpin closed by the base pair (i, j):
  \[
  H(i, j) = a + (j - i - 1) \cdot c
  \]
- Energy of an interior loop determined by its two constituent base pairs (i, j) and (k, l):
  \[
  I(i, j; k, l) = 2a + (k - i - 1) \cdot c + (j - l - 1) \cdot c
  \]
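To make the three formulas concrete, they can be wrapped in a small class and evaluated for example loops. This is only a sketch: the class name `LinearLoopModel` and the numeric values chosen for a, b and c are hypothetical and carry no thermodynamic meaning; real energy parameters are measured and tabulated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearLoopModel:
    a: float  # contribution of a closing base pair
    b: float  # contribution of an interior base pair
    c: float  # contribution of an unpaired base

    def multiloop(self, degree, size):
        """E_ML = a + b * degree + c * size."""
        return self.a + self.b * degree + self.c * size

    def hairpin(self, i, j):
        """H(i, j) = a + (j - i - 1) * c."""
        return self.a + (j - i - 1) * self.c

    def interior(self, i, j, k, l):
        """I(i, j; k, l) = 2a + (k - i - 1) * c + (j - l - 1) * c."""
        return 2 * self.a + (k - i - 1) * self.c + (j - l - 1) * self.c


# Hypothetical parameter values, chosen only so the arithmetic is easy to follow.
model = LinearLoopModel(a=1.0, b=0.5, c=0.25)

print(model.hairpin(0, 4))           # 1.75 = 1.0 + 3 * 0.25
print(model.interior(0, 10, 2, 7))   # 2.75 = 2.0 + 1 * 0.25 + 2 * 0.25
print(model.multiloop(3, 6))         # 4.0  = 1.0 + 3 * 0.5 + 6 * 0.25

# Additivity: the energy of a structure is the sum over its loops.
print(model.interior(0, 10, 2, 7) + model.hairpin(2, 7))  # 4.75
```

Because every term is linear in the number of unpaired bases, a multi-loop can be scored piece by piece during the DP without knowing its full shape in advance.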
MFE-folding via loop decomposition
- Init: $F_{i,i} = 0$, $C_{i,j} = \infty$, $M_{i,j} = \infty$, and $M^1_{i,j} = \infty$.
- $F_{1,n}$ stores the energy value of the thermodynamically most stable structure, its MFE.
- The MFE structure can then be obtained by backtracking from $F_{1,n}$ to the diagonal, following in reverse the path that yielded the MFE value.
- $O(n^2)$ memory
- $O(n^4)$ time, but usually optimized to run in $O(n^3)$ by restricting interior loop sizes to some constant value (e.g. 30).
Literature
Nussinov algorithm:
Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ. Algorithms for loop matchings. SIAM Journal on Applied Mathematics (1978) 35(1): 68-82.
Zuker algorithm:
Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. (1981) 9(1): 133-48.
RNAfold:
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie / Chemical Monthly (1994) 125(2): 167-188.